How to locate a Linux kernel crash: CPU 0 Unable to handle kernel paging request at virtual address

Linux crashed while running and printed the following:

CPU 0 Unable to handle kernel paging request at virtual address 0000000000000318, epc == ffffffffc0445a10, ra == ffffffffc04459dc
Oops[#1]:
Cpu 0
$ 0   : 0000000000000000 ffffffff808b1da0 0000000000000300 0000000000000030
$ 4   : 0000000000000000 a8000000029d2160 000000000000002e a800000002559000
$ 8   : a8000000029d2140 0000000000000001 0000000000000000 0000000000000018
$12   : 0000000000000000 000000001000001f a800000031180000 0000000000000000
$16   : a8000000029d214e 0000000000000300 a8000000012d1600 a8000000029c8580
$20   : a8000000012d1870 ffffffff812408e8 0000000000000806 0000000000000000
$24   : 00000000000002b1 000000555d5887b0                                  
$28   : ffffffff811c4000 ffffffff811c7970 ffffffff811c7970 ffffffffc04459dc
Hi    : 0000000000000000
Lo    : 0000000000000000
epc   : ffffffffc0445a10 rlb_arp_recv+0x128/0x228 [bonding]
    Tainted: P          
ra    : ffffffffc04459dc rlb_arp_recv+0xf4/0x228 [bonding]
Status: 1010cce3    KX SX UX KERNEL EXL IE 
Cause : 00800008
BadVA : 0000000000000318
PrId  : 000d9202 (Cavium Octeon II)
Modules linked in: bonding run(P) raid vscsih iscsitgt disk vdisk cache(P) service gmeta mpt2sas netlink bubble platform octeon_ethernet at24
Process swapper (pid: 0, threadinfo=ffffffff811c4000, task=ffffffff811e5280, tls=0000000000000000)
Stack : 0000000000000003 ffffffff81241498 ffffffff812414d8 a8000000029c8580
        a8000000029c8644 a800000002559000 ffffffff811c79b0 ffffffff807a7648
        000d0300000d0300 ffffffff808b22e0 000000000000003c a800000002559600
        a8000000029c8580 a800000002b7d280 0000000000000000 0000000000000001
        0000000000000001 0000000000000001 ffffffff811c7a10 ffffffffc0010154
        ffffffff811c7b80 ffffffff802d22e8 0000000000000000 ffffffff80356140
        0000000000000000 0000000000000000 8001670000000000 0000000000000001
        0000000000000003 0000000000000001 0000000000000000 000000000000ffff
        0000000000000000 ffffffffc001ac00 0000000000000020 000000011000001f
        a800000031180000 0000000000000000 ffffffff811d2a00 8001670000000100
        ...
Call Trace:
[] rlb_arp_recv+0x128/0x228 [bonding]
[] netif_receive_skb+0x3f0/0x4d8
[] cvm_oct_napi_poll_38+0x7ac/0x10e8 [octeon_ethernet]
[] net_rx_action+0x128/0x280
[] __do_softirq+0x130/0x248
[] do_softirq+0x88/0x90
[] irq_exit+0x70/0x88
[] do_IRQ+0x48/0x60
[] octeon_irq_ip2_ciu+0x94/0xb8
[] plat_irq_dispatch+0x80/0xd0
[] ret_from_irq+0x0/0x4
[] r4k_wait+0x20/0x40
[] cpu_idle+0x84/0xa0
[] rest_init+0x80/0x98
[] start_kernel+0x37c/0x4c4

Code: de440268  70431003  0082882d <92230018> 10600007  3c02808b  8a020018  8e230000  9a02001b 
Kernel panic - not syncing: Fatal exception in interrupt

*** NMI Watchdog interrupt on Core 0x01 ***
 $0 0x0000000000000000 at 0xffffffff803471bc
 v0 0xffffffff802d24c0 v1 0x0000000000000001
 a0 0xfffffffffffffffd a1 0x0000000000000000
 a2 0xffffffff812403c8 a3 0x0000000000000001
 a4 0x0000000000000800 a5 0x0000000000000020
 a6 0x0000000000000000 a7 0x000000aaab43b498
 t0 0x0000000000000000 t1 0x000000001000001f
 t2 0xa800000031188000 t3 0x0000000000000000
 s0 0xffffffff853e0000 s1 0xffffffff853f0000
 s2 0xffffffff811c8980 s3 0x0000000000000000
 s4 0x0000000000000002 s5 0x0000000000200200
 s6 0xffffffff811c8990 s7 0xffffffff811287d0
 t8 0x0000000000000000 t9 0x0000005561b7f7b0
 k0 0x0000000000000000 k1 0x0000000000000000
 gp 0xa8000000310fc000 sp 0xa8000000310ffb10
 s8 0xa8000000310ffb10 ra 0xffffffff802dbc18
 err_epc 0xffffffff802d24e0 epc 0xffffffff802d24e0
 status 0x000000001058cce4 cause 0x0000000040808800
 sum0 0x0000000000000000 en0 0x0000000000000000
*** Chip soft reset soon ***

The key lines are these:

epc   : ffffffffc0445a10 rlb_arp_recv+0x128/0x228
Call Trace:
[] rlb_arp_recv+0x128/0x228 [bonding]
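Each entry reads symbol+0xOFFSET/0xSIZE [module]: the faulting instruction is 0x128 bytes into rlb_arp_recv, the function is 0x228 bytes long, and it lives in the bonding module. As a minimal sketch (the helper name parse_trace_entry is hypothetical, and the regular expression only assumes the entry format printed above), the fields can be pulled out like this:

```python
import re

# Matches entries such as "rlb_arp_recv+0x128/0x228 [bonding]".
# The "[module]" suffix is optional: built-in kernel symbols do not have one.
ENTRY_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x(?P<offset>[0-9a-fA-F]+)"
    r"/0x(?P<size>[0-9a-fA-F]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_trace_entry(text):
    """Extract symbol, offset, size and module from one oops line."""
    match = ENTRY_RE.search(text)
    if match is None:
        raise ValueError("no symbol+offset/size entry in: %r" % text)
    return {
        "symbol": match.group("symbol"),
        "offset": int(match.group("offset"), 16),
        "size": int(match.group("size"), 16),
        "module": match.group("module"),  # None for built-in symbols
    }


entry = parse_trace_entry(
    "epc   : ffffffffc0445a10 rlb_arp_recv+0x128/0x228 [bonding]"
)
print(entry["symbol"], hex(entry["offset"]), hex(entry["size"]), entry["module"])
```

The module name tells you which .ko file to disassemble, and the offset is what gets added to the function's base address in the next step.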

Disassemble the .ko module in which the crash occurred:

mips64-octeon-linux-gnu-objdump -S  bonding.ko

Search the output for the base address of rlb_arp_recv, then compute the crash location:
000000000000e8e8 <rlb_arp_recv>:

0xe8e8 + 0x128 = 0xea10
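The same lookup can be scripted. objdump prints each function header as an address followed by the symbol name in angle brackets, so the sketch below searches a listing for that header and adds the offset from the oops. The helper names symbol_base and fault_address are hypothetical, and the sample text is only a three-line excerpt standing in for the full objdump output. The last lines cross-check the run-time addresses: subtracting each offset from epc and ra should give the same function start.

```python
import re


def symbol_base(listing, symbol):
    """Return the address on the "<addr> <symbol>:" header line of a listing."""
    pattern = re.compile(
        r"^([0-9a-fA-F]+) <" + re.escape(symbol) + r">:", re.MULTILINE
    )
    match = pattern.search(listing)
    if match is None:
        raise KeyError(symbol)
    return int(match.group(1), 16)


def fault_address(listing, symbol, offset):
    """Address inside the listing that corresponds to symbol+offset."""
    return symbol_base(listing, symbol) + offset


# Excerpt standing in for the output of objdump -S bonding.ko.
sample = """\
000000000000e8e8 <rlb_arp_recv>:
    ea0c:       0082882d        daddu   s1,a0,v0
    ea10:       92230018        lbu     v1,24(s1)
"""

addr = fault_address(sample, "rlb_arp_recv", 0x128)
print(hex(addr))  # 0xea10

# Cross-check with the run-time values from the oops:
# epc = rlb_arp_recv+0x128 and ra = rlb_arp_recv+0xf4.
start_from_epc = 0xFFFFFFFFC0445A10 - 0x128
start_from_ra = 0xFFFFFFFFC04459DC - 0xF4
print(hex(start_from_epc), start_from_epc == start_from_ra)
```

Both subtractions give the same loaded address for rlb_arp_recv, which confirms that epc and ra are offsets within the same function.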

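That is, the crash is at the source line if ((client_info->assigned) &&, whose first instruction sits at ea10. Its encoding, 92230018, matches the word marked <92230018> in the Code: line of the oops: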

 _lock_rx_hashtbl(bond);

        hash_index = _simple_hash((u8*)&(arp->ip_src), sizeof(arp->ip_src));
        client_info = &(bond_info->rx_hashtbl[hash_index]);
    e9fc:       7c82f803        dext    v0,a0,0x0,0x20
    ea00:       24030030        li      v1,48
    ea04:       de440268        ld      a0,616(s2)
    ea08:       70431003        dmul    v0,v0,v1
    ea0c:       0082882d        daddu   s1,a0,v0

        if ((client_info->assigned) &&
    ea10:       92230018        lbu     v1,24(s1)
    ea14:       10600007        beqz    v1,ea34 
    ea18:       3c020000        lui     v0,0x0
    ea1c:       8a020018        lwl     v0,24(s0)
    ea20:       8e230000        lw      v1,0(s1)
    ea24:       9a02001b        lwr     v0,27(s0)
    ea28:       10620019        beq     v1,v0,ea90 
    ea2c:       00000000        nop
        spin_lock_bh(&(BOND_ALB_INFO(bond).rx_hashtbl_lock));

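epc is the exception program counter: the address of the instruction that triggered the exception. ra is the return address register ($31): the address that the most recent call will return to.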
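The register dump can be used to confirm this reading. The sketch below decodes the marked word 92230018 as a MIPS I-type instruction and recomputes the address it tried to read. The register values are copied from the oops ($2, $3, $4 and $17, which are v0, v1, a0 and s1 in the n64 naming the listing uses); the opcode table covers only a few load instructions and the helper name decode_itype is hypothetical.

```python
# A few MIPS load opcodes (bits 31..26), enough for this listing.
LOAD_OPCODES = {0x20: "lb", 0x23: "lw", 0x24: "lbu", 0x25: "lhu", 0x37: "ld"}


def decode_itype(word):
    """Split a 32-bit I-type word into (opcode, rs, rt, signed immediate)."""
    opcode = (word >> 26) & 0x3F
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:  # sign-extend the 16-bit immediate
        imm -= 0x10000
    return opcode, rs, rt, imm


# Values taken from the register dump in the oops.
regs = {2: 0x300, 3: 0x30, 4: 0x0, 17: 0x300}
bad_va = 0x318

opcode, rs, rt, imm = decode_itype(0x92230018)
mnemonic = LOAD_OPCODES[opcode]
fault = regs[rs] + imm

print(mnemonic, "rt=$%d" % rt, "offset=%d" % imm, "base=$%d" % rs)
print(hex(fault), fault == bad_va)

# s1 was computed as a0 + v0 (daddu s1,a0,v0) just before the fault.
print(regs[4] + regs[2] == regs[17])
```

The decoded instruction is lbu with base register $17 (s1) and offset 24, and 0x300 + 24 equals BadVA, 0x318. Since a0 ($4) is still 0 and v0 ($2) is 0x300 (hash index 16 times the 48-byte entry size), the table pointer loaded by ld a0,616(s2) appears to have been NULL, so client_info pointed at 0x300 rather than at a real rx_hashtbl entry. The dump alone does not say why the pointer was NULL; that has to be traced in the bonding source.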