How do you diagnose a kernel oops?
Given a Linux kernel oops, how do you go about diagnosing the problem? In the output I can see a stack trace, which seems to give some clues. Are there any tools that would help find the problem? What basic procedures do you follow to track it down?
Unable to handle kernel paging request for data at address 0x33343a31
Faulting instruction address: 0xc50659ec
Oops: Kernel access of bad area, sig: 11 [#1]
tpsslr3
Modules linked in: datalog(P) manet(P) vnet wlan_wep wlan_scan_sta ath_rate_sample ath_pci wlan ath_hal(P)
NIP: c50659ec LR: c5065f04 CTR: c00192e8
REGS: c2aff920 TRAP: 0300 Tainted: P (220.127.116.11-dirty)
MSR: 00009032 CR: 22082444 XER: 20000000
DAR: 33343a31, DSISR: 20000000
TASK = c2e6e3f0 'datalogd' THREAD: c2afe000
GPR00: c5065f04 c2aff9d0 c2e6e3f0 00000000 00000001 00000001 00000000 0000b3f9
GPR08: 3a33340a c5069624 c5068d14 33343a31 82082482 1001f2b4 c1228000 c1230000
GPR16: c60f0000 000004a8 c59abbe6 0000002f c1228360 c340d6b0 c5070000 00000001
GPR24: c2aff9e0 c5070000 00000000 00000000 00000003 c2cc2780 c2affae8 0000000f
NIP [c50659ec] mesh_packet_in+0x3d8/0xdac [manet]
LR [c5065f04] mesh_packet_in+0x8f0/0xdac [manet]
Call Trace:
[c2aff9d0] [c5065f04] mesh_packet_in+0x8f0/0xdac [manet] (unreliable)
[c2affad0] [c5061ff8] IF_netif_rx+0xa0/0xb0 [manet]
[c2affae0] [c01925e4] netif_receive_skb+0x34/0x3c4
[c2affb10] [c60b5f74] netif_receive_skb_debug+0x2c/0x3c [wlan]
[c2affb20] [c60bc7a4] ieee80211_deliver_data+0x1b4/0x380 [wlan]
[c2affb60] [c60bd420] ieee80211_input+0xab0/0x1bec [wlan]
[c2affbf0] [c6105b04] ath_rx_poll+0x884/0xab8 [ath_pci]
[c2affc90] [c018ec20] net_rx_action+0xd8/0x1ac
[c2affcb0] [c00260b4] __do_softirq+0x7c/0xf4
[c2affce0] [c0005754] do_softirq+0x58/0x5c
[c2affcf0] [c0025eb4] irq_exit+0x48/0x58
[c2affd00] [c000627c] do_IRQ+0xa4/0xc4
[c2affd10] [c00106f8] ret_from_except+0x0/0x14
--- Exception: 501 at __delay+0x78/0x98
    LR = cfi_amdstd_write_buffers+0x618/0x7ac
[c2affdd0] [c0163670] cfi_amdstd_write_buffers+0x504/0x7ac (unreliable)
[c2affe50] [c015a2d0] concat_write+0xe4/0x140
[c2affe80] [c0158ff4] part_write+0xd0/0xf0
[c2affe90] [c015bdf0] mtd_write+0x170/0x2a8
[c2affef0] [c0073898] vfs_write+0xcc/0x16c
[c2afff10] [c0073f2c] sys_write+0x4c/0x90
[c2afff40] [c0010060] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xfd98a50
    LR = 0x10003840
Instruction dump:
419d02a0 98010009 800100a4 2f800003 419e0508 2f170000 419a0098 3d20c507
a0e1002e 81699624 39299624 7f8b4800 419e007c a0610016 7d264b78
Kernel panic - not syncing: Fatal exception in interrupt
Rebooting in 1 seconds..
An oops gives a bunch of information useful in diagnosing a crash. It starts with the address of the crash, the reason ("access of bad area"), and the contents of the registers. The call trace answers the question "how did we get here?" The first entry in the list is the most recent call. Reading the interrupt portion of the trace from the bottom up: an interrupt arrived (do_IRQ) because the Atheros WiFi adapter received a packet (ath_rx_poll). That routine passed the packet to the generic WiFi code (ieee80211_input), which in turn passed it up to the network stack (netif_receive_skb) and from there into the manet module (IF_netif_rx, then mesh_packet_in), where the fault occurred.
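As a rough illustration of reading the trace mechanically, here is a minimal Python sketch that pulls the symbol and module out of a few trace entries copied from the question and prints the call path oldest-first. The regular expression and the "vmlinux" label for entries without a module are my own assumptions for illustration, not part of any kernel tool.

```python
import re

# A few entries copied from the call trace in the question (most recent first).
TRACE = """\
[c2aff9d0] [c5065f04] mesh_packet_in+0x8f0/0xdac [manet] (unreliable)
[c2affad0] [c5061ff8] IF_netif_rx+0xa0/0xb0 [manet]
[c2affae0] [c01925e4] netif_receive_skb+0x34/0x3c4
[c2affb10] [c60b5f74] netif_receive_skb_debug+0x2c/0x3c [wlan]
[c2affbf0] [c6105b04] ath_rx_poll+0x884/0xab8 [ath_pci]
"""

# [stack pointer] [return address] symbol+0xoffset/0xsize [optional module]
FRAME = re.compile(
    r"\[(?P<sp>[0-9a-f]+)\] \[(?P<addr>[0-9a-f]+)\] "
    r"(?P<sym>\w+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?"
)

def parse_trace(text):
    """Return (symbol, module) pairs in the order they appear in the trace."""
    frames = []
    for line in text.splitlines():
        m = FRAME.match(line)
        if m:
            # "vmlinux" is just a label chosen here for core-kernel symbols.
            frames.append((m["sym"], m["module"] or "vmlinux"))
    return frames

def call_path(frames):
    """Render the frames oldest-first, i.e. in the order the calls happened."""
    return " -> ".join(f"{sym} [{mod}]" for sym, mod in reversed(frames))

if __name__ == "__main__":
    print(call_path(parse_trace(TRACE)))
```

Seeing which module each frame belongs to makes it easier to spot where control leaves well-tested core code and enters an out-of-tree module.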
To figure out the exact code causing the problem, you can run
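and then disassemble the function in question, mesh_packet_in(). The NIP line places the faulting instruction (0xc50659ec) at mesh_packet_in+0x3d8, which is inside the function (its size is 0xdac). The LR value (0xc5065f04, mesh_packet_in+0x8f0) is the return address of the most recent call, not the crash site. You can also use the gdb command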
(gdb) info line 0xc50659ec
to figure out which function contains this address.
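The oops text alone is enough to check which function an address falls in, because each symbolized entry has the form symbol+offset/size. A minimal Python sketch of that arithmetic, using the NIP and LR lines from the question (the helper name function_range is made up for this example):

```python
import re

# Matches "[address] symbol+0xoffset/0xsize" as printed on the NIP/LR lines.
SYMBOL = re.compile(r"\[([0-9a-f]+)\] (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")

def function_range(line):
    """Return (symbol, start, end) for a symbolized oops line; end is exclusive."""
    m = SYMBOL.search(line)
    if m is None:
        raise ValueError("no symbol+offset/size found in: " + line)
    addr = int(m.group(1), 16)
    offset = int(m.group(3), 16)
    size = int(m.group(4), 16)
    start = addr - offset          # address minus offset gives the function start
    return m.group(2), start, start + size

nip_line = "NIP [c50659ec] mesh_packet_in+0x3d8/0xdac [manet]"
lr_line = "LR [c5065f04] mesh_packet_in+0x8f0/0xdac [manet]"

if __name__ == "__main__":
    for line in (nip_line, lr_line):
        sym, start, end = function_range(line)
        print(f"{sym}: 0x{start:08x} .. 0x{end:08x}")
```

Both lines give the same start address, 0xc5065614, so the faulting instruction and the saved link register both lie within mesh_packet_in().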
You should first try to find the source of the code that has crashed. In this specific case, the oops reports that the crash happened in mesh_packet_in of the manet module, at offset 0x3d8 (the NIP line; the 0x8f0 offset on the LR line is the return address of the last call, not the crash site). It also reports that the instructions around this point are 419d02a0 98010009 ... So inspect the module with "objdump -d" to confirm that the function and offset reported are correct. Then check the source for what it is doing; you can use the register list to confirm again that you are looking at the right instruction.
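To illustrate cross-checking the instruction dump against the register list, here is a small Python sketch that decodes a few PowerPC D-form instructions (opcode, target register, base register, signed 16-bit immediate). The opcode table covers only a handful of instructions that appear in this dump; treat it as a hand-rolled aid, with "objdump -d" remaining the authoritative disassembly.

```python
# Primary opcodes for a few PowerPC D-form instructions (not a full table).
D_FORM = {14: "addi", 15: "addis", 32: "lwz", 38: "stb", 40: "lhz"}

def decode_d_form(word):
    """Split a 32-bit instruction word into (mnemonic, rt, ra, signed immediate).

    The mnemonic is None when the primary opcode is not in D_FORM.
    """
    opcode = word >> 26
    rt = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:               # sign-extend the 16-bit immediate
        imm -= 0x10000
    return D_FORM.get(opcode), rt, ra, imm

if __name__ == "__main__":
    # Three words taken from the instruction dump in the question.
    for word in (0x3D20C507, 0x81699624, 0x39299624):
        print(f"{word:08x}", decode_d_form(word))

    # addis r9,0,0xc507 followed by a -27100 displacement from r9:
    high = (decode_d_form(0x3D20C507)[3] << 16) & 0xFFFFFFFF
    print(hex((high + decode_d_form(0x81699624)[3]) & 0xFFFFFFFF))
```

Under this decoding, 3d20c507 sets r9 to 0xc5070000 and 81699624 loads r11 from that address minus 27100, that is, from 0xc5069624. This is consistent with the register list: GPR09 holds c5069624 and GPR11 holds 33343a31, the same value as the faulting address in DAR, which suggests the bad pointer was read from that location in the module's data.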
When you know what C statement is faulting, you need to read the source to find out where the bogus data were coming from.
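One quick check on where bogus data might have come from is to see whether the bad value looks like text. A minimal Python sketch, assuming a big-endian machine (which matches 32-bit PowerPC as typically configured, but is an assumption here) and using a made-up helper name as_ascii:

```python
def as_ascii(value, width=4):
    """Return the value's bytes as a string if they are all ASCII text, else None."""
    raw = value.to_bytes(width, "big")   # big-endian byte order is assumed
    printable = all(0x20 <= b < 0x7F or b in (0x09, 0x0A, 0x0D) for b in raw)
    return raw.decode("ascii") if printable else None

if __name__ == "__main__":
    # DAR, GPR08 and GPR09 from the oops in the question.
    for reg in (0x33343A31, 0x3A33340A, 0xC5069624):
        print(f"{reg:08x} -> {as_ascii(reg)!r}")
```

The faulting address 0x33343a31 decodes to "34:1" and GPR08 (0x3a33340a) to ":34\n", while a genuine kernel address such as 0xc5069624 does not decode at all. That pattern hints that string data, possibly part of a formatted time or log line, was written over a pointer, so a buffer overrun near that pointer is a reasonable place to start looking.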
Install this into your kernel; then, when it oopses, you'll be dropped into a gdb-like interface that you can poke around in. That said, it looks like the manet module is dereferencing a bad pointer.