# Seemingly random kernel crash

## Freman

Hi all,

Didn't know where to go with this one so I figured I'd start here.

Ever since I upgraded my hardware (and kernel at the same time, figured it was about due) I've been getting random crash/reboots.

I've finally managed to capture one of these crashes with the magic of netconsole which I've pasted below.

There seems to be no rhyme or reason as to when and why - it'll stay up for weeks, then crash twice in 12 hours...

Any ideas?

```
------------[ cut here ]------------
kernel BUG at kernel/timer.c:866!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
CPU 2
Modules linked in: netconsole nfsd pppoe pppox ppp_generic slhc bridge stp llc xt_mac ipt_addrtype xt_dscp xt_string ipt_set xt_owner xt_multiport xt_iprange xt_hashlimit xt_DSCP ipt_SET xt_NFQUEUE xt_connmark ip_set_iphash ip_set_iptree ip_set xfs exportfs ath5k mac80211 ath cfg80211 ftdi_sio r8169 usbserial i2c_piix4 k10temp
Pid: 0, comm: kworker/0:1 Tainted: G   M        2.6.36-gentoo-r5 #6 GA-MA770T-UD3P/GA-MA770T-UD3P
RIP: 0010:[<ffffffff8104310f>]  [<ffffffff8104310f>] add_timer+0x6/0x13
RSP: 0018:ffff880001903e88  EFLAGS: 00010282
RAX: 000000010c628afa RBX: ffff88011f690000 RCX: 000000010c628b02
RDX: 0000000000000287 RSI: ffffffff81050eb1 RDI: ffff880118f00df0
RBP: 0000000000000100 R08: 0002f42e1a49a400 R09: 00000000000000fa
R10: 0000000000000000 R11: ffffffff8101e3c9 R12: ffffffffa00893d0
R13: ffff88011f695fd8 R14: ffff880001903eb0 R15: ffff88011f695fd8
FS:  00007ff88fdfd710(0000) GS:ffff880001900000(0000) knlGS:00000000f75bf8d0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000f25aa000 CR3: 0000000103d80000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/0:1 (pid: 0, threadinfo ffff88011f694000, task ffff88011f655a20)
Stack:
 ffffffff810426c3 ffff88011f691c20 ffff88011f691820 ffff88011f691420
<0> ffff88011f691020 ffff880001903eb0 ffff880001903eb0 ffffffff815c8500
<0> 0000000000000001 0000000000000008 0000000000000100 0000000000000141
Call Trace:
 <IRQ>
 [<ffffffff810426c3>] ? run_timer_softirq+0x16c/0x1f7
 [<ffffffff8103db93>] ? __do_softirq+0x8b/0x108
 [<ffffffff8100374c>] ? call_softirq+0x1c/0x28
 [<ffffffff81004ae1>] ? do_softirq+0x31/0x63
 [<ffffffff8103d8d3>] ? irq_exit+0x36/0x7a
 [<ffffffff81004256>] ? do_IRQ+0xa7/0xbd
 [<ffffffff81426e53>] ? ret_from_intr+0x0/0xa
 <EOI>
 [<ffffffff81008b40>] ? default_idle+0x20/0x34
 [<ffffffff81008b40>] ? default_idle+0x20/0x34
 [<ffffffff81008d9a>] ? c1e_idle+0xcd/0xe7
 [<ffffffff81001c6c>] ? cpu_idle+0x57/0x8d
Code: 48 89 df e8 cf f1 ff ff 48 8b 74 24 08 48 89 df e8 0b 3a 3e 00 41 5a 41 5b 5b 5d 41 5c 41 5d 44 89 f0 41 5e c3 48 83 3f 00 74 04 <0f> 0b eb fe 48 8b 77 10 e9 62 fe ff ff 41 56 89 f0 c1 e8 08 41
RIP  [<ffffffff8104310f>] add_timer+0x6/0x13
 RSP <ffff880001903e88>
---[ end trace a36ebc2abcff10ef ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: kworker/0:1 Tainted: G   M  D     2.6.36-gentoo-r5 #6
Call Trace:
 <IRQ>  [<ffffffff81424790>] ? panic+0x9d/0x1a5
 [<ffffffff81426e53>] ? ret_from_intr+0x0/0xa
 [<ffffffff81039a6d>] ? kmsg_dump+0x9b/0x127
 [<ffffffff81427b5f>] ? oops_end+0x9f/0xac
 [<ffffffffa00893d0>] ? sta_info_cleanup+0x0/0x155 [mac80211]
 [<ffffffff81003d3f>] ? do_invalid_op+0x85/0x8f
 [<ffffffff8104310f>] ? add_timer+0x6/0x13
 [<ffffffffa01831d9>] ? br_dev_queue_push_xmit+0x75/0x79 [bridge]
 [<ffffffff810034d5>] ? invalid_op+0x15/0x20
 [<ffffffffa00893d0>] ? sta_info_cleanup+0x0/0x155 [mac80211]
 [<ffffffff8101e3c9>] ? hpet_legacy_next_event+0x0/0x7
 [<ffffffff81050eb1>] ? hrtimer_interrupt+0xe3/0x18c
 [<ffffffff8104310f>] ? add_timer+0x6/0x13
 [<ffffffff810426c3>] ? run_timer_softirq+0x16c/0x1f7
 [<ffffffff8103db93>] ? __do_softirq+0x8b/0x108
 [<ffffffff8100374c>] ? call_softirq+0x1c/0x28
 [<ffffffff81004ae1>] ? do_softirq+0x31/0x63
 [<ffffffff8103d8d3>] ? irq_exit+0x36/0x7a
 [<ffffffff81004256>] ? do_IRQ+0xa7/0xbd
 [<ffffffff81426e53>] ? ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff81008b40>] ? default_idle+0x20/0x34
 [<ffffffff81008b40>] ? default_idle+0x20/0x34
 [<ffffffff81008d9a>] ? c1e_idle+0xcd/0xe7
 [<ffffffff81001c6c>] ? cpu_idle+0x57/0x8d
Rebooting in 3 seconds..
```

----------

## WvR

Looks like something is happening with cpufreqd; apparently it is trying to do something improper ("invalid opcode"). I would say:

- re-emerge cpufreqd just to be sure

- make sure that the config file of cpufreqd is OK

- restart cpufreqd

Also, when moving to a new kernel, make sure the /usr/src/linux symlink points at the right sources. List the available kernel sources with

```
eselect kernel list
```

and select the right entry with

```
eselect kernel set N
```

Some programs rely on header files from the kernel source, so if you compile a program for kernel-X but your computer thinks it is still using kernel-Y then strange things may happen.

----------

## Freman

Thanks for the info

I don't have cpufreqd installed.

I was running cpudyn, but I stopped it after the most recent crash to see whether it was the cause (considering where the crash was happening).

I've re-emerged cpufrequtils just in case I forgot to when I upgraded kernels in December.

Switched to the ondemand governor just to rule out external tools.
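For reference, a minimal sketch of how the active governor could be checked on every core. The helper name `show_governors` is made up for illustration, and the sysfs layout is assumed from the "last sysfs file" line in the trace.

```shell
#!/bin/sh
# Hypothetical helper: print the active cpufreq governor of every CPU.
# Assumes the layout /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor.
show_governors() {
    root=${1:-/sys/devices/system/cpu}   # overridable, e.g. for testing
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue          # skip CPUs without cpufreq support
        cpu=${f#"$root"/}                # e.g. cpu0/cpufreq/scaling_governor
        echo "${cpu%%/*}: $(cat "$f")"   # e.g. cpu0: ondemand
    done
}

show_governors
```

Writing a governor name such as ondemand into the same file as root switches that core's governor, provided the governor is built into the kernel or loaded as a module.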

The symlink is correctly configured to my kernel.

Any more suggestions or shall I leave it at that for a while to see what happens?

Edit: Just thought of something. We have revdep-rebuild to find things that depend on libraries that have been upgraded/broken; is there something similar for kernel dependencies?

----------

## WvR

I thought of cpufreqd because of the "cpufreq" in the last sysfs file mentioned, but indeed this is a generic name. I would recommend reviewing the kernel settings that have to do with frequency scaling and re-emerging the related packages. The error could also be caused by bad hardware, maybe.

I don't know if there is a special tool to detect an erroneous setting of the /usr/src/linux symlink. Usually a package will complain when its configure script is run, with something like "inconsistent headers and kernel version".
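A rough manual check is easy to sketch, though. The helper name `check_kernel_symlink` is made up here, and it assumes the Gentoo convention that the sources live in a directory named linux-<version> where <version> equals the output of `uname -r`; a custom local version suffix in the kernel config would break that assumption.

```shell
#!/bin/sh
# Hypothetical helper: compare the kernel sources symlink with the running kernel.
# Assumes Gentoo-style source directories named linux-<version>.
check_kernel_symlink() {
    link=${1:-/usr/src/linux}      # symlink to inspect
    running=${2:-$(uname -r)}      # release string of the running kernel
    if [ ! -L "$link" ]; then
        echo "$link is not a symlink"
        return 2
    fi
    # Keep only the last path component of the symlink target
    target=$(basename "$(readlink "$link")")
    if [ "$target" = "linux-$running" ]; then
        echo "OK: $link -> $target matches running kernel $running"
    else
        echo "MISMATCH: $link -> $target, running kernel is $running"
        return 1
    fi
}

check_kernel_symlink || true
```

A mismatch is not always a problem (for example right after installing new sources but before rebooting), so treat the result as a hint rather than a verdict.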

----------

## Neo2

According to the function names in the stack trace, it seems to be a problem with power saving ("c1e_idle"), interrupt handling ("ret_from_intr", "do_IRQ") and maybe the wireless card or the bridge ("sta_info_cleanup" from mac80211, "br_dev_queue_push_xmit" from the bridge module).
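To see at a glance which frames in a saved trace belong to loadable modules, a rough sketch like the following could help. The function name `oops_module_frames` is invented for illustration, and the pattern assumes the frame format shown in the first post, where module frames end with the module name in square brackets.

```shell
#!/bin/sh
# Hypothetical helper: list "module symbol" pairs found in an oops log.
# Reads the files given as arguments, or stdin when none are given.
# Frames from built-in code have no trailing [module] tag and are skipped.
oops_module_frames() {
    sed -n 's|.*\] ? \([A-Za-z0-9_.]*\)+0x[0-9a-f]*/0x[0-9a-f]* \[\([A-Za-z0-9_]*\)\]$|\2 \1|p' "$@" | sort -u
}

# Example with three frames copied from the trace in the first post
oops_module_frames <<'EOF'
 [<ffffffffa00893d0>] ? sta_info_cleanup+0x0/0x155 [mac80211]
 [<ffffffff8104310f>] ? add_timer+0x6/0x13
 [<ffffffffa01831d9>] ? br_dev_queue_push_xmit+0x75/0x79 [bridge]
EOF
```

On this input it prints the bridge and mac80211 frames and skips add_timer, which is built into the kernel image.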

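The bug happens at kernel/timer.c line 866. I looked that line up in the 2.6.37 source, where it falls inside the function below. I think the match is correct for your kernel too, because at line 866 we have a BUG_ON statement and your trace shows the RIP in add_timer: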

```
/**
 * add_timer - start a timer
 * @timer: the timer to be added
 *
 * The kernel will do a ->function(->data) callback from the
 * timer interrupt at the ->expires point in the future. The
 * current time is 'jiffies'.
 *
 * The timer's ->expires, ->function (and if the handler uses it, ->data)
 * fields must be set prior calling this function.
 *
 * Timers with an ->expires field in the past will be executed in the next
 * timer tick.
 */
void add_timer(struct timer_list *timer)
{
        BUG_ON(timer_pending(timer));
        mod_timer(timer, timer->expires);
}
EXPORT_SYMBOL(add_timer);
```

This means that if the timer passed to add_timer is already pending, the BUG_ON triggers. The kernel then panics because the BUG was hit while it was handling an interrupt ("Fatal exception in interrupt").

Your motherboard seems to be well tested (AMD SB700 chipset) and has been on the market for some time now. The CPU should be an AM3 one, thus an Athlon II or Phenom II.

Ensure that your BIOS is up to date and try disabling the C1E power saving feature there.

If that doesn't work, what is the output of "lspci -nn"?

What is your current kernel config? Did you configure your kernel manually or did you use genkernel to build it?

Maybe it is a kernel or driver bug that has been solved in 2.6.37. Would you mind trying to upgrade?

Cheers,

Neo2

----------

## Freman

Hmmm wifi... quite possible.

Especially as either hostapd or the driver seem to be dodgy and my box keeps getting hammered by an intel wifi card trying to connect...

Hmmm BIOS... according to dmidecode -s bios-version it's F3. According to the Gigabyte website there's an F11C, which is a beta BIOS, with note number 2 being "Fix C1E and LAN compatibility issue". So I'll look at upgrading that when I get home from work tonight. (Don't suppose there exists a tool to upgrade from Linux so I don't have to mess around with making something DOS-ish bootable and plugging in a video card... and a monitor... eaugh...)

```
00:00.0 Host bridge [0600]: ATI Technologies Inc RX780/RX790 Chipset Host Bridge [1002:5957]
00:06.0 PCI bridge [0604]: ATI Technologies Inc RD790 PCI to PCI bridge (PCI express gpp port C) [1002:597c]
00:0a.0 PCI bridge [0604]: ATI Technologies Inc RD790 PCI to PCI bridge (PCI express gpp port F) [1002:597f]
00:11.0 SATA controller [0106]: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode] [1002:4391]
00:12.0 USB Controller [0c03]: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller [1002:4397]
00:12.1 USB Controller [0c03]: ATI Technologies Inc SB700 USB OHCI1 Controller [1002:4398]
00:12.2 USB Controller [0c03]: ATI Technologies Inc SB700/SB800 USB EHCI Controller [1002:4396]
00:13.0 USB Controller [0c03]: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller [1002:4397]
00:13.1 USB Controller [0c03]: ATI Technologies Inc SB700 USB OHCI1 Controller [1002:4398]
00:13.2 USB Controller [0c03]: ATI Technologies Inc SB700/SB800 USB EHCI Controller [1002:4396]
00:14.0 SMBus [0c05]: ATI Technologies Inc SBx00 SMBus Controller [1002:4385] (rev 3c)
00:14.1 IDE interface [0101]: ATI Technologies Inc SB700/SB800 IDE Controller [1002:439c]
00:14.3 ISA bridge [0601]: ATI Technologies Inc SB700/SB800 LPC host controller [1002:439d]
00:14.4 PCI bridge [0604]: ATI Technologies Inc SBx00 PCI to PCI Bridge [1002:4384]
00:14.5 USB Controller [0c03]: ATI Technologies Inc SB700/SB800 USB OHCI2 Controller [1002:4399]
00:18.0 Host bridge [0600]: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] HyperTransport Configuration [1022:1200]
00:18.1 Host bridge [0600]: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] Address Map [1022:1201]
00:18.2 Host bridge [0600]: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] DRAM Controller [1022:1202]
00:18.3 Host bridge [0600]: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] Miscellaneous Control [1022:1203]
00:18.4 Host bridge [0600]: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] Link Control [1022:1204]
01:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller [10ec:8168] (rev 01)
02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller [10ec:8168] (rev 02)
03:06.0 Ethernet controller [0200]: Atheros Communications Inc. AR5007G Wireless Network Adapter [168c:001d] (rev 01)
03:07.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet [10ec:8169] (rev 10)
03:0e.0 FireWire (IEEE 1394) [0c00]: Texas Instruments TSB43AB23 IEEE-1394a-2000 Controller (PHY/Link) [104c:8024]
```

Edit: found flashboot. It didn't help me with my monitor though; I still had to plug a video card and monitor in to reset the BIOS settings and fix the boot ROM on my NIC (which flashboot seems to have reset to on).

Seems silly in this day and age that we can't reliably change BIOS settings from the OS...

So, we'll run this for a while and see how it goes - hopefully no more crashes.

----------

