# Need some help diagnosing an oops [solved]

## bpkent

I wish I knew where to start. For me the beginning was Dec 31st 2011, with what I thought would be a straightforward kernel upgrade from gentoo-sources 3.0.6 to 3.1.6 (x86_64).  I followed what are for me the usual steps (emerge -uvDN --with-bdep y world, revdep-rebuild -vi, copy previous .config, make menuconfig (esc, esc, save), genkernel, module-rebuild rebuild, update grub configuration, reboot), the ones I've been following since the initial install at 2.6.31-r10 back in April 2010.

Up came KDE, all seemed well and I started running my apps.  Mouse locked up, screen went black (except for the mouse pointer), keyboard unresponsive, no reply to pings from another machine.  Oh heck, reset, revert to 3.0.6, and back to stability while my md raid10 devices resynced.  Take a look at /var/log/messages, unfortunately the tail end of it appears corrupted, nothing to indicate what happened.

I got into such a panic about what I saw with mdadm --detail that I joined the md mail list and asked some newbie questions to make sure my data was safe.  As a further precaution I set up another machine to rsync said data.  I got a reply from the md list that left me reasonably confident there was nothing troublesome with my raid10 config.  On the plus side I had a second machine, with pretty much the same software stack and the same data, on which to do some testing.  Kernel 3.1.6, KDE, vmware, it all got installed, start up, go through the same set of operations, waiting for the thing to fail.  It refuses to follow the same pattern.  OK, maybe it was a one off, boot primary machine to 3.1.6 again.  Oh heck, the same lock up as the first time, and another load of garbage at the end of the messages file.

Then 3.1.10 gets marked as stable, I give that a go.  Another lock up.  Another load of garbage at the end of the messages file.

Then 3.2.1 gets marked stable, and I give that a go. Though this time after KDE starts, I CTRL+ALT+F1 to the console, remotely login and start x11vnc, and then a vnc session to bring up my apps, hoping that something gets echoed to the console.  Sure enough there I see oops after oops scroll by after KDE is up and I start running my apps.  Great, I think, maybe this time I'll see something at the end of messages.  No such luck.

A long time ago I worked as a software support engineer, and learned to appreciate the value of good diagnostics, enough to realize that I couldn't start a bug or a thread without something a little more concrete.  I "rc-update del xdm default", reboot to 3.2.1 and mount a non-md filesystem I don't use by default.  I tail -f /var/log/messages and redirect the output to a file on this filesystem, hoping that if I can get an oops it'll be copied there in a human readable format.  Start up KDE, start up my best guess at the app that triggers the oops (vmware workstation). Start a VM.  Start an update of vmware tools in the VM, and the oopses begin, this time copied to the non-md filesystem.  Now that I'm watching things, the death appears to be a slow one, I can go back and forth between the GUI and the console to watch the oopses. The tail worked, and here's the first fault:

```
kernel: [ 1472.767683] general protection fault: 0000 [#1] SMP 

kernel: [ 1472.767694] CPU 8 

kernel: [ 1472.767695] Modules linked in: hidp vmnet(O) vmblock(O) vsock(O) vmci(O) vmmon(O) rfcomm bnep iptable_nat nf_nat iptable_mangle ipt_REJECT xt_pkttype xt_tcpudp ipt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_state iptable_filter ip_tables x_tables btusb bluetooth uvcvideo snd_hda_codec_hdmi rfkill i2c_i801 snd_hda_codec_realtek joydev firewire_ohci i7core_edac serio_raw pcspkr firewire_core snd_hda_intel processor edac_core snd_hda_codec button thermal_sys iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse xfs nfs jfs reiserfs raid1 raid0 dm_snapshot dm_crypt dm_mirror dm_region_hash dm_log scsi_wait_scan hid_monterey hid_microsoft hid_logitech ff_memless hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech usbhid sx8 DAC960 cciss mptsas scsi_transport_sas mptfc scsi_transport_fc scsi_tgt mptspi scsi_transport_spi mptscsih mptbase sg sata_mv

kernel: [ 1472.767780] 

kernel: [ 1472.767786] Pid: 11597, comm: udisks-daemon Tainted: G           O 3.2.1-gentoo-r2 #3 Gigabyte Technology Co., Ltd. X58A-UD5/X58A-UD5

kernel: [ 1472.767797] RIP: 0010:[<ffffffff810d9484>]  [<ffffffff810d9484>] kmem_cache_alloc+0x54/0xd0

kernel: [ 1472.767810] RSP: 0018:ffff8805fcd0d988  EFLAGS: 00010002

kernel: [ 1472.767816] RAX: 0000000000000000 RBX: ffffffff81a3a920 RCX: 0000000000000000

kernel: [ 1472.767823] RDX: 000000000005b940 RSI: 0000000000013840 RDI: ffff880607002800

kernel: [ 1472.767830] RBP: ffff8805fcd0d9b8 R08: ffff88061fd13840 R09: ffff880492945100

kernel: [ 1472.767837] R10: ffffffffffffffff R11: ffff8805fd5e2000 R12: ffff880607002800

kernel: [ 1472.767844] R13: 0e00000000020070 R14: 0000000000000020 R15: ffffffff81378085

kernel: [ 1472.767851] FS:  00007f46f31bd700(0000) GS:ffff88061fd00000(0000) knlGS:0000000000000000

kernel: [ 1472.767858] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b

kernel: [ 1472.767865] CR2: 00007fffca993f64 CR3: 00000005ff025000 CR4: 00000000000006e0

kernel: [ 1472.767871] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

kernel: [ 1472.767878] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

kernel: [ 1472.767886] Process udisks-daemon (pid: 11597, threadinfo ffff8805fcd0c000, task ffff8806008e8000)

kernel: [ 1472.767893] Stack:

kernel: [ 1472.767898]  ffff8805fcd0d9b8 ffffffff81a3a920 ffff880492945100 0000000000000020

kernel: [ 1472.767907]  0000000000000020 0000000000000000 ffff8805fcd0d9e8 ffffffff81378085

kernel: [ 1472.767916]  0000000000000000 ffff8805fea8b020 ffff8805fea8b150 ffff8805fd5e2138

kernel: [ 1472.767925] Call Trace:

kernel: [ 1472.767934]  [<ffffffff81378085>] scsi_pool_alloc_command+0x45/0x80

kernel: [ 1472.767942]  [<ffffffff813782de>] scsi_host_alloc_command.clone.9+0x2e/0x90

kernel: [ 1472.767950]  [<ffffffff81378369>] __scsi_get_command+0x29/0xc0

kernel: [ 1472.767957]  [<ffffffff81378443>] scsi_get_command+0x43/0xc0

kernel: [ 1472.767966]  [<ffffffff8137f275>] scsi_setup_blk_pc_cmnd+0x145/0x170

kernel: [ 1472.767974]  [<ffffffff81389368>] sd_prep_fn+0x558/0xdc0

kernel: [ 1472.767983]  [<ffffffff81206d04>] blk_peek_request+0xb4/0x200

kernel: [ 1472.767991]  [<ffffffff8137fdff>] scsi_request_fn+0x4f/0x470

kernel: [ 1472.767999]  [<ffffffff81204066>] __blk_run_queue+0x16/0x20

kernel: [ 1472.768007]  [<ffffffff8120aa39>] blk_execute_rq_nowait+0x69/0xd0

kernel: [ 1472.768015]  [<ffffffff8120ab29>] blk_execute_rq+0x89/0x130

kernel: [ 1472.768023]  [<ffffffff8137f469>] scsi_execute+0xe9/0x180

kernel: [ 1472.768030]  [<ffffffff8137f5a9>] scsi_execute_req+0xa9/0x120

kernel: [ 1472.768038]  [<ffffffff81379873>] ioctl_internal_command.clone.4+0x63/0x1a0

kernel: [ 1472.768048]  [<ffffffff8103b950>] ? wake_up_process+0x10/0x20

kernel: [ 1472.768056]  [<ffffffff81054d5f>] ? wake_up_worker+0x1f/0x30

kernel: [ 1472.768063]  [<ffffffff81054dec>] ? insert_work+0x7c/0x90

kernel: [ 1472.768071]  [<ffffffff81379a26>] scsi_set_medium_removal+0x76/0xa0

kernel: [ 1472.768079]  [<ffffffff810b2fb9>] ? bdi_lock_two+0x59/0x70

kernel: [ 1472.768087]  [<ffffffff81388d71>] sd_release+0x81/0xe0

kernel: [ 1472.768094]  [<ffffffff8111172c>] __blkdev_put+0x18c/0x1c0

kernel: [ 1472.768102]  [<ffffffff8105605a>] ? queue_work+0x1a/0x20

kernel: [ 1472.768109]  [<ffffffff811117b2>] blkdev_put+0x52/0x140

kernel: [ 1472.768116]  [<ffffffff811118bf>] blkdev_close+0x1f/0x30

kernel: [ 1472.768124]  [<ffffffff810e2892>] fput+0xe2/0x210

kernel: [ 1472.768132]  [<ffffffff810dee71>] filp_close+0x61/0x90

kernel: [ 1472.768139]  [<ffffffff810def32>] sys_close+0x92/0xf0

kernel: [ 1472.768148]  [<ffffffff815b713b>] system_call_fastpath+0x16/0x1b

kernel: [ 1472.768154] Code: f6 4c 8b 7d 08 a8 10 75 7e 4d 8b 04 24 65 4c 03 04 25 08 cb 00 00 49 8b 50 08 4d 8b 28 4d 85 ed 74 6f 49 63 44 24 20 49 8b 34 24 <49> 8b 5c 05 00 48 8d 4a 01 4c 89 e8 65 48 0f c7 0e 0f 94 c0 84 

kernel: [ 1472.768189] RIP  [<ffffffff810d9484>] kmem_cache_alloc+0x54/0xd0

kernel: [ 1472.768197]  RSP <ffff8805fcd0d988>

kernel: [ 1472.768203] ---[ end trace 641e5afc3c4dfccb ]---
```
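The capture step described above (following the log and writing a copy to a filesystem that is not on the md arrays) can be sketched roughly as follows. This is a minimal sketch, not the exact commands used: the function name `capture_log` and the example paths are assumptions, and `-F` assumes a `tail` that supports it (GNU coreutils does).

```shell
#!/bin/sh
# Follow a log file and append every new line to a copy elsewhere.
# capture_log is a hypothetical helper name; both paths are supplied by the caller.
capture_log() {
    src=$1   # log to follow, e.g. /var/log/messages
    dst=$2   # copy on another filesystem, e.g. /mnt/scratch/oops.log (illustrative path)
    # -n 0: start at the current end of the file
    # -F:   keep following by name, even if the log is rotated
    tail -n 0 -F "$src" >> "$dst" &
    # Print the PID of the background tail so the caller can stop it later
    echo $!
}

# Example (paths are assumptions):
#   pid=$(capture_log /var/log/messages /mnt/scratch/oops.log)
#   ... reproduce the problem ...
#   kill "$pid"
```

Starting the copy before launching X means the file already exists and is being written when the first oops arrives.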

My log has 55 occurrences of the string "O 3.2.1-gentoo-r2 #3 Gigabyte", the one that follows the above is:

```
kernel: [ 1532.662596] Pid: 10245, comm: AK_V8_linux64_s Tainted: G      D    O 3.2.1-gentoo-r2 #3 Gigabyte Technology Co., Ltd. X58A-UD5/X58A-UD5
```

"AK_V8_linux" is a BOINC client application.

There are only 2 of "[ end trace" messages in the log.
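Counts like these can be reproduced from the saved log with `grep`, and the `comm:` field gives a quick tally of which processes were hit. A minimal sketch, assuming the log was saved to a file; the helper names `count_lines` and `tally_by_process` are made up for illustration.

```shell
#!/bin/sh
# Count the lines of a saved log that contain a literal string.
count_lines() {
    # -c: print the number of matching lines
    # -F: treat the pattern as a fixed string, so "[" and "#" need no escaping
    grep -cF -- "$1" "$2"
}

# Tally oops headers by process name, taken from the "Pid: N, comm: NAME" lines.
tally_by_process() {
    sed -n \
        -e 's/.*Pid: [0-9]*, comm: \(.*\) Tainted: .*/\1/p' \
        -e 's/.*Pid: [0-9]*, comm: \(.*\) Not tainted .*/\1/p' "$1" |
        sort | uniq -c | sort -rn
}

# Example (oops.log is an assumed file name):
#   count_lines 'O 3.2.1-gentoo-r2 #3 Gigabyte' oops.log
#   count_lines '[ end trace' oops.log
#   tally_by_process oops.log
```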

Time to Google to see if anybody else has encountered something remotely similar.  No such luck, that, or more likely, I don't know how to research this type of thing.

Any pointers?  

If it's hardware, why is it only exposed by kernels > 3.0.6? If it's app-emulation/vmware-workstation-8.0.1.528992-r2, why does the other machine not exhibit the same problem (it stayed up for hours)?  The kernels between the two machines are as similar to each other as I could make them.  There are some notable differences between them, the machine that crashes has bluetooth for mouse and keyboard, is connected via USB to my UPSs (nut and knutclient), a different CPU (both are Intel 64 bit), motherboard, RAM, video card (both are ATI) and sound card.

Last edited by bpkent on Sun Apr 01, 2012 6:49 pm; edited 1 time in total

----------

## NeddySeagoon

bpkent,

 *bpkent wrote:*   

> ... copy previous .config, make menuconfig (esc, esc, save)... 

 

I would prefer to see a make oldconfig in there rather than make menuconfig.

How do you run genkernel?

It may throw away your kernel  .config anyway.

Your kernel is tainted because of all of the vmware modules you have loaded, which are well known for this sort of thing.

Prevent the vmware modules loading, reboot and try to reproduce the issue.

If that fixes it, try a different VM. e.g. Virtualbox or KVM.
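For reference, the `(O)` after each module name and the letters after `Tainted:` in the oops correspond to bits in the numeric value the kernel exposes in `/proc/sys/kernel/tainted`. The sketch below decodes only the three flags that appear in this thread (`G`/`P`, `D`, `O`) and ignores every other taint bit; the function name `decode_taint` is made up for illustration.

```shell
#!/bin/sh
# Decode a numeric taint value into the flags seen in this thread.
# Only bits 0, 7 and 12 are handled; all other taint bits are ignored.
decode_taint() {
    value=$1
    if [ "$value" -eq 0 ]; then
        echo "Not tainted"
        return 0
    fi
    # bit 0 (1): a proprietary module was loaded ("P"); otherwise "G"
    if [ $(( $value & 1 )) -ne 0 ]; then flags="P"; else flags="G"; fi
    # bit 7 (128): the kernel has already oopsed or hit a BUG ("D")
    if [ $(( $value & 128 )) -ne 0 ]; then flags="$flags D"; fi
    # bit 12 (4096): an out-of-tree module was loaded ("O")
    if [ $(( $value & 4096 )) -ne 0 ]; then flags="$flags O"; fi
    echo "$flags"
}

# Example on a running system:
#   decode_taint "$(cat /proc/sys/kernel/tainted)"
```

Under this reading, the `D` that shows up in the second oops header only records that an earlier oops had already happened; it says nothing new about the cause.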

----------

## bpkent

Thanks for the reply NeddySeagoon,

 *NeddySeagoon wrote:*   

> bpkent,
> 
>  *bpkent wrote:*   ... copy previous .config, make menuconfig (esc, esc, save)...  
> 
> I would prefer to see a make oldconfig in there rather than make menuconfig.

 

Isn't that the opposite of what the Gentoo Kernel Upgrade Guide says to do?

 *Upgrade Guide wrote:*   

> A much safer upgrading method is to copy your config as previously shown, and then simply run make menuconfig. This avoids the problems of make oldconfig mentioned previously, as make menuconfig will load up your previous configuration as much as possible into the menu

 

 *Quote:*   

> How do you run genkernel?
> 
> It may throw away your kernel  .config anyway.

 

```
genkernel --install --mountboot --no-clean --splash=natural_gentoo --splash-res=1920x1200 --mdadm --lvm --no-menuconfig all
```

 *Quote:*   

> Your kernel is tainted because of all of the vmware modules you have loaded, which are well known for this sort of thing.
> 
> Prevent the vmware modules loading, reboot and try to reproduce the issue.
> 
> If that fixes it, try a different VM. e.g. Virtualbox or KVM.

 

I am aware that the issue may be the fault of vmware and its modules that taint my kernel, and mentioned my use of the product because the temporal coincidence of the errors and my use of a VM led me to suspect it as a good candidate for the cause.  I am confused about why the same issue does not appear to manifest itself on a second PC with a very similar kernel (I have tried repeatedly to get this other machine to fail, it seems as stable as my current 3.0.6 environment).  I do understand that in the long run I am likely in a better position without dependencies on non-FOSS components, and I did try converting to VirtualBox soon after my switch to Linux, unfortunately I was not very successful.  Perhaps I'll give KVM a try, though (I guess fairly obviously) I'd prefer to get my existing software operational before trying alternatives.

----------

## NeddySeagoon

bpkent,

Your approach to problem solving is good.

I'm not aware of any problems with make oldconfig.  It uses as much of the input .config as it can and only asks for new options.

I have no idea what make menuconfig does for missing unset options.

You could try booting into memtest86+ and running a few complete cycles.  Errors do not necessarily mean you have RAM issues.

It can be much worse.  It depends on the nature of the errors.

The same error at the same address is probably RAM. Intermittent errors are probably not RAM, or not RAM alone.

Do not emerge and run memtest. You have no idea what the results mean then, as it runs through the memory manager.

To be useful, it needs to be on bare hardware.

----------

## bpkent

 *NeddySeagoon wrote:*   

> bpkent,
> 
> Your approach to problem solving is good.
> 
> I'm not aware of any problems with make oldconfig.  It uses as much of the input .config as it can and only asks for new options.
> ...

 

Upgraded the other machine to 3.2.1, copied the 3.1.6 .config went through make oldconfig, accepted defaults, backed up the resulting .config, then recopied the 3.1.6 .config, did a make menuconfig +esc+esc+save.  diff cannot find any differences between them.  I'll try a few other permutations to see if there are places where one or the other breaks.
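The comparison described above can be made insensitive to header comments and line order by comparing only the option lines. A minimal sketch, assuming the two results were saved as separate files; `config_diff` and the file names in the example are hypothetical.

```shell
#!/bin/sh
# Compare two kernel .config files by their option lines only.
# Header comments and line order are ignored.
config_diff() {
    a=$(mktemp)
    b=$(mktemp)
    # Keep "CONFIG_FOO=..." and "# CONFIG_FOO is not set" lines, then sort them
    pattern='^(CONFIG_[A-Za-z0-9_]+=|# CONFIG_[A-Za-z0-9_]+ is not set)'
    grep -E "$pattern" "$1" | sort > "$a"
    grep -E "$pattern" "$2" | sort > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    # 0 when the option sets are identical, non-zero otherwise
    return $status
}

# Example (file names are assumptions):
#   config_diff config-oldconfig config-menuconfig && echo "same options"
```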

I'll give the memtest a go, I did have RAM errors when I first assembled my PC and had replacement chips sent.  If that does not show any issues, I'll rerun my test to see if the error condition reproduces in the same way as yesterday.  I'll also test with the BOINC client disabled, as I've seen some references to this introducing instability in overclocked systems (mine is not such a system).  Thanks for the advice and guidance thus far, I doubt I'll have a chance to update until next weekend, though I'll be sure to post any new findings.

----------

## bpkent

 *bpkent wrote:*   

> I'll give the memtest a go, I did have RAM errors when I first assembled my PC and had replacement chips sent.  If that does not show any issues, I'll rerun my test to see if the error condition reproduces in the same way as yesterday.  I'll also test with the BOINC client disabled, as I've seen some references to this introducing instability in overclocked systems (mine is not such a system).  Thanks for the advice and guidance thus far, I doubt I'll have a chance to update until next weekend, though I'll be sure to post any new findings.

 

Seems my RAM is in a good state, at least memtest could not find any issues after a few hours.  I'm beginning to suspect that it might be a result of patches to vmware-workstation-8.0.1 when running with a kernel >= 3.1.0.  I've just booted into 3.0.17-r2 (used the same copy .config, make menuconfig, etc. etc. method for building the new kernel) and it seems to be stable with my usual mix of apps.  There is a new vmware installer (8.0.2) though this hasn't reached portage yet.  From what I can tell 8.0.2 is compatible with kernels < 3.2.  I might try creating my own ebuild for this and see if it's stable with a 3.1 kernel on the machine that crashes.  Stumped as to why one machine runs without issues with vmware 8.0.1 and a 3.2 kernel, while another repeatedly ends up in a horrible mess.  Oh well.

Thanks again Neddy.

----------

## bpkent

I switched from VMware to VirtualBox a few weeks ago, and, while my system is a little more stable, there were still occasional lock-ups using > 3.0.x kernels.  Last Thursday (March 28th) I replaced the RAM and there was a lock-up 45 seconds after booting to 3.2.12.  I went into the BIOS and disabled hyperthreading, booted once again to 3.2.12, and so far so good.

Not sure that SMT has anything to do with my woes, a couple of weeks ago I ran for 4 days on 3.2.1-r2 without any issues.  Though on replacing some fans the system returned to being unstable with the same kernel, yet stable on 3.0.17.

```
Mar 25 00:38:36 bobby4 kernel: [22219.010371] Modules linked in: joydev hidp rfcomm bnep iptable_nat nf_nat iptable_mangle ipt_REJECT xt_pkttype ipt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_state iptable_filter ip_tables vboxpci(O) vboxnetflt(O) vboxnetadp(O) vboxdrv(O) acpi_cpufreq mperf vhba(O) btusb bluetooth rfkill uvcvideo snd_hda_codec_hdmi pcspkr serio_raw i2c_i801 firewire_ohci snd_hda_codec_realtek firewire_core i7core_edac edac_core snd_hda_intel processor snd_hda_codec button thermal_sys iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse xfs nfs jfs reiserfs raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 dm_snapshot dm_crypt dm_mirror dm_region_hash dm_log scsi_wait_scan hid_monterey hid_microsoft hid_logitech ff_memless hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech usbhid sx8 DAC960 cciss mptsas scsi_transport_sas mptfc scsi_transport_fc scsi_tgt mptspi scsi_transport_spi mptscsih mptbase sg sata_mv

Mar 25 00:38:36 bobby4 kernel: [22219.010448]

Mar 25 00:38:36 bobby4 kernel: [22219.010451] Pid: 17089, comm: kworker/1:2 Tainted: G           O 3.2.1-gentoo-r2 #4 Gigabyte Technology Co., Ltd. X58A-UD5/X58A-UD5

Mar 25 00:38:36 bobby4 kernel: [22219.010458] RIP: 0010:[<ffffffff810d9994>]  [<ffffffff810d9994>] kmem_cache_alloc+0x54/0xd0
```

The other machine I mentioned in the OP does not have an HT enabled CPU, though the kernels for both have "CONFIG_SCHED_SMT=y".
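Whether hyperthreading is actually in effect after a BIOS change can be checked from `/proc/cpuinfo` by comparing the `siblings` and `cpu cores` fields of a package: more siblings than cores suggests SMT is on. A minimal sketch under that assumption; `smt_active` is a made-up helper name, and it only looks at the first processor entry.

```shell
#!/bin/sh
# Report whether SMT looks active, based on a /proc/cpuinfo-style file.
# Only the first "siblings" and "cpu cores" entries are examined.
smt_active() {
    cpuinfo=${1:-/proc/cpuinfo}
    siblings=$(awk -F': *' '/^siblings/ {print $2; exit}' "$cpuinfo")
    cores=$(awk -F': *' '/^cpu cores/ {print $2; exit}' "$cpuinfo")
    if [ -z "$siblings" ] || [ -z "$cores" ]; then
        echo "unknown (fields not found)"
        return 2
    fi
    if [ "$siblings" -gt "$cores" ]; then
        echo "SMT active (siblings=$siblings, cpu cores=$cores)"
        return 0
    fi
    echo "SMT not active (siblings=$siblings, cpu cores=$cores)"
    return 1
}

# Example on a running system:
#   smt_active
#   grep CONFIG_SCHED_SMT /usr/src/linux/.config
```

`CONFIG_SCHED_SMT=y` only makes the scheduler SMT-aware; on its own it does not say whether the CPU has hyperthreading enabled.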

----------

