# Debugging a kernel issue

## Freebirth Toad

I have encountered a bug with the vanilla sources kernel.  Normally, I would go to kernel.org's bugzilla, but it's down.  Before I brave the LKML, I thought I'd have a go at it here.

Any kernel newer than 2.6.32 (I have tried 2.6.33, 2.6.36, 2.6.37, 2.6.39, and 3.0.4) causes silent data corruption.  I only know because random programs start segfaulting (sometimes the system cannot even boot).  If I md5sum the contents of /usr/bin under a known working kernel, I can reliably get an md5sum failure on at least one file when I check the sums under a newer (broken) kernel.  I'm using the default kernel configuration with only minimal changes to support my hardware, and the problem persists.  There's no oops, nor any messages in the logs (other than random daemons segfaulting because corrupted binaries are being loaded).
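The checksum comparison described above can be scripted so that it is repeatable across kernels.  What follows is only a minimal sketch in Python, not the exact procedure used here: the function names, the JSON manifest format, and the `record`/`verify` command line are all illustrative assumptions.

```python
import hashlib
import json
import os
import sys


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map every regular file directly inside `directory` to its digest."""
    manifest = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip symlinks and subdirectories; only real file contents matter.
        if os.path.isfile(path) and not os.path.islink(path):
            manifest[name] = md5_of_file(path)
    return manifest


def find_mismatches(directory, manifest):
    """Return the names whose current digest differs from the manifest."""
    return [
        name
        for name, expected in sorted(manifest.items())
        if md5_of_file(os.path.join(directory, name)) != expected
    ]


# Hypothetical command line: record|verify <directory> <manifest.json>
if __name__ == "__main__" and len(sys.argv) == 4:
    mode, directory, manifest_path = sys.argv[1:]
    if mode == "record":
        with open(manifest_path, "w") as out:
            json.dump(build_manifest(directory), out, indent=1)
    else:
        with open(manifest_path) as src:
            bad = find_mismatches(directory, json.load(src))
        print("\n".join(bad) if bad else "all files match")
        sys.exit(1 if bad else 0)
```

The idea would be to run it in `record` mode under the known-good kernel and in `verify` mode under each kernel being tested; the non-zero exit status on a mismatch makes it easy to call from other scripts.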

The symptoms are those of hardware failure, except that my SMART stats are fine (no relocated blocks), and I pass memtest86 running for hours.  But most importantly, I have run the old kernel without a crash or any sign of problem for around 95 days.  It was suspended to disk for a good portion of that, but this was while I was writing my thesis, and I didn't lose anything.  Everything seems rock solid.

If I had to guess, I would say that it has something to do with the rewrite of the sata_via portion of libata that was done in 2.6.33.  It's probably some sort of weird interaction between my particular drive controller and my particular models of hard drives.  Googling has turned up nothing, but I'm sort of at a loss for what I should even be searching for.

My question is procedural: what should I do next to figure this out?  How can I get more information about what is occurring?

----------

## Jaglover

I had a similar problem once; the power supply was out of spec.

----------

## djinnZ

If the hardware demands more than the PSU can supply, I have seen semi-random issues.

Other hardware problems (interrupt and DMA conflicts and so on), especially on the I/O controllers, are often reported by the kernel as memory issues.

The only message printed is a memory corruption report before the system resets itself.

For my problematic motherboard I found no solution other than throwing it away.

In the other case I solved it with a PSU upgrade, but I am talking about a box with 11 storage units (6 PATA disks + PATA DVD + 2 SATA + 3 USB + one self-powered USB) connected and 7 PCI devices (2 controllers + 3 net + 1 wifi + graphics card), and the only symptom was a semi-random reset (often after I started to move files between different devices in sleep mode).

I don't remember whether 2.6.32 supports it, but you could consider using it as a recovery kernel (see the kdump documentation) to debug after a crash.

That's my two cents.

----------

## Freebirth Toad

 *djinnZ wrote:*   

> If the hardware demands more than the PSU can supply, I have seen semi-random issues.
> 
> Other hardware problems (interrupt and DMA conflicts and so on), especially on the I/O controllers, are often reported by the kernel as memory issues.
> 
> The only message printed is a memory corruption report before the system resets itself.

 

I have compiled the memtest routines into the newer kernels, and they report no problems.  I suppose they might be running before the drive controller is initialized, and thus might be missing the "memory corruption" that is potentially happening.  But no errors are being reported.  Does anyone know what kernel configuration options I should enable to get maximum verbosity in the logs?

 *djinnZ wrote:*   

> For my problematic motherboard I found no solution other than throwing it away.

 

Yeah, it's increasingly looking like I'm going to have to do that.

 *djinnZ wrote:*   

> In the other case I solved it with a PSU upgrade, but I am talking about a box with 11 storage units (6 PATA disks + PATA DVD + 2 SATA + 3 USB + one self-powered USB) connected and 7 PCI devices (2 controllers + 3 net + 1 wifi + graphics card), and the only symptom was a semi-random reset (often after I started to move files between different devices in sleep mode).

 

The 450W PSU is a pretty good upgrade in quality from the original 200W PSU for this system.  The original PSU ran the system without problem for many years.  I have two SATA HDs, an IDE flash interface card, a puny graphics card, and no optical drive.  It draws much less than 150 watts under load (and I've measured it).

 *djinnZ wrote:*   

> I don't remember whether 2.6.32 supports it, but you could consider using it as a recovery kernel (see the kdump documentation) to debug after a crash.

 

The kernel is not crashing.  The only crashes are in user space.  So there's nothing for kdump to report.

 *djinnZ wrote:*   

> That's my two cents.

 

I appreciate your guesses, but what I'm asking is for a way to discover the relevant information to make a useful bug report that will ultimately result in a kernel that doesn't have this issue.  Since the old kernel works perfectly, I'm 99% sure this is a software issue, and not a hardware issue, and thus it should have a software solution.  I'm willing to do the legwork myself, I just don't know what I should be doing.

----------

## djinnZ

First, verify the configuration (don't rely on oldconfig), and try disabling heap randomization (though I suppose you are not trying to run legacy binaries) and "uninline functions" in the kernel config.

Memtest results are not 100% reliable; the physical memory can be damaged even if the tests pass. Try swapping the module positions if possible.

Excluding any hardware issue is important.

On the motherboard I threw away, memtest never reported any errors, but recompiling gcc three times resulted in a broken gcc.

And I forgot to mention that I now have two identical abit motherboards running.

Another source of strange issues, strictly software related, can be the libc and the base runtime components. Have you recompiled them? (Make a binary package as a backup first.)

It is also possible that some files (libraries or binaries) are corrupted. Do the crashes involve only some applications, or all applications that depend on a specific library (or try to access a specific device)?

I suggest you first check whether the crashes have anything in common besides the kernel upgrade.

I'm not sure, but if I remember correctly KSM was introduced with 2.6.33.

As a start, you can compare core dumps of a single program taken before the crash and at the crash.

----------

## Freebirth Toad

 *djinnZ wrote:*   

> First, verify the configuration (don't rely on oldconfig), and try disabling heap randomization (though I suppose you are not trying to run legacy binaries) and "uninline functions" in the kernel config.
> 
> Memtest results are not 100% reliable; the physical memory can be damaged even if the tests pass. Try swapping the module positions if possible.
> 
> Excluding any hardware issue is important.
> ...

 

To completely eliminate the possibility that something in my software stack has been corrupted, I downloaded the latest Gentoo minimal install ISO (which has a 2.6.39 kernel), booted it, mounted my disks read-only, and successfully reproduced the read corruption (once again, by MD5summing all the files in /usr/bin from a working kernel beforehand, and then checking them from the install CD kernel, I got one file to fail its checksum).
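One way to narrow this down further would be to hash the same file several times, dropping the page cache between passes so that every read goes back to the disk.  The following is a rough sketch only: it assumes root privileges and Linux's `/proc/sys/vm/drop_caches` interface, and the function names are made up for illustration.

```python
import hashlib
from collections import Counter


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def drop_page_cache():
    """Ask Linux to discard clean cached pages, dentries and inodes (needs root)."""
    with open("/proc/sys/vm/drop_caches", "w") as ctl:
        ctl.write("3\n")


def repeated_read_digests(path, passes=5, before_each_read=drop_page_cache):
    """Hash `path` several times and count how often each digest appears."""
    seen = Counter()
    for _ in range(passes):
        # Without this, later passes would be served from RAM, not the disk.
        before_each_read()
        seen[md5_of_file(path)] += 1
    return seen
```

If a call such as `repeated_read_digests("/usr/bin/some-file")` (a placeholder path) returns more than one distinct digest, that would suggest the data is being damaged somewhere between the disk and memory rather than sitting corrupted on disk; a single digest that is consistently wrong would point elsewhere.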

The stability of the old kernel indicates to me that it's not a hardware issue.  I've recompiled it six times, the latest with my current version of gcc, and I've never had a problem with it.  If I had bad memory, what are the chances that ALL those different kernel binaries would dodge the issue?

Anyway, I should have been more clear in the beginning.  I am convinced that it's not a hardware failure (and I've proved it's not in my usermode software stack).  I'm convinced that it's a kernel bug.  I am not trying to convince anyone that this is true, nor am I looking for solutions that assume otherwise.  What I'm looking for is a method for finding hard evidence that will convince others.  Suppose it really is a kernel bug.  What should I do to isolate the part of the kernel source that's causing it?

----------

## Jaglover

It may not be that simple. I have two cases for you. In the first case everything worked great, but the ogg encoder produced broken files. Trying to solve this, I finally downclocked the CPU and it magically fixed ogg encoding. In the second case I had file corruption on an AMD system with an add-on SCSI controller. I switched to onboard EIDE: same thing, random file corruption. I replaced the CPU, still the same. The last thing I swapped out was the PSU, and that fixed it.

My point is newer kernels may have a rewritten driver which exposes your hardware problem.

----------

## djinnZ

 *Freebirth Toad wrote:*   

> The stability of the old kernel indicates to me that it's not a hardware issue.

  *Jaglover wrote:*   

> My point is newer kernels may have a rewritten driver which exposes your hardware problem.

Mine too. In fact I always assume it can be a sum of different factors, so I start by excluding the common hardware issues.

 *Freebirth Toad wrote:*   

> Suppose it really is a kernel bug.  What should I do to isolate the part of the kernel source that's causing it?

I am not very skilled at debugging because I work in an environment (PaX + grsec + hardening + the -g0 gcc option) where debugging is an illusion, but as far as I know the starting point is to get a kernel dump. You need to configure a second kernel and the SysRq keys to do this.

Of course, a good start is to reduce the kernel configuration to the minimum, in order to simplify the analysis and prevent conflicts.

Also read kmemleak.txt and the documentation at http://kgdb.sourceforge.net, and check whether you have IRQ remapping issues, PCI quirks, etc., and/or ACPI problems (the infamous libata.noacpi=1 command-line option and related ones, or a special driver for WMI/BIOS may be needed).

Another cause of strange issues can be a DSDT that needs fixes. Try recompiling it, just to be sure, and see if there are issues in the 

If you have also verified that there are no temperature or voltage problems, disable the hardware sensors, check the RAM and bus timing values in the BIOS configuration, and increase the spread spectrum option.

This is all I can suggest.

----------

## DirtyHairy

Since you know that 2.6.32 works and 2.6.33 doesn't, you could try bisecting with git to isolate the change which caused the problem. However, I'd also consider the possibility that you are witnessing a hardware error exposed by the newer kernel code (it could be something which causes the newer version of a driver to misbehave and corrupt memory).
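With git, a session would begin with `git bisect start`, then `git bisect bad v2.6.33` and `git bisect good v2.6.32`; after that each step is build, boot, rerun the checksum test, and report the result with `git bisect good` or `git bisect bad`. The sketch below is not a replacement for that (git handles the real, branching history itself); it only illustrates why the number of kernels to build stays manageable. The commit count and the culprit index are made-up numbers, and it assumes the failure reproduces reliably at every step, which for intermittent corruption may mean repeating the check several times before calling a kernel good.

```python
def first_bad_commit(n_commits, is_bad):
    """Binary-search for the first bad commit in a linear history.

    Commits are numbered 0..n_commits-1, oldest first. Assumes the last one
    is bad and that every commit after the first bad one is bad as well.
    Returns (index_of_first_bad, number_of_kernels_tested).
    """
    low, high = 0, n_commits - 1
    tested = 0
    while low < high:
        mid = (low + high) // 2
        tested += 1  # in real life: build, boot, and run the checksum test
        if is_bad(mid):
            high = mid       # culprit is mid or something older
        else:
            low = mid + 1    # culprit is newer than mid
    return low, tested


if __name__ == "__main__":
    N_COMMITS = 10_000  # hypothetical size of the range being bisected
    CULPRIT = 6_789     # hypothetical index of the offending commit
    index, tested = first_bad_commit(N_COMMITS, lambda i: i >= CULPRIT)
    print(f"first bad commit: {index}, kernels tested: {tested}")
```

If the suspicion about the libata changes is right, the search can probably be shortened further by limiting it to commits that touch that code, for example `git bisect start -- drivers/ata`, at the risk of missing a culprit that lives elsewhere in the tree.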

----------

