# megasas: panic under load in >=2.6.17

## guero61

I'm trying to run the hardened/amd64/multilib profile on a Dell 6950 (Opteron SMP), and keep running into random kernel panics.  This occurs on both the stable (2.6.18-r6) and the latest unstable (2.6.20-r2) kernels, using both gcc-3.4.6-r2 (stable for hardened) and gcc-4.1.1-r3.  I hate to post a bug if this isn't one, so I'm checking to see if any of you guys have an idea first.  Some additional points:

 - can't set PCI_BIOS, as it's disabled on x86_64

 - took out the debug option to write-protect readonly kernel structures (relying on PaX solely)

 - I've flashed the most recent firmware to the system, eliminating the SCSI timeout issues.

The panic doesn't seem to be attached to any single action, just doing *something* - if I let it sit quietly, the panic never happens.  Seems to be somewhat tied to having an ethernet interface up, but I'm not doing any bridging.  All searches and fixes for hardened, panic, oops, Aiee, and so on have come up as dead ends.  It's been over a week of work, and I'm all out of ideas.  Ask me questions, offer ideas?

[edit]

This was "hardened-sources random panics", but the evidence thus far doesn't tie it to hardened-sources.

[/edit]

Last edited by guero61 on Fri Apr 27, 2007 12:35 am; edited 1 time in total

----------

## moocha

Shot in the dark: Are you running ntp? Up until 2.6.17 ntp caused a hardlock on my machines when running hardened-sources - perhaps it's back.

----------

## guero61

Yes, I am running NTP.  In fact, part of the typical chain of events has been manually bringing up my network interface & restarting ntpd so it will properly contact its upstream servers.  However, I'm running NTP on my home system (32-bit nosmp hardened-sources-2.6.20-r1) with zero issues, and have been for some time.  I'll still experiment with it.

I'm in the process of trying to get my dumps to run cleanly through ksymoops; hardened-sources makes that nice & hard.

----------

## moocha

I never got to the bottom of it and I tried really hard. In the end I just gave up on updating the kernel and ran 2.6.16 until the problem magically went away in 2.6.18.

Can you try checking whether NTP is the culprit?

----------

## guero61

No go - turned off NTP, and it still happened.

I had just turned off TPE (trusted path execution) so I could install app-admin/sudo - within 10 seconds, the panic occurred.

```
sysctl kernel.grsecurity.tpe=0
```

Got some reasonably cleaned up dumps; both are roughly the same:

```
Unable to handle kernel paging request at 0000007800000038 RIP:
[<ffffffff804264b3>]
Oops: 0000 [1] SMP
CPU 0
Pid: 0, comm: swapper Not tainted 2.6.20-hardened-r2 #3
RIP: 0010:[<ffffffff804264b3>]  [<ffffffff804264b3>]
Using defaults from ksymoops -t elf64-x86-64 -a i386:x86-64
RSP: 0018:ffffffff80679ef0  EFLAGS: 00010202
RAX: ffff8102274ac000 RBX: ffff810227d90e30 RCX: 0000000000000018
RDX: 0000000000001000 RSI: ffffffff8061bf00 RDI: ffff810227d90d00
RBP: 0000007800000000 R08: 0000000000000000 R09: ffff810226aa9e20
R10: 0000000000001000 R11: ffffffff8804dee0 R12: 0000000000000355
R13: ffff810227d90d00 R14: 0000000000000356 R15: 0000000000000000
FS:  000030666a47fff0(0000) GS:ffffffff8061a000(0000) knlGS:0000000008110670
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000007800000080 CR3: 0000000225d27000 CR4: 00000000000006e0
stack:  ffff810227d90e30 0000000000000000 ffffffff80673408 0000000000000000
 000000000000000a ffffffff8029649b 0000000000000001 ffffffff8061aad0
 ffffffff80673500 ffffffff203128ef 0000000000000046 ffffffff80679f78
Call Trace:
 [<ffffffff802128ef>]
 [<ffffffff80262ddc>]
 [<ffffffff8026f3ac>]
 [<ffffffff8026f4e9>]
 [<ffffffff8026d340>]
 [<ffffffff802621d1>]
 [<ffffffff8026d369>]
 [<ffffffff8024bcdb>]
 [<ffffffff8063b7fa>]
 [<ffffffff8063b16d>]
Code: 48 8b 45 38 48 8b 5d 00 48 85 c0 74 0b 48 c7 80 30 01 00 00
>>RIP; ffffffff804264b3 <megasas_complete_cmd_dpc+d3/260>   <=====
>>RAX; ffff8102274ac000 <__crc_kmem_cache_free+ffff8101274e32f7/fffffffe802372f7>
>>RBX; ffff810227d90e30 <__crc_kmem_cache_free+ffff810127dc8127/fffffffe802372f7>
>>RSI; ffffffff8061bf00 <irq_desc+1180/10000>
>>RDI; ffff810227d90d00 <__crc_kmem_cache_free+ffff810127dc7ff7/fffffffe802372f7>
>>RBP; 0000007800000000 <__crc_kmem_cache_free+77000372f7/fffffffe802372f7>
>>R09; ffff810226aa9e20 <__crc_kmem_cache_free+ffff810126ae1117/fffffffe802372f7>
>>R11; ffffffff8804dee0 <_end+78536c4/7ee057e4>
>>R13; ffff810227d90d00 <__crc_kmem_cache_free+ffff810127dc7ff7/fffffffe802372f7>
Trace; ffffffff802128ef <__do_softirq+5f/d0>
Trace; ffffffff80262ddc <call_softirq+1c/28>
Trace; ffffffff8026f3ac <do_softirq+3c/90>
Trace; ffffffff8026f4e9 <do_IRQ+e9/100>
Trace; ffffffff8026d340 <default_idle+10/50>
Trace; ffffffff802621d1 <ret_from_intr+0/a>
Trace; ffffffff8026d369 <default_idle+39/50>
Trace; ffffffff8024bcdb <cpu_idle+9b/d0>
Trace; ffffffff8063b7fa <start_kernel+22a/240>
Trace; ffffffff8063b16d <x86_64_start_kernel+16d/180>
Code;  ffffffff804264b3 <megasas_complete_cmd_dpc+d3/260>
0000000000000000 <_RIP>:
Code;  ffffffff804264b3 <megasas_complete_cmd_dpc+d3/260>   <=====
   0:   48 8b 45 38               mov    0x38(%rbp),%rax   <=====
Code;  ffffffff804264b7 <megasas_complete_cmd_dpc+d7/260>
   4:   48 8b 5d 00               mov    0x0(%rbp),%rbx
Code;  ffffffff804264bb <megasas_complete_cmd_dpc+db/260>
   8:   48 85 c0                  test   %rax,%rax
Code;  ffffffff804264be <megasas_complete_cmd_dpc+de/260>
   b:   74 0b                     je     18 <_RIP+0x18>
Code;  ffffffff804264c0 <megasas_complete_cmd_dpc+e0/260>
   d:   48 c7 80 30 01 00 00      movq   $0x0,0x130(%rax)
Code;  ffffffff804264c7 <megasas_complete_cmd_dpc+e7/260>
  14:   00 00 00 00
CR2: 0000007800000038
 <0>Kernel panic - not syncing: Aiee, killing interrupt handler!
```
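A quick sanity check on the dump above, as a rough sketch rather than a diagnosis: the faulting instruction is `mov 0x38(%rbp),%rax`, and RBP holds 0000007800000000, so the address it tried to read should be RBP plus 0x38. The variable names below are made up for illustration; the values are copied from the oops.

```sh
#!/bin/sh
# Values copied from the oops above.
rbp=0x0000007800000000   # RBP at the time of the fault
disp=0x38                # displacement in: mov 0x38(%rbp),%rax

# Effective address the instruction tried to load from.
fault_addr=$(printf '0x%016x' $(( $rbp + $disp )))
echo "$fault_addr"
```

The result matches the address on the first line of the oops and on the trailing `CR2:` line. RBP also looks nothing like the other pointers in the register dump (`ffff8102...`), which would be consistent with megasas_complete_cmd_dpc following a stale or corrupted pointer, though the dump alone doesn't prove that.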

Here's the odd thing - after rebooting I did 'echo 0 > /proc/sys/kernel/grsecurity/tpe', which didn't initiate the panic.  Starting the network card and emerging app-admin/sudo didn't do it either.  Only when I executed:

```
qfile `which sysctl`
```

did I dump, 3:15 after the initial echo command.

[edit]

This has to be a red herring, but it crashed again between 3:15 and 3:20 (195-200 seconds) after disabling TPE.

[/edit]

----------

## moocha

Ugh. I'm afraid shots in the dark are all I have left... Could you try with a non-SMP kernel and/or with the Gentoo LiveCD and/or with a different LiveCD? Let's at least try to broadly narrow it down (if that makes sense).

----------

## guero61

The amd64-minimal-2006.1 LiveCD works just fine; it's definitely hardened-sources, and doesn't seem to be time-based.  TPE was no luck either - booted a kernel without TPE and failed just now, but this time with a "kernel null pointer dereference".

I boot the machine, and if I leave it alone nothing happens.  However, I typically then bring up a network interface (since I have no local distfiles cache) and start installing packages; at some random point shortly thereafter, we panic.

Bringing it back up on a serial console so I can get a dump of things, but this is awfully weird.  Any devs care to comment?

[edit]

Tentatively, it looks like the issue was an interaction between XFS and the way my filesystem was structured.  The root filesystem was EXT3, but http-replicator and portage were running against XFS mounts.  All partitions are out of an LVM root, which in turn actually stems from a dm-crypt volume instead of a hard device.  I won't say 'solved' yet, but I'm down to running validation stress tests.

So, XFS over LVM over dm-crypt over megaraid-sas seems to be a bad thing for the moment.

[/edit]

----------

## guero61

Strike that - XFS only exacerbated the issue, however it did so.  I lasted longer into the stress testing (stress -d 8), but shortly after attempting to emerge --fetch all sources into my shiny new ext3 repository, megasas choked, dumping 24 "megasas: MFI FW status 0x3" messages to syslog before the machine went dead on me.

Last edited by guero61 on Tue Apr 24, 2007 12:57 pm; edited 1 time in total

----------

## moocha

I've tried to replicate your environment (more or less) but without any luck - on the other hand I don't have any MegaRaid hardware...

----------

## guero61

That's the road I'm headed down... the LiveCD uses version 02.00-rc4 of the megaraid_sas driver (2.6.15-gentoo-r5), whereas the current one in the tree is 03.05; Dell is saying the "best" version is 03.09, which they're nice enough to provide as a -src.tar.gz, so I'm going to manually patch it in.  If that still fails, I don't think I'll be able to wedge a 2.6.15 era driver into the 2.6.20-hardened tree, but it might be worth a try.  I've definitely narrowed it down to a disk-load error, which seems to reflect the issues people are complaining of in the linux-poweredge lists.

----------

## moocha

You're most likely correct - it's not very likely a driver for 2.6.15 will cooperate with a 2.6.20 tree.

Let's try to look at the problem in a different fashion, though: Do you need 2.6.20?

----------

## guero61

Nope; I pushed to it assuming the problem might be with the stable 2.6.18-hardened-r6.  I've been running 2.6.20-hardened at home for nearly a month, with no issue.  I could patch the 2.6.18 tree (which has 03.01), but don't see a great deal of extra value in doing that over using 2.6.20; what am I missing?

----------

## moocha

I was thinking along the lines of "let's just use the latest kernel that worked, and screw the newer ones"...

----------

## guero61

This is getting deeper than I'd like...  I thought it was a problem with dm-crypt, but it doesn't seem to be.  Then I thought it was a problem with megasas, but my testing seems to have eliminated that too.  The table of test results below points to *something* that changed between 2.6.16 and 2.6.18...

```
kernel              megasas-commit    megasas      dm-crypt    works
2.6.15-gentoo-r5    2005-11-10        02.00-rc4    1.1.0       yes
2.6.16-hardened-r11 2006-02-28        02.04        1.1.0       yes
2.6.16-hardened-r11 2006-07-02        03.01        1.1.0       yes, with RESETs
2.6.18.8            2006-07-02        03.01        1.1.0       no
2.6.18.8            2006-07-03        03.01        1.1.0       no
2.6.18-hardened-r6  2006-07-03        03.01        1.1.0       no
2.6.20-hardened-r2  current/hand      03.05/03.09  1.3.0       no
```

I tested 2.6.16-hardened-r11 with both its native (02.04) driver and hand-patched with the driver from 2.6.18-hardened-r6 (03.01).  The only difference was that I had to reverse the following patch from 2006-07-03, since the IRQF flags didn't exist in 2.6.16:

```
--- a/drivers/scsi/megaraid/megaraid_sas.c
+++ b/drivers/scsi/megaraid/megaraid_sas.c
@@ -2191,7 +2191,7 @@ megasas_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
        /*
         * Register IRQ
         */
-       if (request_irq(pdev->irq, megasas_isr, SA_SHIRQ, "megasas", instance)) {
+       if (request_irq(pdev->irq, megasas_isr, IRQF_SHARED, "megasas", instance)) {
                printk(KERN_DEBUG "megasas: Failed to register IRQ\n");
                goto fail_irq;
        }
```

I then took 2.6.18.8 (the vanilla-sources base of hardened-sources-2.6.18-r6) and reversed the above patch, so that its megaraid_sas.c was identical to the one in the working, hand-patched 2.6.16-hardened-r11, and it still failed.

The following is the command I use to replicate the issue, and consistently produces the failure within seconds:

```
bonnie++ -c 8 -n 1024 -b -d /mnt/tmp/ -s 16g -x 3 -u portage:portage
```

Every time the failure happens, 1 to N copies of the following message are written to the kernel log, indicating an "invalid parameter":

```
megasas: MFI FW status 0x3
```
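To put a number on "1 to N" for each run, something like the following could count the messages in a saved copy of the kernel log. This is only a sketch: `count_mfi_status` is a hypothetical helper name, and the log path is whatever file the `dmesg` or serial console output was saved to.

```sh
#!/bin/sh
# Hypothetical helper: count "MFI FW status 0x3" lines in a saved kernel log.
# $1 is the path to the log file (for example, output of: dmesg > file).
count_mfi_status() {
    # grep -c prints the number of matching lines; -F treats the pattern
    # as a fixed string, not a regular expression.
    grep -c -F 'megasas: MFI FW status 0x3' "$1"
}

# Example use, after a failed run with the console captured to a file:
#   count_mfi_status /tmp/console.log
```

Comparing the count across kernels and driver versions might show whether the number of messages tracks the number of outstanding commands.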

I smell a bug report, but I don't know what to report - "something changed between 2.6.16 and 2.6.18 and makes dm-crypt over megaraid_sas fail" isn't my idea of a good bug report.

----------

## boniek

If the vanilla kernel fails as well, try using git-bisect to find the patch that introduced this regression. Personally I don't know how to use it, but I'm sure there is plenty of information available online.
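For reference, the basic bisect loop looks roughly like this. It is a sketch under assumptions: it expects to run from the top of a kernel git checkout that carries the `v2.6.16` and `v2.6.18` tags, the helper names `bisect_begin` and `bisect_mark` are made up, and the `GIT` variable exists only so the commands can be dry-run with `GIT=echo`.

```sh
#!/bin/sh
# Sketch only: run from the top of a kernel git checkout.
GIT=${GIT:-git}   # override with GIT=echo to print commands instead

# Start bisecting between a known-good and a known-bad revision.
# usage: bisect_begin <good> <bad>
bisect_begin() {
    good=$1
    bad=$2
    # "git bisect start <bad> <good>" checks out a commit about halfway between.
    "$GIT" bisect start "$bad" "$good"
}

# Record the verdict for the currently checked-out commit.
# usage: bisect_mark good|bad
bisect_mark() {
    case $1 in
        good|bad) "$GIT" bisect "$1" ;;
        *) echo "usage: bisect_mark good|bad" >&2; return 2 ;;
    esac
}

# Typical session (each step means build, boot, run the reproducer):
#   bisect_begin v2.6.16 v2.6.18
#   bisect_mark bad      # kernel panicked under load
#   bisect_mark good     # kernel survived
#   ...repeat until git names the first bad commit, then:
#   git bisect reset
```

Because the failure here is a panic, each step has to be judged by hand after a reboot; with roughly two releases of history to cover, that would likely mean a dozen or more build-and-boot cycles.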

----------

## guero61

I'm trying (learning git/cogito as I go), but this is well beyond my level of knowledge.  In trying to cg-clone the 2.6.18 (linux/kernel/git/stable/linux-2.6.18.y.git) tree, I get failures on 3 commits, resulting in a failed tree:

```
[root@megatest git] cg-clone http://www.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.18.y.git
Initialized empty Git repository in .git/
Fetching head...
Fetching objects...
Getting alternates list for http://www.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.18.y.git/
Also look at http://www.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/
Getting pack list for http://www.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.18.y.git/
Getting index for pack 06f3bd54edf0cfe51294a971c6d1878432858610
Getting pack list for http://www.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/
Getting index for pack 25bdaf46d6823b132b076f6e4467d0607876ca66
Getting index for pack d7c8d1a960522394a6aa0b952bae5bb2c3b49deb
Getting index for pack 26e13df8f754e1521ff927b46e47934f2fbbffb6
Getting index for pack 836947ad3b08080097bf5d471bf1247fa03e1fe4
error: Unable to find b3008f65500fd9350ac45988de66f4dd5249604c under http://www.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.18.y.git/
Cannot obtain needed commit b3008f65500fd9350ac45988de66f4dd5249604c
while processing commit 299a2479bca6211f845158761920ec480f35a229.
progress: 3 objects, 16757 bytes
cg-fetch: objects fetch failed
```

Using HTTP is necessary due to the locked-down network - it's actually going through http_proxy.  I've tried it on an alternate (unproxied) network, but the same error occurs.  I don't want to hijack my own thread for git support, but if anyone has any brilliant ideas, I'm all ears.  I feel like I've waded in way over my head.

----------

## guero61

I figured a few things out, and established that the 'bug' was introduced somewhere between 2.6.16.48 and 2.6.17.14 - precisely which change it was will be an exercise for my emerging git-fu.  I'd submit a bug for Gentoo, but I honestly don't believe it's anything Gentoo-specific, and I'm sure the devs have their hands full pushing 2007.0 out the door...

From everything I've seen, this looks like a race condition - the crash (either a NULL pointer dereference or "Unable to handle kernel paging request") always shows up in megasas_isr, and always under conditions that would put more than one token on its queue with multiple interrupts hitting.  It doesn't help that it's a fast SMP machine.  I know I'm talking to myself by this point, but it at least helps to get things documented.  I just wish I could get a response back from LSI.

----------

