# RAID Problem: RAID keeps failing

## CombinedEffort

I've been running Gentoo under RAID5 for a couple of years now and it's been fine until recently...

About a month ago 2 of the 4 disks in the array got booted for 'some reason'. I believe they got booted out simultaneously. I managed to re-create the array using the original disks and it re-synced and was up and running again. I made a note to 'keep an eye on it'.

Yesterday, the exact same thing happened: two drives got booted from the array simultaneously. I've re-created the array again, but am now looking for a solution. This is my setup:

```
ted ~ # uname -a
Linux ted 2.6.22-gentoo-r8 #1 SMP Sun Nov 18 12:07:07 GMT 2007 i686 Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz GenuineIntel GNU/Linux

ted ~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat Jan  5 17:29:19 2008
     Raid Level : raid5
     Array Size : 585954432 (558.81 GiB 600.02 GB)
  Used Dev Size : 195318144 (186.27 GiB 200.01 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Sun Jan  6 09:07:29 2008
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 64K
           UUID : 476b0f08:fef8e3ab:ed512d01:c611cbfe
         Events : 0.2
    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdc2
       1       8        2        1      active sync   /dev/sda2
       2       8       18        2      active sync   /dev/sdb2
       3       8       50        3      active sync   /dev/sdd2

ted ~ # lspci
00:00.0 Host bridge: nVidia Corporation C55 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:00.2 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:00.3 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:00.4 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:00.5 RAM memory: nVidia Corporation C55 Memory Controller (rev a2)
00:00.6 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:00.7 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.0 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.1 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.2 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.3 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.4 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.5 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.6 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:02.0 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:02.1 RAM memory: nVidia Corporation Unknown device 03bc (rev a1)
00:02.2 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:03.0 PCI bridge: nVidia Corporation C55 PCI Express bridge (rev a1)
00:09.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a1)
00:0a.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a2)
00:0a.1 SMBus: nVidia Corporation MCP55 SMBus (rev a2)
00:0a.2 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:0b.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:0b.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:0d.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:0e.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0e.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0e.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0f.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:0f.1 Audio device: nVidia Corporation MCP55 High Definition Audio (rev a2)
00:17.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:18.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
01:00.0 VGA compatible controller: nVidia Corporation G71 [GeForce 7300 GS] (rev a1)
02:06.0 Multimedia controller: Philips Semiconductors SAA7146 (rev 01)
02:07.0 Multimedia video controller: Conexant CX23880/1/2/3 PCI Video and Audio Decoder (rev 05)
02:07.1 Multimedia controller: Conexant CX23880/1/2/3 PCI Video and Audio Decoder [Audio Port] (rev 05)
02:07.2 Multimedia controller: Conexant CX23880/1/2/3 PCI Video and Audio Decoder [MPEG Port] (rev 05)
02:07.4 Multimedia controller: Conexant CX23880/1/2/3 PCI Video and Audio Decoder [IR Port] (rev 05)
02:08.0 Multimedia video controller: Conexant CX23880/1/2/3 PCI Video and Audio Decoder (rev 03)
02:08.1 Multimedia controller: Conexant CX23880/1/2/3 PCI Video and Audio Decoder [Audio Port] (rev 03)
02:0e.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000 Controller (PHY/Link)
03:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8052 PCI-E ASF Gigabit Ethernet Controller (rev 22)
04:00.0 VGA compatible controller: nVidia Corporation G71 [GeForce 7300 GS] (rev a1)
```

The motherboard is a Gigabyte N680SLI-DQ6. I'm not convinced it's a disk problem, since the 2 disks have 'failed' simultaneously, twice. I'm inclined to think it's a kernel / driver issue, but was wondering if anyone else has experienced this under 2.6.22-gentoo-r8. Any hints for finding out why the disks were booted? Is SMART going to give me any sensible information? ;)

Cheers,

Rich

p.s. I would love to move to linux-2.6.23-gentoo-r5, but I seem to have some nvidia issues with my dual-card setup (hard locks - but that's another story).
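On the question of finding out why disks were booted, a first step is usually to confirm which arrays are degraded. The following is only a minimal sketch, assuming the usual `/proc/mdstat` layout in which a degraded array shows an underscore in its status map (for example `[UU__]`). The function name `degraded_arrays` is made up for illustration. It reports which arrays are missing members; the actual reason for the drop-out still has to come from the kernel log.

```shell
# Print "<array> <status map>" for every md array with a missing member.
# Reads /proc/mdstat by default; a different file can be passed as $1.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # header line: remember the array name
        match($0, /\[[U_]+\]/) {               # status map such as [UUUU] or [UU__]
            map = substr($0, RSTART, RLENGTH)
            if (index(map, "_") > 0) print name, map
        }
    ' "${1:-/proc/mdstat}"
}

# Usage: degraded_arrays
```

No output means every array listed has all of its members present.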

----------

## andreas_st

It would certainly be a good idea to use smartmontools to see if your disks report any errors.

```
emerge smartmontools

# Run both commands once per disk, e.g. with DEVICE=sda
smartctl -s on /dev/$DEVICE   # enable SMART on the disk
smartctl -a /dev/$DEVICE      # print all SMART information
```
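To avoid typing the commands once per disk, something along these lines could loop over the array members. It is a sketch only: `smart_sweep` is a made-up helper name, and the disk names in the usage line are the ones from the `mdadm --detail` output earlier in the thread, so adjust them to your system. It uses `smartctl -H` for the overall health verdict and `smartctl -l error` for the drive's own error log.

```shell
# Run a short SMART check on each disk named on the command line.
# Needs root and the smartmontools package.
smart_sweep() {
    for disk in "$@"; do
        echo "=== /dev/$disk ==="
        smartctl -s on "/dev/$disk" > /dev/null   # make sure SMART is enabled
        smartctl -H "/dev/$disk"                  # overall health self-assessment
        smartctl -l error "/dev/$disk"            # errors the drive has logged itself
    done
}

# Usage (as root): smart_sweep sda sdb sdc sdd
```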

----------

## CombinedEffort

smartctl did show one of the drives with errors and, as I was looking for an excuse, I've splashed out on 6 new 500GB drives for a 2TB array :)

```
ted ankh # mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Thu Jan 10 17:56:35 2008
     Raid Level : raid5
     Array Size : 1953545984 (1863.05 GiB 2000.43 GB)
  Used Dev Size : 488386496 (465.76 GiB 500.11 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Sat Jan 12 10:44:07 2008
          State : active
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 64K
           UUID : 8233ab32:afa1aa7e:04124d60:c920ca15
         Events : 0.10
    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       16        1      active sync   /dev/sdb
       2       8       32        2      active sync   /dev/sdc
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
```

The 6th drive will be a backup / cold spare.

Cheers,

Rich

----------

## CombinedEffort

*sigh*

Well, the new disks / cables didn't solve the problem - had 2 disks kicked out of the array AGAIN last night.

I'm going to try going back to a slightly older kernel revision - maybe 2.6.22-gentoo-r5 - and see if that stabilises things.

Rich

----------

## CombinedEffort

I decided to rebuild the server using 2.6.24 on a spare disk and mount the RAID as a data volume, i.e.

```
/dev/sda1               /               reiserfs        noatime         0 1
/dev/md0                /data           reiserfs        auto,noatime    0 0
```

Previously, I was booting into the RAID, so when the RAID was lost, so were the logs. Now the logs are off-RAID, so I can see the exact error messages:

```
Mar 31 16:13:00 ted ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Mar 31 16:13:00 ted ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Mar 31 16:13:00 ted res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 31 16:13:00 ted ata4.00: status: { DRDY }
```

I've uploaded my syslog here:

http://www.combinedeffort.com/messages2.bz2

The log gets spammed with the following message immediately before the RAID loses 2 disks:

```
saa7146 (0) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer
```
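To see how the saa7146 messages and the ATA timeouts are ordered around a drop-out, a filter like the one below might help. It is a rough sketch: the function names are invented here, the patterns are taken from the log lines quoted above, and the log path in the usage line (`/var/log/messages`) is an assumption about where syslog writes on this box.

```shell
# Print only the saa7146 i2c timeouts and the libata exception/timeout lines
# from a syslog file, in file order, so their sequence is easy to read.
raid_dropout_timeline() {
    grep -E 'saa7146_i2c_writeout|ata[0-9]+(\.[0-9]+)?: exception|Emask 0x[0-9a-f]+ \(timeout\)' "$1"
}

# Count the lines in file $2 that match the extended regex $1.
count_matches() {
    grep -c -E "$1" "$2"
}

# Usage:
#   raid_dropout_timeline /var/log/messages
#   count_matches 'saa7146_i2c_writeout' /var/log/messages
```

If the saa7146 lines consistently start before the first `ataN.NN: exception` line, that would at least be consistent with the shared interrupt going wrong first, though it would not prove which device is at fault.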

I don't know whether the card with the saa7146 chip (a DVB-T card) is a cause of the IRQ failure or just another symptom of it. Looking at the interrupt assignments reveals:

```
ted ~ # cat /proc/interrupts
           CPU0       CPU1
  0:        130          0   IO-APIC-edge      timer
  1:          2          0   IO-APIC-edge      i8042
  2:          0          0    XT-PIC-XT        cascade
  5:   21991437          0   IO-APIC-fasteoi   sata_nv, ohci_hcd:usb2, cx88[1], saa7146 (0), cx88[1], cx88[1]
  6:          5          0   IO-APIC-edge      floppy
  8:          2          0   IO-APIC-edge      rtc
 10:    3652902          0   IO-APIC-fasteoi   sata_nv, ehci_hcd:usb1, HDA Intel, nvidia
 11:    6554261          0   IO-APIC-fasteoi   sata_nv, ohci1394, cx88[0], cx88[0]
221:    4601040          0   PCI-MSI-edge      eth0
NMI:          0          0
LOC:   10158193   10158223
ERR:          1
MIS:          0
```

The sata_nv and saa7146 share an interrupt (5). It looks like that interrupt is failing, causing all sorts of problems...
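For anyone wanting to check their own box for the same kind of sharing, here is a small sketch that lists the IRQ lines a given driver shares with other handlers. It assumes the `/proc/interrupts` layout shown above, where handlers on one line are separated by commas, and `shared_irqs` is just an illustrative name.

```shell
# For driver name $1, print each IRQ it shares with at least one other handler,
# along with the total number of handlers on that line.
# Reads /proc/interrupts by default; a different file can be passed as $2.
shared_irqs() {
    awk -v drv="$1" '
        index($0, drv) > 0 {
            line = $0
            n = gsub(/,/, ",", line)      # commas on the line = handlers - 1
            if (n > 0) {
                irq = $1
                sub(/:$/, "", irq)        # "5:" -> "5"
                print "IRQ " irq ": " (n + 1) " handlers"
            }
        }
    ' "${2:-/proc/interrupts}"
}

# Usage: shared_irqs sata_nv
```

Fed the listing above, this would report IRQ 5 with 6 handlers and IRQs 10 and 11 with 4 each, so every sata_nv line on this board is shared with something.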

Any thoughts?

Cheers,

Rich.

p.s. The cat /proc/interrupts was done after a reboot back into a non-IRQBALANCE kernel, hence there are no interrupts on CPU1.

----------

## paulbiz

I have an NVidia 680i-based motherboard with RAID5 in Gentoo (combination of nforce and silicon image SATA controllers onboard) and I also had a TV card with the exact same chip (saa7146) in here. The TV card caused so many system lockups and random problems I removed it from my system. Everything has been running well without the TV card. That may just be a coincidence, but I thought I should let you know my story in case it helps!

----------

## CombinedEffort

My current suspect (and there have been many) is now the IRQBALANCE ('Processor type and features','Enable kernel irq balancing') kernel option.

2.6.22-r10 had been rock solid with this option off (the default there).

2.6.23 and 2.6.24 had been flaky as hell with this option on (the default there).

I turned on IRQBALANCE in 2.6.22-r10 last night and the RAID bombed twice within about 3 hours trying to record a 30 min DVB-T show...

I turned it off again and so far (10 hours, including a 3 hour DVB-T recording) it seems stable.
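A quick way to confirm which way a given kernel tree is configured, sketched under two assumptions: that the menu entry above corresponds to the `CONFIG_IRQBALANCE` symbol on 32-bit x86 kernels of this era, and that the kernel sources live under `/usr/src/linux`. The function name is made up.

```shell
# Report whether in-kernel IRQ balancing is switched on in a kernel .config.
# Checks /usr/src/linux/.config by default; a different file can be passed as $1.
irqbalance_state() {
    cfg="${1:-/usr/src/linux/.config}"
    if grep -q '^CONFIG_IRQBALANCE=y' "$cfg"; then
        echo "enabled"
    else
        echo "disabled"
    fi
}

# Usage: irqbalance_state
```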

Are there known issues with IRQBALANCE?

Cheers,

Rich.

----------

## paulbiz

I don't have a kernel option for IRQ balancing (amd64 2.6.24) but I am using sys-apps/irqbalance ... I never thought to try disabling it.

There is a document in the kernel sources that explains how to set CPU interrupt affinity for a specific IRQ:

/usr/src/linux/Documentation/IRQ-affinity.txt

Perhaps there are some tricks in there you can try with your DVB card's IRQ to make it better behaved.
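As a rough illustration of what that document describes: each IRQ has a `/proc/irq/<n>/smp_affinity` file that takes a hexadecimal CPU bitmask. The helper names below are invented for this sketch, and IRQ 5 in the usage lines is simply the shared line from the `/proc/interrupts` listing earlier in the thread.

```shell
# Hex bitmask with only the bit for the given CPU number set
# (CPU0 -> 1, CPU1 -> 2, CPU2 -> 4, ...).
cpu_mask() {
    printf '%x\n' $((1 << $1))
}

# Pin IRQ $1 to the single CPU $2 by writing the mask to smp_affinity.
# Needs root.
pin_irq() {
    cpu_mask "$2" > "/proc/irq/$1/smp_affinity"
}

# Usage (as root), e.g. to keep IRQ 5 on CPU0:
#   pin_irq 5 0
#   cat /proc/irq/5/smp_affinity
```

Affinity is set per IRQ line, not per device, so on a shared line this moves sata_nv and saa7146 together. A running irqbalance daemon may also rewrite the mask later.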

----------

## CombinedEffort

Well, the machine has been running for about a week on 2.6.24-r4 without any RAID issues, so I guess turning off the IRQBALANCE option did the trick.

I did use the sys-apps/irqbalance app for quite a while on 2.6.22 without any problems, so I don't know how that implementation differs from the kernel one...

Cheers,

Rich.

----------

