# Disks failed in RAID, can't make sense of the situation

## Zolcos

I noticed some disk failures in my array. Here's the output of `cat /proc/mdstat`:

```
Personalities : [raid1] [raid6] [raid5] [raid4]

md3 : active raid6 sdc2[0](F) sdg2[4] sdh2[5](F) sdi2[6](F) sdj2[7] sdf2[3] sde2[2] sdd2[1]

      17575581696 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/5] [_UUUU__U]

md2 : active raid6 sdc1[0] sdg1[4] sdh1[5] sdi1[6](F) sdj1[7] sdf1[3] sde1[2] sdd1[1]

      5993472 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/7] [UUUUU_UU]

md1 : active raid1 sdb2[1] sda2[0]

      60345280 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]

      61376 blocks [2/2] [UU]
```

I panicked at first because of 3 failures in a raid6. However, it says "active", not degraded, which makes me think the errors happened separately with a successful rebuild in between. So I just have to replace the faulty disks. But if that is the case, why am I still getting filesystem errors? For example:

```
aeheathc@castellan ~ $ sudo cat /var/log/messages

Password:

sudo: unable to open /var/db/sudo/aeheathc/0: Read-only file system

cat: /var/log/messages: Input/output error
```

deluge is also giving me a mix of "file too short" and "Input/output error". How can I find out what is still wrong?

When I set up mdadm I remember having it send me a test message to make sure it could notify me of errors. I still have the test message but none were sent for these drive failures -- is it likely a config error?

----------

## NeddySeagoon

Zolcos,

First things first. Don't panic! Next, don't do anything that involves writes.

The good news first:

```
md2 : active raid6 sdc1[0] sdg1[4] sdh1[5] sdi1[6](F) sdj1[7] sdf1[3] sde1[2] sdd1[1]

      5993472 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/7] [UUUUU_UU] 
```

md2 is only down one drive and may respond to rebuilding:

```
sdi1[6](F)
```

Try 

```
mdadm --remove /dev/md2 /dev/sdi1

mdadm --re-add /dev/md2 /dev/sdi1
```

The re-add will probably fail. If so,

```
mdadm --add /dev/md2 /dev/sdi1
```

will add /dev/sdi1 back as if it were a brand new drive and force a rebuild.

md3 is dead. As it's flagged active, it will also be read-only by now, not that anything can be read from or written to it with three drives missing.
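The quickest way to read that verdict off `/proc/mdstat` is the `[expected/working]` pair, for example `[8/5]`. Here is a rough sketch that turns it into ok/degraded/failed per array. The function name `mdstat_summary` is made up for illustration, and it assumes the simple `mdN : active <level> ...` layout shown above (raid10 and inactive arrays are not handled).

```shell
#!/bin/sh
# Summarise /proc/mdstat-style text from stdin: one line per array.
# Assumes the "mdN : active <level> ..." layout shown in this thread.
mdstat_summary() {
    awk '
        /^md[0-9]+ :/ { name = $1; level = $4 }
        # Status lines carry "[expected/working]", e.g. "[8/5]".
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            missing = n[1] - n[2]
            # Losses each level survives: raid6 two, raid1 all but one, raid4/5 one.
            if (level == "raid6")      tolerated = 2
            else if (level == "raid1") tolerated = n[1] - 1
            else                       tolerated = 1
            if (missing == 0)              state = "ok"
            else if (missing <= tolerated) state = "degraded"
            else                           state = "failed"
            printf "%s %s %d/%d %s\n", name, level, n[2], n[1], state
        }
    '
}

# Two arrays taken from the mdstat output at the top of the thread.
mdstat_summary <<'EOF'
md3 : active raid6 sdc2[0](F) sdg2[4] sdh2[5](F) sdi2[6](F) sdj2[7] sdf2[3] sde2[2] sdd2[1]
      17575581696 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/5] [_UUUU__U]
md2 : active raid6 sdc1[0] sdg1[4] sdh1[5] sdi1[6](F) sdj1[7] sdf1[3] sde1[2] sdd1[1]
      5993472 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/7] [UUUUU_UU]
EOF
```

On a live box the same function would be fed with `mdstat_summary < /proc/mdstat`, which only reads.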

It looks like it came up in degraded mode, then lost another drive. There will be useful information in dmesg, so preserve all of dmesg if you can. If you have cycled the power, the dmesg will be lost.

Boot any way you like, but do not let any writes occur to the components of md3. If the box is still up, that's OK too.

```
md3 : active raid6 sdc2[0](F) sdg2[4] sdh2[5](F) sdi2[6](F) sdj2[7] sdf2[3] sde2[2] sdd2[1]

      17575581696 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/5] [_UUUU__U]
```

Run 

```
mdadm -E /dev/sd[cdefghij]2
```

and post it.

We are looking at the event counts and update dates/times for your raid elements. Ideally we will find six drives with identical event counts and update dates, which we may be able to use to salvage your data and even your raid set.
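If the full `mdadm -E` dump is hard to scan, something like the following reduces it to one line per member. It is only a sketch: `examine_summary` is a made-up name, the sample values are illustrative, and it assumes each member's section starts with a `/dev/...:` line, as metadata 1.2 output normally does.

```shell
#!/bin/sh
# Reduce `mdadm -E` output (read from stdin) to "events device update-time",
# highest event count first. Assumes each member starts with a "/dev/...:" line.
examine_summary() {
    awk '
        /^\/dev\//      { dev = $1; sub(/:$/, "", dev) }
        /Update Time :/ { when = $0; sub(/^.*Update Time : */, "", when) }
        /Events :/      { print $NF, dev, when }
    ' | sort -k1,1nr
}

# Abridged sample; on the real box: mdadm -E /dev/sd[cdefghij]2 | examine_summary
examine_summary <<'EOF'
/dev/sdc2:
    Update Time : Thu Dec 20 21:38:10 2012
         Events : 8150
/dev/sdd2:
    Update Time : Fri Dec 28 01:10:44 2012
         Events : 195774
EOF
```

Members that share the top event count are the ones that were still in sync when the array stopped. Anything far below that is stale.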

Post your dmesg if it still contains information about the raid failures.

----------

## Zolcos

```
castellan aeheathc # mdadm --remove /dev/md2 /dev/sdi1

mdadm: hot removed /dev/sdi1 from /dev/md2

castellan aeheathc # mdadm --re-add /dev/md2 /dev/sdi1

mdadm: --re-add for /dev/sdi1 to /dev/md2 is not possible

castellan aeheathc # mdadm --add /dev/md2 /dev/sdi1

mdadm: failed to write superblock to /dev/sdi1
```

The machine has stayed up since the problem happened, so dmesg should have what we need. I'm not too worried about md2 since it's swap. md3 is the important one.

dmesg: http://pastebin.com/kbiN1Jpk

mdadm: http://pastebin.com/xB5aBHke

----------

## NeddySeagoon

Zolcos,

```
[    9.382511] md/raid:md3: device sdc2 operational as raid disk 0

[    9.382514] md/raid:md3: device sdg2 operational as raid disk 7

[    9.382516] md/raid:md3: device sdh2 operational as raid disk 6

[    9.382517] md/raid:md3: device sdi2 operational as raid disk 5

[    9.382518] md/raid:md3: device sdj2 operational as raid disk 4

[    9.382520] md/raid:md3: device sdf2 operational as raid disk 3

[    9.382521] md/raid:md3: device sde2 operational as raid disk 2

[    9.382522] md/raid:md3: device sdd2 operational as raid disk 1

[    9.383320] md/raid:md3: allocated 8520kB

[    9.383382] md/raid:md3: raid level 6 active with 8 out of 8 devices, algorithm 2
```

This shows that md3 came up with all 8 drives at boot time (so did md2).

Things start going wrong here:

```
[2060152.169192] ata3.00: exception Emask 0x10 SAct 0x7fd SErr 0x400100 action 0x6 frozen

[2060152.169196] ata3.00: irq_stat 0x08000000, interface fatal error

[2060152.169198] ata3: SError: { UnrecovData Handshk }

[2060152.169201] ata3.00: failed command: WRITE FPDMA QUEUED

[2060152.169206] ata3.00: cmd 61/80:00:a0:b4:6b/00:00:38:01:00/40 tag 0 ncq 65536 out

[2060152.169206]          res c0/00:40:90:19:b0/00:00:38:01:00/40 Emask 0x12 (ATA bus error)

[2060152.169208] ata3.00: status: { Busy }

[2060152.169210] ata3.00: failed command: WRITE FPDMA QUEUED

[2060152.169214] ata3.00: cmd 61/a0:10:00:b6:6b/00:00:38:01:00/40 tag 2 ncq 81920 out

[2060152.169214]          res c0/00:40:90:19:b0/00:00:38:01:00/40 Emask 0x12 (ATA bus error)
```

and by the time we get to 

```
[2060152.592601]  --- level:6 rd:8 wd:7

[2060152.592605]  disk 1, o:1, dev:sdd2

[2060152.592606]  disk 2, o:1, dev:sde2

[2060152.592608]  disk 3, o:1, dev:sdf2

[2060152.592609]  disk 4, o:1, dev:sdj2

[2060152.592611]  disk 5, o:1, dev:sdi2

[2060152.592612]  disk 6, o:1, dev:sdh2

[2060152.592613]  disk 7, o:1, dev:sdg2
```

sdc2 has been dropped. Next out is sdi2 (disk 5):

```
[2186409.195810]  --- level:6 rd:8 wd:6

[2186409.195811]  disk 1, o:1, dev:sdd2

[2186409.195813]  disk 2, o:1, dev:sde2

[2186409.195814]  disk 3, o:1, dev:sdf2

[2186409.195815]  disk 4, o:1, dev:sdj2

[2186409.195817]  disk 6, o:1, dev:sdh2

[2186409.195818]  disk 7, o:1, dev:sdg2
```

And that's you down to six out of eight drives. Then down to 5, with sdh2 (disk 6) gone:

```
[2199107.057725] RAID conf printout:

[2199107.057729]  --- level:6 rd:8 wd:5

[2199107.057731]  disk 1, o:1, dev:sdd2

[2199107.057733]  disk 2, o:1, dev:sde2

[2199107.057734]  disk 3, o:1, dev:sdf2

[2199107.057735]  disk 4, o:1, dev:sdj2

[2199107.057737]  disk 7, o:1, dev:sdg2
```

All the other md3 errors are due to the raid not having enough drives to support any I/O at all.

It's unlikely that you have had 3 drives fail so close together. A single point of failure is far more likely, so before we try to do anything with the raid itself, let's look for other causes.

Are the three failed drives on the same interface card, on the same PSU in another rack ... do they have anything in common at all?

If so, suspect that before your drives.

Get smartmontools. Run `smartctl -a /dev/...` on each drive, including the 'good' ones. This dumps the drive's internal error log to the console.

Here is an example from one of my own drives. According to the SMART data this drive still works ...

```

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0

  2 Throughput_Performance  0x0005   131   131   054    Pre-fail  Offline      -       119

  3 Spin_Up_Time            0x0007   120   120   024    Pre-fail  Always       -       480 (Average 476)

  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       1408

  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       40

  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0

  8 Seek_Time_Performance   0x0005   123   123   020    Pre-fail  Offline      -       34

  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       9532

 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1407

192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       1408

193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       1408

194 Temperature_Celsius     0x0002   214   214   000    Old_age   Always       -       28 (Min/Max 17/48)

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       40

197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       3
```

but it has 40 reallocated sectors ... When I retired it a few weeks ago, it had 56 sectors it could not read. They are 'fixed' now, by writing to the unreadable sectors, but the data that was there was never recovered.
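To compare a stack of drives quickly, it can help to pull out just the raw values that matter most: reallocated, pending and offline-uncorrectable sectors point at the media, while UDMA CRC errors usually point at cabling or the interface. A small sketch follows. `smart_key_attrs` is a made-up name and it assumes the ten-column attribute table layout shown above, as printed by `smartctl -A`.

```shell
#!/bin/sh
# Print name=raw_value for the attributes most tied to media and cabling health.
# Reads the attribute table of `smartctl -A` from stdin.
smart_key_attrs() {
    awk 'NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198 || $1 == 199) { print $2 "=" $10 }'
}

# Rows copied from the table above; on a real drive: smartctl -A /dev/sdX | smart_key_attrs
smart_key_attrs <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       40
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       9532
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       3
EOF
```

A non-zero raw value on attribute 5 or 197 is worth watching over time even while the normalised VALUE column still reads 100.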

I found out about it while trying to rebuild a raid5 where it was a 'good' drive. Read the smartmontools man page. It's worth trying the short test and the long test: other than the commands sent to the drive, it's all internal to the drive.
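For reference, the self-tests are started with `smartctl -t short` or `smartctl -t long` and the results are read back with `smartctl -l selftest`. The loop below is a sketch only: the helper names are made up, the device list is the one from this thread, and it defaults to printing the commands (`DRY_RUN=1`) rather than running them.

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints the commands instead of running them.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Whole-disk devices behind md2/md3 in this thread.
DISKS="/dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj"

start_selftests() {   # $1 = short|long, remaining args = devices
    kind=$1; shift
    for disk in "$@"; do
        run smartctl -t "$kind" "$disk"
    done
}

show_selftest_logs() {   # read back results once the tests have had time to finish
    for disk in "$@"; do
        run smartctl -l selftest "$disk"
    done
}

start_selftests short $DISKS
```

The tests run inside the drive firmware, so the command returns immediately. The long test can take hours on large drives.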

So much for the what went wrong and when.  Now for the prognosis.

The array was last working with

```
[2186409.195811]  disk 1, o:1, dev:sdd2

[2186409.195813]  disk 2, o:1, dev:sde2

[2186409.195814]  disk 3, o:1, dev:sdf2

[2186409.195815]  disk 4, o:1, dev:sdj2

[2186409.195817]  disk 6, o:1, dev:sdh2

[2186409.195818]  disk 7, o:1, dev:sdg2
```

```
   Update Time : Thu Dec 20 21:38:10 2012

       Checksum : 3e9f91bd - correct

         Events : 8150
```

```
    Update Time : Fri Dec 28 01:10:44 2012

       Checksum : e3a05073 - correct

         Events : 195774
```

Those two are just over 7 days apart.

```
    Update Time : Fri Dec 28 01:10:44 2012

       Checksum : bd951c28 - correct

         Events : 195774
```

```
    Update Time : Fri Dec 28 01:10:44 2012

       Checksum : abe04186 - correct

         Events : 195774
```

```
    Update Time : Fri Dec 28 01:10:44 2012

       Checksum : bb28c4c0 - correct

         Events : 195774
```

```
    Update Time : Fri Dec 28 01:10:44 2012

       Checksum : fc8c9555 - correct

         Events : 195774
```

Well, it was all looking good until 

```
mdadm: No md superblock detected on /dev/sdh2.

mdadm: No md superblock detected on /dev/sdi2.
```

You have 5 useful drives. sdc2 will be too far out of date to be very useful, so you need at least one of /dev/sdh2 or /dev/sdi2.

Don't give up yet - maybe those drives have a connection problem. For example, can you read anything at all from the drives? Smartmontools will help there.

There is still hope, even if the superblocks are really gone.

Do not attempt to assemble the raid, do not do anything that will write to the drives. Play with smartmontools and look for a single point of failure.
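A read that discards its output is one way to check whether a drive returns any data without touching what is on it. This is a sketch under assumptions: `read_probe` is a made-up helper, the 64 MiB size is arbitrary, and it prints the commands by default (`DRY_RUN=1`).

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints the commands instead of running them.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Read the first 64 MiB of each device and throw it away. of=/dev/null means
# nothing is written to the drive; a failing read shows up as a dd I/O error.
read_probe() {
    for dev in "$@"; do
        run dd if="$dev" of=/dev/null bs=1M count=64
    done
}

read_probe /dev/sdh /dev/sdh2 /dev/sdi /dev/sdi2
```

If the whole-disk read works but the partition read does not, the partition table on that drive is the next thing to look at.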

----------

## Zolcos

/dev/sdh and /dev/sdi refuse to cooperate with smartctl and give this error:

```
Short INQUIRY response, skip product id

A mandatory SMART command failed: exiting.
```

These drives are spread across two controllers, but the problematic ones are not isolated. /dev/sda through /dev/sdf are on the primary controller and /dev/sdg through /dev/sdj are on an embedded Intel C606 SAS controller.

It's possible /dev/sdh and /dev/sdi are on the same power cable which is faulty, but it will be about a week before I can return to the physical location to check that.

Anyway, the short test for c-g and j passed without errors. I'll update with full logs when the long tests finish.

----------

## NeddySeagoon

Zolcos,

Do you have any /dev nodes for /dev/sdh or /dev/sdi ?

If so, the drives have spun up and the partition tables have been read, well, if you have the nodes for the partitions anyway.

It looks like /dev/sdh2 was the last drive to drop out, when you went from 6 to 5 drives. That's really the one we want.

----------

## Zolcos

Good point. I do have /dev nodes for sdh and sdi including the partitions /dev/sdh1 etc.

smartctl results: http://pastebin.com/5HKzB9KR

sdh and sdi are detected incorrectly with what looks like random information in the capacity and vendor fields.

----------

## NeddySeagoon

Zolcos,

No errors - the tests all passed and no reallocated sectors either.

We need to wait until you can get physical access to see what's happened to /dev/sdh and /dev/sdi.

There's no hurry from my end. I'm around most days and your post will pop up in my 'your posts' search whenever you get more information, so I won't miss it.

----------

## NeddySeagoon

Zolcos,

Random thought for the day ... maybe you have suffered a brownout or two?

The incoming mains power drops a few volts but not enough to trigger a reset. The box PSU does not quite hold the output voltage within limits and random things happen, like drives being partially reset.

This may make them report odd values and odd data until they are properly power cycled.

Do you have a UPS?

Is it a proper UPS, where the battery floats across the supply and the inverter always supplies the load, like a laptop, or is it one of those that monitor the supply and switch in quickly when the supply fails?

The latter is good for total supply failures but behaviours vary under brownout conditions. It's my opinion that the latter type are next to useless.

Brownouts should be rare and you appear to have experienced two, which makes it unlikely. However, if /dev/sdh and /dev/sdi both look like they are back to normal after a power-off reset, it looks like the explanation is power transients somewhere.

----------

## Zolcos

This is my UPS: http://www.newegg.com/Product/Product.aspx?Item=N82E16842101174

I went through the apcupsd log and there are no power events, only successful tests.

I avoided rebooting the machine so far just in case. I tried it just now and I can no longer connect with SSH so I don't think it came back up.

----------

## NeddySeagoon

Zolcos,

Your UPS is a line-interactive type. It's halfway between the standby type, which is next to useless, and the double-conversion type, which is the best.

Provided the UPS works, it's unlikely an external power issue caused problems.

When did you test the UPS last ? 

If you need anything on the dead raid to boot, the box won't come back up because the raid won't assemble.

If nothing there is needed, it sounds like you have another problem.

----------

## Zolcos

The last scheduled test noted in the log was Dec 19 2012.

md3 contains both /var and /tmp so I think that's the only problem.

----------

## NeddySeagoon

Zolcos,

It will be missing /var.  /tmp is flushed on reboot and will be ok on / when the mount fails.

----------

## Zolcos

I'm back with the box now. It actually booted successfully without a /var, but ssh failed to start (not surprising). There aren't many logs available without /var but I noticed that smartctl now returns sane results for /dev/sdh and /dev/sdi. Will post logs for them when the tests finish.

----------

## NeddySeagoon

Zolcos,

That supports the 'brownout' theory.

How many drives do you have per power cable from the PSU?

More than 2 is probably the wrong answer.

What do /dev/sdi and /dev/sdh have in common by way of power wiring?

Run the 

```
mdadm -E /dev/sd[cdefghij]2
```

again and post the results.

----------

## Zolcos

Finally got complete results from smartctl for sdh and sdi, and tests passed: http://pastebin.com/quXynRgj

Here is the mdadm output you asked for: http://pastebin.com/1AMYERq2

About power cabling: I compared the serial numbers given by smartctl to the drives in the case and where they are physically connected, and found this:

Cable 1: 8 fans

Cable 2: j, a, b

Cable 3: g, h, i

Cable 4: f, e, d, c

(sda and sdb are SSDs)

It's worth mentioning that I started an operation with a moderate amount of sustained i/o that ran for 3 days straight until this happened. I guess with sufficient load that many drives draw enough power to be too much for their cable?

The psu is modular and has 2 unused sockets so I can spread them around to have a max of 2 HDDs per cable if I can find more sata cables for it.

Single rail btw. http://www.newegg.com/Product/Product.aspx?Item=N82E16817121088

----------

## NeddySeagoon

Zolcos,

A single 12v rail PSU is a bad choice. It's not that it cannot supply the claimed static loads; it's the required dynamic performance that's very difficult to achieve.

The CPU part of the 12v load can go from almost nothing to 10A in a single CPU clock. A single rail supply has to provide for that dynamic load change over the entire load envelope. It also has to do the reverse, when the load drops by 10A.

A dual 12v PSU has an easier life. It still has to cope with the 10A step but the total 12v load per converter is less, so it's easier to improve the dynamic regulation. Further, cross coupling between the 12v rails is low, so that even if the CPU 12v goes out of spec, there is no impact on the HDD. It doesn't even hurt CPU operation, as the 12v is converted to about 1v on the motherboard for use by the CPU. This 1v converter on the motherboard has a much more stringent task. The 1v is provided at around 100A for the CPU and it has to be held within a few mV over the entire CPU power dynamic range.

Don't replace the PSU now - but choose a dual (at least) rail PSU next time.

Do split the HDD power up ... no more than two drives per cable.

Now to your raid itself:

```
   Update Time : Thu Dec 20 21:38:10 2012

       Checksum : 3e9f91bd - correct

         Events : 8150
```

Look at the date and event count.

That's a lost cause.

```

    Update Time : Sun Dec 30 02:39:09 2012

       Checksum : e3a308ca - correct

         Events : 195932
```

You need 5 more drives like that

```
    Update Time : Sun Dec 30 02:39:09 2012

       Checksum : bd97d47f - correct

         Events : 195932
```

```

    Update Time : Sun Dec 30 02:39:09 2012

       Checksum : abe2f9dd - correct

         Events : 195932
```

```
    Update Time : Sun Dec 30 02:39:09 2012

       Checksum : bb2b7d17 - correct

         Events : 195932
```

```

    Update Time : Sat Dec 22 12:17:56 2012

       Checksum : 2349115d - correct

         Events : 195678
```

That's a week out of date. You really don't want to use that.

```

    Update Time : Sat Dec 22 08:45:51 2012

       Checksum : fc31a37e - correct

         Events : 179307
```

```

    Update Time : Sun Dec 30 02:39:09 2012

       Checksum : fc8f4dac - correct

         Events : 195932
```

It's now a question of how lucky you are. We know that the raid went down at Sat Dec 22 12:17:56 2012, as that's the date on /dev/sdh2, whose loss left the raid with only 5 drives, so no further useful accesses were possible. Notice the difference in event counts.

It's possible to recreate the raid, mount whatever you get read only and look around.

Here's the theory. You choose your 'best' 6 drives and use `mdadm --create --assume-clean` with the components in the right order (or `missing`), just as you did when you created the array. This writes new raid superblocks on the six drives and starts the raid in degraded mode.

No rebuild will commence, for two reasons: there are no redundant drives to rebuild onto and you have used `--assume-clean`. The detail is here. Read it and understand it before you do anything you might regret.

The location of the raid superblock varies from superblock version to superblock version - you must get that right. All the other data you need to feed mdadm is in your `mdadm -E` pastebin. Keep a copy safe.

Be sure to use mount with the read-only option to look around. The differences in event counts mean the raid is not consistent, so if mount works, you want to copy your data off somewhere, examine it, then recreate the raid from the recovered copy.
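To make the theory concrete, here is what such a command line could look like for this array. Treat it as a sketch that only prints the commands and never runs them. The slot order is an assumption taken from the boot-time dmesg earlier in the thread (sdc2=0, sdd2=1, sde2=2, sdf2=3, sdj2=4, sdi2=5, sdh2=6, sdg2=7) and should be confirmed against the `Device Role` lines of `mdadm -E`. The `left-symmetric` layout corresponds to "algorithm 2" in mdstat, and `/mnt/recovery` is a hypothetical mount point. The Data Offset reported by `mdadm -E` must match as well. If your mdadm picks a different default, newer releases have a `--data-offset` option for that.

```shell
#!/bin/sh
# Slots 0..7 in order. The two stalest members, sdc2 (slot 0) and sdi2 (slot 5),
# are replaced by the keyword "missing" so nothing is resynced onto them.
MEMBERS="missing /dev/sdd2 /dev/sde2 /dev/sdf2 /dev/sdj2 missing /dev/sdh2 /dev/sdg2"

# Print (do not run) the create command with the geometry from /proc/mdstat:
# raid6, 8 devices, 512k chunk, algorithm 2, metadata 1.2.
recreate_cmd() {
    echo mdadm --create /dev/md3 --assume-clean \
        --metadata=1.2 --level=6 --raid-devices=8 \
        --chunk=512 --layout=left-symmetric $MEMBERS
}

recreate_cmd
# Then look around without writing; /mnt/recovery is a hypothetical mount point.
echo mount -o ro /dev/md3 /mnt/recovery
```

Printing first and comparing every value against the saved `mdadm -E` output is deliberate: `--create` overwrites the superblocks, so a wrong order or offset is hard to undo without images of the drives.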

----------

## Zolcos

Thanks. I think I understand the recreation process pretty well from that link now.

I'm about to build a different computer with fewer devices that could use this PSU so replacing it is a reasonable option.

Once I rescue the data, is there any way to detect which files are potentially corrupted?

----------

## NeddySeagoon

Zolcos,

There is no automatic way that I know of.

Corruption takes many forms too. It depends exactly what is corrupted.

If it's your filesystem metadata, things may vanish, deleted items may reappear, or some filesystem blocks may even belong to more than one file.

fsck will find some of the above items but that's a last resort, as it writes to your filesystem. Worse, it makes some guesses about what's 'right' when there is conflicting metadata, so there is no undo.

fsck makes the filesystem metadata self-consistent; it says nothing about user data. Be warned that self-consistent does not mean correct.

User data will need to be 100% audited. 

It boils down to how much effort you want to put into data recovery.  Only you can judge that.

The really safe next step is to image all 6 drives before you try to recreate the array. That requires sufficient HDD space for the images. It gets you an undo if you mess up, but you won't know what's there until you try.
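As a rough idea of what imaging involves, the sketch below prints a `dd` command per member and estimates the space needed from the md3 size in `/proc/mdstat`. The helper name and the `/mnt/images` target are hypothetical, the target must be on separate healthy storage, and the figures ignore the small per-member metadata overhead. GNU ddrescue is a common alternative when a drive has unreadable sectors.

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints the commands instead of running them.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

IMAGE_DIR=/mnt/images   # hypothetical mount point on separate, healthy storage

# Copy each member partition to an image file; noerror,sync keeps going past
# read errors and pads the unreadable block with zeros.
image_members() {
    for part in "$@"; do
        run dd if="/dev/$part" of="$IMAGE_DIR/$part.img" bs=1M conv=noerror,sync
    done
}

# md3 holds 17575581696 KiB across 6 data members (8 drives minus 2 parity).
ARRAY_KIB=17575581696
echo "per member: about $(( $ARRAY_KIB / 6 / 1024 / 1024 )) GiB"
echo "six images: about $(( $ARRAY_KIB / 1024 / 1024 )) GiB"

image_members sdd2 sde2 sdf2 sdg2 sdh2 sdj2
```

That works out to a bit under 3 TB per member, so the six images need roughly as much space as the whole array holds.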

----------

## Zolcos

After recreating the array, all the data seemed to be there. I copied it off, and made a new array with those disks with a new filesystem and copied the data back on to it.

After a somewhat extensive audit, everything seems to be fine. I guess I was lucky.

Also, I found out what was wrong with the monitoring -- I had it configured correctly but hadn't added the monitoring daemon to RC.

I think I'm good to go now. Thanks for your help.

----------

## NeddySeagoon

Zolcos,

Raid is a reliability enhancement, not a substitute for validated backups.

The theory behind reliability enhancement is flawed too. It assumes that all failures are random.

We all know that some failures are systematic and are built into all drives of the same part number.

There have been some famous cases ... the IBM Deathstar click of death ...

Raid cannot protect against two drives failing in quick succession due to the same systematic failure.

Using drives from different vendors and different batches helps.

I run raid5 most places and have taken a deliberate decision to only back up user data - that's the valuable bit to me.

I can afford a day or two for the system to be down while I reinstall.  None of my stuff is 'production', unless you count my mirror of kernel-seeds.org 

A word of warning about mdadm's monitoring. It will email you about changes in raid status. If a raid set starts in degraded mode at boot, you will not get an email. You need to look.
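For anyone setting this up, the pieces look roughly like this. It is a sketch that prints the commands by default (`DRY_RUN=1`). The `rc-update` and `rc-service` lines assume Gentoo with OpenRC and a monitor service named `mdadm`, so adjust them for other init systems. The mdadm options used (`--monitor`, `--scan`, `--oneshot`, `--test`) are standard ones.

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints the commands instead of running them.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Send a TestMessage for every array once and exit (checks the mail path).
run mdadm --monitor --scan --oneshot --test

# Assumption: Gentoo with OpenRC, where the monitor runs as the "mdadm" service.
run rc-update add mdadm default
run rc-service mdadm start

# A one-shot run also reports DegradedArray, so running this from a daily cron
# job covers arrays that were already degraded at boot.
run mdadm --monitor --scan --oneshot
```

The cron variant is what the mdadm man page suggests for regular notification of degraded arrays, which helps with the boot-time gap described above.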

It's worth running a repair or a check once a month. Check works on read-only raid sets. Repair does what it says. If you have a read problem you don't even know about, it will be spotted and the data on the affected drive rewritten.
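On Linux md, a check or repair is requested by writing to the array's `sync_action` file in sysfs, and the number of inconsistencies found ends up in `mismatch_cnt`. A minimal sketch follows. The `scrub` helper name is made up, `md3` is just the array from this thread, and it prints the command by default (`DRY_RUN=1`).

```shell
#!/bin/sh
# Request a scrub of an md array. $1 = check|repair, $2 = array name (e.g. md3).
# With DRY_RUN=1 (default) the command is printed instead of executed.
scrub() {
    action=$1
    array=$2
    target=/sys/block/$array/md/sync_action
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "echo $action > $target"
    else
        echo "$action" > "$target"
    fi
}

scrub check md3
# Progress appears in /proc/mdstat; afterwards the result can be read with:
#   cat /sys/block/md3/md/mismatch_cnt
```

Calling this from a monthly cron job gives the routine check described above.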

It's good to run smartd too, so you can see the reallocated sector count change when repair does its stuff.

Good luck, I'm pleased that it all worked out.

----------

