# RAID5 array crashed - how to recover? [solved]

## merlijn

I have four SATA2 drives set up in a single software RAID5 array. Everything was working fine. Sometime last week I read somewhere that it can be a good idea to run occasional checks on your md devices, so I did:

```
echo "check" >> /sys/block/md0/md/sync_action
```

According to /proc/mdstat it was checking my array and all seemed fine. As this took approximately 3 hours to complete, I forgot about the whole thing until I noticed errors in my dmesg:

```
ata5: command timeout
ata5: no sense translation for status: 0x40
ata5: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
ata5: status=0x40 { DriveReady }
sd 4:0:0:0: SCSI error: return code = 0x08000002
sdc: Current: sense key=0xb
    ASC=0x0 ASCQ=0x0
end_request: I/O error, dev sdc, sector 33941911
ata6: command timeout
ata6: no sense translation for status: 0x40
ata6: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
ata6: status=0x40 { DriveReady }
sd 5:0:0:0: SCSI error: return code = 0x08000002
sdd: Current: sense key=0xb
    ASC=0x0 ASCQ=0x0
end_request: I/O error, dev sdd, sector 33941911
ata7: command timeout
ata7: no sense translation for status: 0x40
ata7: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
ata7: status=0x40 { DriveReady }
sd 6:0:0:0: SCSI error: return code = 0x08000002
sde: Current: sense key=0xb
    ASC=0x0 ASCQ=0x0
end_request: I/O error, dev sde, sector 33941911
ata8: command timeout
ata8: no sense translation for status: 0x40
ata8: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
ata8: status=0x40 { DriveReady }
sd 7:0:0:0: SCSI error: return code = 0x08000002
sdf: Current: sense key=0xb
    ASC=0x0 ASCQ=0x0
end_request: I/O error, dev sdf, sector 33941911
raid5: Disk failure on sdf1, disabling device. Operation continuing on 3 devices
raid5: Disk failure on sde1, disabling device. Operation continuing on 2 devices
raid5: Disk failure on sdd1, disabling device. Operation continuing on 1 devices
raid5: Disk failure on sdc1, disabling device. Operation continuing on 0 devices
RAID5 conf printout:
 --- rd:4 wd:0
 disk 0, o:0, dev:sdc1
 disk 1, o:0, dev:sdd1
 disk 2, o:0, dev:sde1
 disk 3, o:0, dev:sdf1
RAID5 conf printout:
 --- rd:4 wd:0
 disk 0, o:0, dev:sdc1
 disk 1, o:0, dev:sdd1
 disk 2, o:0, dev:sde1
RAID5 conf printout:
 --- rd:4 wd:0
 disk 0, o:0, dev:sdc1
 disk 1, o:0, dev:sdd1
 disk 2, o:0, dev:sde1
RAID5 conf printout:
 --- rd:4 wd:0
 disk 0, o:0, dev:sdc1
 disk 1, o:0, dev:sdd1
RAID5 conf printout:
 --- rd:4 wd:0
 disk 0, o:0, dev:sdc1
 disk 1, o:0, dev:sdd1
RAID5 conf printout:
 --- rd:4 wd:0
 disk 0, o:0, dev:sdc1
RAID5 conf printout:
 --- rd:4 wd:0
 disk 0, o:0, dev:sdc1
RAID5 conf printout:
 --- rd:4 wd:0
```

When I realized something had gone wrong I immediately backed up the most important things, since the RAID was still working as before. Earlier today the RAID really died: directory listings no longer work (it is still mounted, though).

Even though I did manage to get the most important parts out, I would really like to get the RAID working again, or at least have another go at backing things up. These disks are all less than 6 months old; there is no way they could all have gone bad.

What is the next step in trying to get my valued data back?

Some other info:

```
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdb1[1] sda1[0]
      48064 blocks [2/2] [UU]

md2 : active raid0 sdb2[1] sda2[0]
      3903616 blocks 64k chunks

md3 : active raid1 sdb3[1] sda3[0]
      34146048 blocks [2/2] [UU]

md0 : active raid5 sdf1[4](F) sde1[5](F) sdd1[6](F) sdc1[7](F)
      937705728 blocks level 5, 64k chunk, algorithm 2 [4/0] [____]

unused devices: <none>
```

```
# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Tue Dec 26 21:56:32 2006
     Raid Level : raid5
     Array Size : 937705728 (894.27 GiB 960.21 GB)
  Used Dev Size : 312568576 (298.09 GiB 320.07 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Jun 13 11:46:05 2007
          State : clean, degraded
 Active Devices : 0
Working Devices : 0
 Failed Devices : 4
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       0        0        1      removed
       2       0        0        2      removed
       3       0        0        3      removed

       4       8       81        -      faulty spare   /dev/sdf1
       5       8       65        -      faulty spare   /dev/sde1
       6       8       49        -      faulty spare   /dev/sdd1
       7       8       33        -      faulty spare   /dev/sdc1
```

Any help much appreciated.

*Last edited by merlijn on Sat Jun 16, 2007 9:40 am; edited 1 time in total*

----------

## blubbi

Try stopping and restarting the RAID:

```
mdadm --manage --stop /dev/md0
mdadm --manage --run /dev/md0
```

Another possible solution is to hot-remove each disk and then hot-add it again, one after the other. For every disk do:

```
mdadm --manage --remove /dev/md0 /dev/sd?
mdadm --manage --add /dev/md0 /dev/sd?
```

Or stop the array and reassemble it:

```
mdadm --assemble --run /dev/md0
```

Hope I could help.

----------

## RaceTM

Does this thread apply to you?

http://ubuntuforums.org/showthread.php?t=198017

----------

## RaceTM

Here is some additional info supporting the thought that it is a bug:

http://lists.openwall.net/linux-kernel/2007/02/20/261

You may want to test another power supply, in case the one you are using is not delivering constant power and is causing your SATA drives to crap out.

----------

## merlijn

Thank you for your replies.

First to get back to RaceTM: I do not believe this is hardware related, I have a Cooler Master Realpower 450W (100% efficient!), which is pretty much overkill anyway. As for the kernel bit being discussed on ubuntuforums, I am using 2.6.19-gentoo-r5, so this 'bug' is pretty much ruled out.

Fact remains that I'm currently more interested in getting my data back than in finding the root cause of these weird events.

blubbi: I am having some difficulties with these commands. After unmounting the array I tried the following:

```
# mdadm --manage --stop /dev/md0
mdadm: stopped /dev/md0
# mdadm --manage --run /dev/md0
mdadm: failed to run array /dev/md0: Invalid argument
# mdadm --run /dev/md0
mdadm: failed to run array /dev/md0: Invalid argument
# mdadm --assemble --run /dev/md0
mdadm: /dev/md0 not identified in config file.
```

I am not too sure about trying to fail devices and re-adding them, as I've had some bad luck with that before. Could anyone give me any suggestions on how to get the array running once again?

Cheers,

----------

## blubbi

Do you have an mdadm.conf where you specified the array?

If not, you have to specify on the command line which drives the array consists of. I was wondering about this because I thought the array would be autodetected, but that didn't work for me in many cases: mdadm looks up the device md0 in the config file to know which drives to use when reassembling.

Another nasty thing, I call it a "BUG", happened when I had a genuinely defective drive. I shut down my system and removed the disk; I had no spare disk available at the time, but needed to get my hands on some data. Let's call the removed disk sdd. In udev manner, when the drive was removed all drives after sdd shifted names: sde became sdd and sdf became sde. At first I thought "no problem, the arrays are assembled by UUID, so the drive name is no big deal"... BUT md took a disk from my RAID5 (which was now named sdd) and synced it with the RAID1 that was missing its drive. Then things turned worse: I made a mistake and synced another drive from the RAID5 with the wrong md device, so now two disks from the RAID5 were gone. That was it, all data lost.

I have never been in a case like the one you describe, though. I have had some drives here and there marked faulty that were not really faulty; removing the drive, re-adding it and restarting the array did the trick. But in your case all the drives are marked faulty. I actually don't know how mdadm marks drives as faulty; maybe you could echo something into /sys/block/md0/??? to tell mdadm to stop considering these drives faulty.

Try to rebuild the array completely from the command line, with the devices specified.

Ah, and by the way, you should have tried hot-removing and re-adding before you stopped the device. I guess that could have worked.

Regards

blubbi

----------

## merlijn

Thank you for your reply.

I should probably have mentioned that I am an utter newb when it comes to software RAID. I installed this once following some guide and everything worked; I was like "wow, cool, now I should be a lot safer when it comes to data loss", and I never gave it much thought after that.

I have now added a line to /etc/mdadm.conf: `ARRAY /dev/md0 devices=/dev/sdc1,/dev/sdd1,/dev/sde1,/dev/sdf1` and got this:

```
# mdadm --assemble --run /dev/md0
mdadm: /dev/sdf1 has no superblock - assembly aborted
```
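As a side note (a sketch, not something from the thread): a `devices=` line in mdadm.conf breaks whenever drive letters shift, so identifying the array by UUID is usually safer. The UUID below is a placeholder; on a real system it would come from `mdadm --examine --scan`, which reads the UUIDs straight off the member superblocks, so the array does not need to be running:

```shell
# Placeholder UUID -- the real one comes from `mdadm --examine --scan`.
uuid="00000000:00000000:00000000:00000000"

# Shape of a UUID-based ARRAY line for /etc/mdadm.conf:
printf 'ARRAY /dev/md0 UUID=%s\n' "$uuid"
```

With such a line in place, `mdadm --assemble /dev/md0` can find the members even if the kernel renames the drives.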

Just to be careful I dont want to try too many new things to preserve any data still recoverable. Where do you suggest I go from here?

Merlijn

----------

## merlijn

To add some more info: it seems my drives don't support SMART, which seems rather strange to me:

```
# smartctl -S on -a /dev/sdc
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: ATA      WDC WD3200KS-00P Version: 21.0
Serial number:      WD-WCAPD3048396
Device type: disk
Local Time is: Thu Jun 14 11:15:55 2007 CEST
Device does not support SMART

Enable autosave (clear GLTSD bit) failed
Error Counter logging not supported
[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
Device does not support Self Test logging
```
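An aside not from the thread: with kernels of that era, SATA disks driven through libata show up as SCSI devices, and smartctl often could not reach SMART through the SCSI translation layer unless told to speak ATA directly via its `-d ata` option. A sketch of the command to try (printed rather than executed here, since it needs root and the actual disk; /dev/sdc is the device name from the thread):

```shell
# smartctl's -d ata flag bypasses the SCSI translation layer on libata
# devices; /dev/sdc is the drive that reported "does not support SMART".
cmd="smartctl -d ata -a /dev/sdc"
echo "$cmd"   # shown instead of run: needs root and the real disk
```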

----------

## blubbi

I am not sure about this, but you could make disk dumps of each drive, so that you can play with the array and, if needed, bring a disk back to its old state by writing the dump back.

But I'm not sure.
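The dump-and-restore idea can be sketched with `dd`. The demo below round-trips a scratch file instead of a real disk; on the actual system the input would be a member partition such as /dev/sdf1 and the output an image file on another disk (both names are assumptions, not from the thread):

```shell
# Demonstrate the image/restore round trip on a scratch file.
# On the real system: dd if=/dev/sdf1 of=/backup/sdf1.img bs=1M conv=noerror
src=$(mktemp)   # stands in for the member disk
img=$(mktemp)   # stands in for the backup image

printf 'superblock-and-data' > "$src"     # initial on-disk contents
dd if="$src" of="$img" bs=1M status=none  # take the image before experimenting
printf 'damaged' > "$src"                 # simulate a botched recovery attempt
dd if="$img" of="$src" bs=1M status=none  # roll back from the image
restored=$(cat "$src")
echo "$restored"                          # prints superblock-and-data
rm -f "$src" "$img"
```

The point of imaging first is that forced assembly and superblock rewrites become reversible experiments instead of one-way doors.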

Best you jump on IRC, head to #gentoo on freenode and ask there.

I would try --force to assemble the array:

```
mdadm --assemble --help
Options that are valid with --assemble (-A) are:
  --bitmap=          : bitmap file to use with the array
  --uuid=       -u   : uuid of array to assemble. Devices which don't
                       have this uuid are excluded
  --super-minor= -m  : minor number to look for in super-block when
                       choosing devices to use.
  --name=       -N   : Array name to look for in super-block.
  --config=     -c   : config file
  --scan        -s   : scan config file for missing information
  --run         -R   : Try to start the array even if not enough devices
                       for a full array are present
  --force       -f   : Assemble the array even if some superblocks appear
                     : out-of-date.  This involves modifying the superblocks.
  --update=     -U   : Update superblock: try '-A --update=?' for option list.
  --no-degraded      : Assemble but do not start degraded arrays.
```
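Putting this suggestion together with the member list from /proc/mdstat, the forced assembly attempt would look roughly like the command below. This is a sketch assembled from the thread, not a verified fix; since --force may rewrite superblocks, imaging the disks first is prudent. The command is printed rather than executed here, as it needs root and the real disks:

```shell
# Member partitions as listed in /proc/mdstat earlier in the thread.
devices="/dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1"

# --force accepts superblocks that look out-of-date; --run starts the
# array even if it is degraded. Printed instead of run.
cmd="mdadm --assemble --force --run /dev/md0 $devices"
echo "$cmd"
```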

----------

## RaceTM

 *merlijn wrote:*   

> Thank you for your replies.
> 
> First to get back to RaceTM: I do not believe this is hardware related, I have a Cooler Master Realpower 450W (100% efficient!), which is pretty much overkill anyway. As for the kernel bit being discussed on ubuntuforums, I am using 2.6.19-gentoo-r5, so this 'bug' is pretty much ruled out.
> 
> Fact remains that im currently more interested in getting my data back than in finding the root cause of these weird events.
> ...

 

The power supply issue is not related to whether or not the PSU is powerful enough; the issue is whether or not it is supplying constant voltage and current to your drives. SATA drives can be flaky when it comes to power, so if your power supply is not providing steady power then this could lead to the issue that was described. In order to test this, I would recommend reassembling your array while running a different power supply. If it works and doesn't break again, then you may want to test your power supply rails with a multimeter to make sure the voltage is actually fairly constant. As a side note, brand doesn't really matter. I'm not too fond of the Cooler Master brand myself (do they even make their own power supplies, or just rebrand other ones?) but regardless of the brand, it is always possible to have a slightly defective unit.

As for finding the root cause, well, we should hope that finding it will prevent this from happening to you again next week. :D

Good luck!

----------

## merlijn

Thank you both for the very useful information.

It appeared the issue was related to my Promise TX4 SATA controller. I think it just decided to crap out and scare me shitless. After I swapped it for another controller and booted back up, the RAID5 was simply detected again and worked like a charm.

 *RaceTM wrote:*   

> The power supply issue is not related to whether or not the PSU is powerful enough; the issue is whether or not it is supplying constant voltage and current to your drives. [...]

 

Even though this wasn't applicable in my case, it is very useful information. I could actually rule this one out because I have two other SATA drives in the case (on the onboard controller), but it's a good thing to keep in mind! As for the Cooler Master PSU, I have nothing but positive experiences. The only annoying thing is that it's equipped with a bunch of blue LEDs.

Cheers.

----------

## blubbi

Nice to hear.

But now you've gotten a bit into RAID management. ;)

regards

blubbi

----------

