# [SOLVED] mdadm RAID 6 recovery

## drescherjm

I have a server that I rebooted, and 2 of the 6 RAID 6 members were kicked out of the array on reboot. I believe this was a power issue, because a few of the cables were partially pulled out when I closed the side of the case. Anyway, when I went to recover, a third disk failed, so the array went down.

```

datastore1 ~ # cat /proc/mdstat

Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]

md0 : active raid1 sdb1[1] sda1[0] sdf1[5] sdd1[3] sde1[4] sdc1[2]

      256896 blocks [6/6] [UUUUUU]

md126 : active raid6 sdb5[1] sda5[0] sdf5[5] sdd5[3] sde5[4] sdc5[2]

      1199283200 blocks level 6, 256k chunk, algorithm 2 [6/6] [UUUUUU]

md127 : active raid6 sda6[6](S) sdb6[7](S) sdf6[5] sdd6[8](F) sde6[4] sdc6[2]

      1680013056 blocks level 6, 64k chunk, algorithm 2 [6/3] [__U_UU]

md1 : active raid6 sdb3[1] sda3[0] sdf3[5] sdd3[4] sde3[3] sdc3[2]

      46909440 blocks level 6, 256k chunk, algorithm 2 [6/6] [UUUUUU]

```

Nothing would have been written to any of the filesystems between the time the raid first went down and now.

The array in question is md127. The original 2 failed members (sda6 and sdb6) are marked as spares; the other failed member, sdd6, is marked as failed. Looking at the SMART data for this disk, it looks like I have a URE:
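For reference, one way to see which members are closest to current (a sketch using the device names above; adjust to your system) is to compare the superblock event counters — the members with the highest event count are the most up to date:

```shell
# Compare event counters and update times across the md127 members
for d in /dev/sd{a,b,c,d,e,f}6; do
    echo "== $d =="
    mdadm --examine "$d" | grep -E 'Events|Update Time|State'
done
```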

```
datastore1 ~ # smartctl --all /dev/sdd

smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen

Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===

Model Family:     Seagate Barracuda 7200.10 family

Device Model:     ST3750640AS

Serial Number:    3QD0QQYF

Firmware Version: 3.AAJ

User Capacity:    750,156,374,016 bytes

Device is:        In smartctl database [for details use: -P show]

ATA Version is:   7

ATA Standard is:  Exact ATA specification draft version not indicated

Local Time is:    Thu Jan 13 11:08:13 2011 EST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

                                        was completed without error.

                                        Auto Offline Data Collection: Enabled.

Self-test execution status:      (   0) The previous self-test routine completed

                                        without error or no self-test has ever

                                        been run.

Total time to complete Offline

data collection:                 ( 430) seconds.

Offline data collection

capabilities:                    (0x5b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        No Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                        General Purpose Logging supported.

Short self-test routine

recommended polling time:        (   1) minutes.

Extended self-test routine

recommended polling time:        ( 202) minutes.

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000f   117   080   006    Pre-fail  Always       -       148320400

  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0

  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       31

  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2

  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       26208355695

  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       11747

 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       39

187 Reported_Uncorrect      0x0032   096   096   000    Old_age   Always       -       4

189 High_Fly_Writes         0x003a   098   098   000    Old_age   Always       -       2

190 Airflow_Temperature_Cel 0x0022   057   047   045    Old_age   Always       -       43 (Lifetime Min/Max 40/48)

194 Temperature_Celsius     0x0022   043   053   000    Old_age   Always       -       43 (0 22 0 0)

195 Hardware_ECC_Recovered  0x001a   060   053   000    Old_age   Always       -       111473581

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4

200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0

202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1

ATA Error Count: 4

        CR = Command Register [HEX]

        FR = Features Register [HEX]

        SC = Sector Count Register [HEX]

        SN = Sector Number Register [HEX]

        CL = Cylinder Low Register [HEX]

        CH = Cylinder High Register [HEX]

        DH = Device/Head Register [HEX]

        DC = Device Command Register [HEX]

        ER = Error register [HEX]

        ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 4 occurred at disk power-on lifetime: 11711 hours (487 days + 23 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 28 53 ae e0  Error: UNC at LBA = 0x00ae5328 = 11424552

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 00 00 53 ae e0 00      05:42:45.470  READ DMA EXT

  27 00 00 00 00 00 e0 00      05:42:45.465  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 02      05:42:43.456  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 02      05:42:43.453  SET FEATURES [Set transfer mode]

  27 00 00 00 00 00 e0 00      05:42:43.438  READ NATIVE MAX ADDRESS EXT

Error 3 occurred at disk power-on lifetime: 11711 hours (487 days + 23 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 28 53 ae e0  Error: UNC at LBA = 0x00ae5328 = 11424552

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 00 00 53 ae e0 00      05:42:41.167  READ DMA EXT

  25 00 00 00 55 ae e0 00      05:42:41.159  READ DMA EXT

  25 00 00 00 54 ae e0 00      05:42:43.456  READ DMA EXT

  27 00 00 00 00 00 e0 00      05:42:43.453  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 02      05:42:43.438  IDENTIFY DEVICE

Error 2 occurred at disk power-on lifetime: 11132 hours (463 days + 20 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 fc 55 ae e0  Error: UNC at LBA = 0x00ae55fc = 11425276

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 80 80 55 ae e0 00      04:40:23.214  READ DMA EXT

  27 00 00 00 00 00 e0 00      04:40:23.205  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 02      04:40:23.151  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 02      04:40:23.146  SET FEATURES [Set transfer mode]

  27 00 00 00 00 00 e0 00      04:40:21.177  READ NATIVE MAX ADDRESS EXT

Error 1 occurred at disk power-on lifetime: 11132 hours (463 days + 20 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 fc 55 ae e0  Error: UNC at LBA = 0x00ae55fc = 11425276

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 80 80 55 ae e0 00      04:40:18.872  READ DMA EXT

  27 00 00 00 00 00 e0 00      04:40:18.864  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 02      04:40:18.764  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 02      04:40:16.831  SET FEATURES [Set transfer mode]

  27 00 00 00 00 00 e0 00      04:40:21.177  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1

SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

```

What is the best method of recovery? Should I try to force one or more of the spares back into the array, or the failed disk? I have enough disks to make bit-for-bit dumps of every member, but that will take a really long time.

John

----------

## drescherjm

Here is more on the URE. 

```

Jan 11 22:50:47 datastore1 [20511.486741] end_request: I/O error, dev sdd, sector 1403933480

Jan 11 22:50:47 datastore1 [20511.486749] md/raid:md127: read error not correctable (sector 778796072 on sdd6).

Jan 11 22:50:47 datastore1 [20511.486756] md/raid:md127: Disk failure on sdd6, disabling device.

Jan 11 22:50:47 datastore1 [20511.486758] <1>md/raid:md127: Operation continuing on 3 devices.

Jan 11 22:50:47 datastore1 [20511.486765] md/raid:md127: read error not correctable (sector 778796080 on sdd6).

Jan 11 22:50:47 datastore1 [20511.486771] md/raid:md127: read error not correctable (sector 778796088 on sdd6).

Jan 11 22:50:47 datastore1 [20511.486777] md/raid:md127: read error not correctable (sector 778796096 on sdd6).

Jan 11 22:50:47 datastore1 [20511.486783] md/raid:md127: read error not correctable (sector 778796104 on sdd6).

Jan 11 22:50:47 datastore1 [20511.486789] md/raid:md127: read error not correctable (sector 778796112 on sdd6).

Jan 11 22:50:47 datastore1 [20511.486796] md/raid:md127: read error not correctable (sector 778796120 on sdd6).

Jan 11 22:50:47 datastore1 [20511.486802] md/raid:md127: read error not correctable (sector 778796128 on sdd6).

Jan 11 22:50:47 datastore1 [20511.486808] md/raid:md127: read error not correctable (sector 778796136 on sdd6).

Jan 11 22:50:47 datastore1 [20511.486814] md/raid:md127: read error not correctable (sector 778796144 on sdd6).

```

----------

## frostschutz

Was the raid in use while/after the disks fell out, before the recovery? If so, the two disks that fell out are out of sync, so re-adding them without requiring a resync is probably out of the question.

Now the question: what is on that md, a filesystem or LVM? Were the filesystem(s) mounted while the resync with all disks was going on? That at least would mean that the superblocks were back in sync.

If the disks cannot be re-added, what you can do is find out which metadata version you are using (0.9 or one of the 1.x) as well as the chunk size; then recreate the RAID with --assume-clean, using the minimum number of devices to make it go, in the same order, marking the other two disks as missing (the failed disk and one of the disks that fell out, preferably the one you re-added back last; i.e. you want to keep the disk that was partially synced in and leave the other out).
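Both values can be read from a surviving member's superblock, for example (device name is just an example):

```shell
# Read metadata version and chunk size from one surviving member
mdadm --examine /dev/sdc6 | grep -E 'Version|Chunk Size'
```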

For example (not tested; verify the order and choice of missing disks, and if you were using a non-standard chunk size, add it here too). Note that this is a last resort only, as recreating the raid kills any old raid metadata that could have helped in the restore/resync process.

```
# use whichever metadata version the old array had (1.1 here as an example)
mdadm --create /dev/md42 --metadata 1.1 --level 6 --raid-devices 6 --assume-clean \
    /dev/sda6 missing /dev/sdc6 missing /dev/sde6 /dev/sdf6
```

Then you should get a running md device that you can operate READ ONLY.
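If you want to be extra safe, you can also mark the array read-only at the md layer before touching it (not strictly required, just a precaution):

```shell
# Prevent any writes at the md layer while inspecting
mdadm --readonly /dev/md42
# later, once you trust the data:
# mdadm --readwrite /dev/md42
```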

So the first thing you should do is test whether the correct data is on there:

```
file -s /dev/md42
```

and mount it read-only (and, in the case of an ext filesystem, also with noload):

```
mount -o ro,noload /dev/md42 /mnt/ohmygodtheykilledkenny
```

And then you see if you have files there; find one larger file (several MB) and check whether that file is intact.

If that works up to here, you can then decide whether to just accept it like that and umount, fsck (some corruption may occur here), mount, and re-add the other devices, or whether you want to take this chance and copy everything off to backup storage first.

----------

## drescherjm

 *Quote:*   

> Was the raid in use while/after the disks fell out, before the recovery? That basically means that the two discs that fell out, are out of sync. So readding without requiring sync, is probably out of the question. 

 

sda6 and sdb6 were dropped at boot. 

 *Quote:*   

> Now the question, what is on that md? A filesystem or LVM? Were the filesystem(s) mounted while the resync with all discs was going on? That at least would mean that the superblocks were back in sync. 

 

LVM on the md, with the only PV being this raid array.

On the LVM, I believe there was only 1 logical volume mounted, an XFS filesystem. Since this is an images-only (medical) filesystem, and I am pretty sure no new images were added during the raid rebuild, the filesystem should not have gotten that far out of sync. However, doesn't mounting a filesystem cause some writes?
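For reference, XFS in particular replays its journal even on a read-only mount unless norecovery is also given, so a fully write-free mount would look something like this (VG/LV name and mount point are placeholders):

```shell
# XFS: ro alone still replays the log; norecovery skips that as well
mount -t xfs -o ro,norecovery /dev/vg0/images /mnt/recovery
```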

 *Quote:*   

> If the discs can not be re-added, what you can do is find out which metadata you are using (0.9 or one of the 1.x) as well as the chunk size

 

It will be 0.9, and the chunk size is either 256 or 64; I can query for that. Right now I am using dd to make a bit-for-bit copy of each raid member. The lucky thing is that I got 9 new 2 TB drives in today for a second raid upgrade.
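The per-member dump can be as simple as this (destination path is a placeholder; conv=noerror,sync skips unreadable blocks and zero-pads them so offsets stay aligned):

```shell
# Image one raid member; repeat for each of sda6..sdf6
# note: on a read error, sync pads a whole bs-sized block, so keep bs modest
dd if=/dev/sdd6 of=/backup/sdd6.img bs=64K conv=noerror,sync
```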

Thanks a lot for your help. I will probably be doing the recovery tomorrow or Monday, depending on how long it takes to copy 6 x 420 GB partitions using dd.

----------

## drescherjm

Thanks a lot. I was able to get the array back up with:

```
mdadm --create --metadata 0.9 --level 6 -n 6 -c 64 --assume-clean /dev/md2 missing missing /dev/sdc6 /dev/sdd6 /dev/sde6 /dev/sdf6
```

After this I ran xfs_check on the one filesystem under LVM, and it reported no errors.

I am now copying the 1.4 TB of data to a disk, and then I will rebuild the array.

BTW, I did try to force the other 2 disks back into the array, but they were out of sync. I found that out by forcing a check on the raid array.
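For reference, forcing such a check goes through sysfs (md2 as created above):

```shell
# Start a parity check, then look for mismatches when it finishes
echo check > /sys/block/md2/md/sync_action
cat /sys/block/md2/md/mismatch_cnt
```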

----------

## frostschutz

Great, happy backing up.  :Laughing:

But do check that the files themselves are intact. XFS is sometimes a bit ... lax when it comes to verifying filesystems.

----------

## drescherjm

I will try to check the files when this gets done. Most of these files should be in my backup under bacula, so in theory I should be able to compare the bacula checksums to the data.

----------

## drescherjm

The copy failed with less than 100 GB to go. The reason was that the drive with the 1 uncorrectable sector got kicked out of the array. Is there a way to prevent mdadm from kicking the drive out of the array on a read error?

----------

## frostschutz

Probably not, but since you're making the copy with dd, you can just reassemble the raid with --assume-clean and resume the copy after the error:

```

dd if=/dev/old of=/dev/new conv=noerror,sync seek=x skip=x

```

where x is the block (in dd's default 512-byte units) at which you want the copy to resume. Find the smallest x that does not kick the drive out of the array. Then you'll have lost one chunk in the copy, but that should be better than nothing.

If you're not copying to a device but to an image file, you will also have to add notrunc to the conv options; otherwise you'll probably have lost your old file.
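The block arithmetic itself is simple; with dd's default 512-byte blocks, a byte position converts to a seek=/skip= value like this (397 GB is just the resume point used later in this thread):

```shell
# Convert a byte offset into dd's seek=/skip= block count (512-byte blocks)
echo $((397 * 1000000000 / 512))   # -> 775390625
```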

----------

## drescherjm

Thanks. I had already started tinkering before I read your reply, and I believe I have it. I used the same idea as you.

I used ddrescue to copy the rest of /dev/sdd6 to an image file. The dd copy had truncated the image at the 397 GB point instead of the 430 GB point, which was the end. With ddrescue I was able to continue where dd left off, with something like:

```
ddrescue -i 397G -o 397G /dev/sdd6 /mnt/sdh1/sdd6.img
```

Then I used a loop device to make /dev/loop1 point to the /mnt/sdh1/sdd6.img file.
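The loop attachment is a single command:

```shell
# Back /dev/loop1 with the rescued image so mdadm can use it as a member
losetup /dev/loop1 /mnt/sdh1/sdd6.img
```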

Then I recreated the array again with a similar command:

```
mdadm --create --metadata 0.9 --level 6 -n 6 -c 64 --assume-clean /dev/md2 missing missing /dev/sdc6 /dev/loop1 /dev/sde6 /dev/sdf6
```

Then I mounted the filesystem and continued the file copy with rsync, and it appears to be working.
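A resumable copy like that could look like this (source and destination paths are placeholders; --partial keeps interrupted files so rsync can pick up where it stopped):

```shell
# Copy everything off the recovered filesystem, resumable on interruption
rsync -a --partial /mnt/recovered/ /mnt/backup/
```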

----------

## frostschutz

So you made a loop device for the broken disk. That's an okay approach too, of course  :Smile:

The data block in the copy will be corrupted, though. And if it happened during the rsync, that probably means one file with a bit of corrupted data.

----------

