# HELP: 2 of 6 devices faulty in softraid-5

## Master One

I really thought my storage server was reasonably safe, being a softraid-5 built from six Maxtor MaxLine II Plus 250GB IDE drives (ultra-reliable enterprise-class disk drives designed for low-I/O enterprise applications such as midline, nearline, NAS and other secondary storage solutions), but yesterday at noon something terrible happened that I could not have imagined before: it looks like two drives are now faulty, with different issues. Luckily these drives carry a 5-year warranty, and I have already requested two advance-replacement drives, which I expect to arrive in the upcoming week. So it is not about the drives themselves, but about the content of my 1.25TB raid (it does not contain anything truly important, but it would be quite a loss if it could not be recovered).

This softraid-5 was built as follows:

```
Device0 /dev/hda1
Device1 /dev/hdc1 -> failed on 18th November
Device2 /dev/hde1
Device3 /dev/hdg1
Device4 /dev/hdi1 -> failed on 25th October
Device5 /dev/hdk1

Filesystem is reiserfs
Assembled as /dev/md7
```

After some analysis I found out that the two drives did not die at the same time: it looks like /dev/hdi already failed on 25th October. (That server sits unattended in an office; software is usually updated once a week, but because network access to the samba share on that raid device worked without any flaws until yesterday, the status of the raid was only checked occasionally, and as you can see, not within the last month.) I didn't know that a raid-5 could keep operating with one device down and no spare drive; I thought a defective drive would have to be replaced before the raid could be started again.
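In hindsight, the degraded state was visible all along in /proc/mdstat. Here is a small sketch of what to look for; the mdstat excerpt below is hypothetical, modelled on this array, and the "[6/5]" counter is the giveaway:

```shell
#!/bin/sh
# Hypothetical /proc/mdstat excerpt for this array after hdi1 failed:
# "[6/5]" means 6 configured members but only 5 active, i.e. degraded.
mdstat='md7 : active raid5 hdk1[5] hdi1[4](F) hdg1[3] hde1[2] hdc1[1] hda1[0]
      1225586560 blocks level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U]'

state=unknown
case "$mdstat" in
  *"[6/5]"*) state=degraded ;;  # still running, but with no redundancy left
  *"[6/6]"*) state=healthy ;;
esac
echo "$state"
```

A weekly (or daily) cron job that checks /proc/mdstat for a `_` in the status bitmap would have flagged this back in October.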

So this is the part of the syslog from when the first drive (/dev/hdi) died:

```
Oct 25 03:12:45 storemaster hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct 25 03:12:45 storemaster hdi: dma_intr: error=0x40 { UncorrectableError }, LBAsect=273969472, high=16, low=5534016, sector=273969471
Oct 25 03:12:45 storemaster end_request: I/O error, dev hdi, sector 273969471
Oct 25 03:12:45 storemaster raid5: Disk failure on hdi1, disabling device. Operation continuing on 5 devices
Oct 25 03:12:45 storemaster disk 4, o:0, dev:hdi1
```

And now what happened yesterday:

```
Nov 18 12:17:05 storemaster hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 18 12:17:05 storemaster hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=348709327, high=20, low=13165007, sector=348709327
Nov 18 12:17:05 storemaster end_request: I/O error, dev hdc, sector 348709327
Nov 18 12:17:05 storemaster raid5: Disk failure on hdc1, disabling device. Operation continuing on 4 devices
Nov 18 12:17:07 storemaster hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 18 12:17:07 storemaster hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=348709335, high=20, low=13165015, sector=348709335
Nov 18 12:17:07 storemaster end_request: I/O error, dev hdc, sector 348709335
Nov 18 12:17:08 storemaster hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 18 12:17:08 storemaster hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=348709351, high=20, low=13165031, sector=348709343
Nov 18 12:17:08 storemaster end_request: I/O error, dev hdc, sector 348709343
Nov 18 12:17:10 storemaster hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 18 12:17:10 storemaster hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=348709351, high=20, low=13165031, sector=348709351
Nov 18 12:17:10 storemaster end_request: I/O error, dev hdc, sector 348709351
Nov 18 12:17:11 storemaster hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 18 12:17:11 storemaster hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=348709359, high=20, low=13165039, sector=348709359
Nov 18 12:17:11 storemaster end_request: I/O error, dev hdc, sector 348709359
Nov 18 12:17:12 storemaster hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 18 12:17:12 storemaster hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=348709370, high=20, low=13165050, sector=348709367
Nov 18 12:17:12 storemaster end_request: I/O error, dev hdc, sector 348709367
Nov 18 12:17:14 storemaster hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 18 12:17:14 storemaster hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=348709379, high=20, low=13165059, sector=348709375
Nov 18 12:17:14 storemaster end_request: I/O error, dev hdc, sector 348709375
Nov 18 12:17:14 storemaster disk 1, o:0, dev:hdc1
```

I already checked both drives with the Maxtor PowerMax disk diagnostics tool; it reported "failed" for both, with a different diagnostic code for each (#DEA6991 for hdi and #DE97987D for hdc).

The question now is: what are my chances of recovering the data once the two replacement drives arrive? Neither drive is completely dead, which means I can still run a check on both with hdparm, and I can also check the partition tables with fdisk.
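Concretely, the checks that are still possible are all read-only, roughly along these lines (a command sketch; smartctl from smartmontools would be an addition to what I listed):

```shell
# Read-only checks that are safe on a failing drive (command sketch):
#
#   hdparm -i /dev/hdc     # drive identification; confirms it still answers
#   fdisk -l /dev/hdc      # is the partition table still readable?
#   smartctl -a /dev/hdc   # SMART health, reallocated/pending sector counts
```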

From what I can see in the syslog after a reboot, it looks really bad for /dev/hdi:

```
Nov 18 20:11:49 storemaster hdi: max request size: 1024KiB
Nov 18 20:11:49 storemaster hdi: 490234752 sectors (251000 MB) w/7936KiB Cache, CHS=30515/255/63, UDMA(133)
Nov 18 20:11:49 storemaster hdi: cache flushes supported
Nov 18 20:11:49 storemaster hdi: hdi1
Nov 18 20:11:49 storemaster hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 18 20:11:49 storemaster hdi: dma_intr: error=0x40 { UncorrectableError }, LBAsect=490223295, high=29, low=3684031, sector=490223295
Nov 18 20:11:49 storemaster end_request: I/O error, dev hdi, sector 490223295
Nov 18 20:11:49 storemaster md: disabled device hdi1, could not read superblock.
Nov 18 20:11:49 storemaster md: hdi1 has invalid sb, not importing!
```

The situation for /dev/hdc is different:

```
Nov 18 20:31:58 storemaster hdc: max request size: 1024KiB
Nov 18 20:31:58 storemaster hdc: 490234752 sectors (251000 MB) w/7936KiB Cache, CHS=30515/255/63, UDMA(133)
Nov 18 20:31:58 storemaster hdc: cache flushes supported
Nov 18 20:31:58 storemaster hdc: hdc1
Nov 18 20:31:58 storemaster md: hdc1 has different UUID to sdb9
Nov 18 20:31:58 storemaster md: hdc1 has different UUID to sdb8
Nov 18 20:31:58 storemaster md: hdc1 has different UUID to sdb7
Nov 18 20:31:58 storemaster md: hdc1 has different UUID to sdb6
Nov 18 20:31:58 storemaster md: hdc1 has different UUID to sdb5
Nov 18 20:31:58 storemaster md: hdc1 has different UUID to sdb3
Nov 18 20:31:58 storemaster md: hdc1 has different UUID to sdb1
Nov 18 20:31:58 storemaster md:  adding hdc1 ...
Nov 18 20:31:58 storemaster md: bind<hdc1>
Nov 18 20:31:58 storemaster md: running: <hdk1><hdg1><hde1><hdc1><hda1>
Nov 18 20:31:58 storemaster md: kicking non-fresh hdc1 from array!
Nov 18 20:31:58 storemaster md: unbind<hdc1>
Nov 18 20:31:58 storemaster md: export_rdev(hdc1)
```

Of course /dev/md7 cannot be assembled or run in this state, so there is nothing that can be analysed with mdadm at the moment.

I hope there are some raid specialists around here who can answer the following questions:

1. Can the superblock on /dev/hdi be restored in any way?

2. What would be the proper procedure once the two replacement drives arrive?

3. Would it make sense to first copy the content of both drives sector-by-sector to the new drives, and could this be done?

4. Is there anything else I could try or check in advance (while the replacement drives are not here yet)?

----------

## NeddySeagoon

Master One,

Your chances of getting all your data back are close to zero.

Find dd_rhelp (it's not in Portage) and a machine you can install the new drive and a faulty drive in.

Run dd_rhelp to copy the faulty drive to the replacement. Do both drives.

dd_rhelp does a binary search for good sectors on a damaged drive and recovers what it can.

It will gradually do more and more retries, so it never completes unless you are very lucky.

Put the recovered drives in your raid array and see how lucky you are.

You will need to devise some sort of data integrity test.

Oh, and set up a cron job to send you a status email once a day, or at least when you get a drive failure.
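As a sketch (hypothetical mail address; the exact config path varies between distros), that monitoring could look like:

```shell
# Command/config sketch for the suggested monitoring:
#
#   # /etc/mdadm.conf -- where to send failure mail:
#   MAILADDR raid-admin@example.com
#
#   # Run the monitor as a daemon; it mails on Fail/DegradedArray events:
#   mdadm --monitor --scan --daemonise --delay=1800
#
#   # Daily status mail from cron, failure or not:
#   0 8 * * *  cat /proc/mdstat | mail -s "raid status" raid-admin@example.com
```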

----------

## Master One

Thanks for the feedback, NeddySeagoon. I can live with partial data loss, but hopefully not all is lost. I already took a look at dd_rhelp; I guess this is the only way to go. I can still use that machine to perform the drive copying, because the OS is installed on a separate softraid-1 on two U160 SCSI disks. The great mystery now is what exactly will happen after I install the two replacement drives with the copied content of the faulty ones. Will it be possible to assemble and run the array, and if not, what can be done in that case? If I am lucky, what sort of data integrity test would be available (or is it just fsck.reiserfs)?

The other thing that's actually on my mind:

When trying to assemble the faulty raid in its current state, it does not work out because of the missing superblock on /dev/hdi (the superblock on /dev/hdc is intact, so hdc most likely only suffers from some bad sectors).

Shouldn't I just replace /dev/hdi with the new replacement drive and try to assemble and run the raid then?

Or would that attempt cause more harm than good?

Is there any chance that it could work out that way, I mean, that /dev/hdi gets reconstructed, so that I can then swap the faulty /dev/hdc with the second replacement drive, have a final reconstruction, and everything is done?

Oh boy, what a terrible situation, having to sit here with the uncertainty whether this will work out at all, and having to wait until the replacement drives arrive next week...

If I get it going again, I will certainly use the mdadm reporting feature so that such a mess cannot happen again (or should I run smartmontools on all IDE drives in addition to mdadm's reporting feature in daemon mode)?

----------

## NeddySeagoon

Master One,

You should be able to bring up your raid with one drive missing anyway. RAID is about drive redundancy.

Your situation is complicated by the differing failure dates.

When the first drive failed it was taken out of use and the system continued with degraded RAID operation.

You probably cannot use that drive (or a copy of it) to bring up the raid in degraded mode, as its contents are no longer consistent with the rest of the raid set. Neither can the raid set be reconstructed from the good drives alone; there is not enough data. You need to recover the data from the most recently failed drive, or as much of it as you can.

I don't know how you would check your data integrity. Damage will be evident at two levels.

Where metadata is lost (sectors holding directory information), the files those directory entries pointed to will be difficult to recover. You would need to know how reiserfs allocates disk space.

Where data sectors are lost, the data is gone.

In the case of unused sectors, it doesn't make any difference - there was nothing there to lose.

Its down to luck now. 

Do not operate the drive unless you really have to; if the platter bearings have failed, it will get worse quite quickly.
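As for that integrity test: once the array is back, the read-only checks worth running could be sketched as follows (reiserfsck ships with reiserfsprogs; mount read-only first so nothing gets modified):

```shell
# Read-only integrity checks after a rebuild (sketch, not a guarantee):
#
#   mdadm --detail /dev/md7          # are all members "active sync"?
#   reiserfsck --check /dev/md7      # filesystem consistency check, read-only
#   mount -o ro /dev/md7 /mnt/raid   # mount read-only and spot-check files
```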

----------

## Master One

Of course! I was so caught up in my first assumption that both drives died at the same time yesterday that I completely overlooked the fact that the drive that died on 25th October (/dev/hdi) no longer matters at all.

So it is all about saving as much content as possible from yesterday's failed drive /dev/hdc, and this is what I am planning to do once the two replacement drives arrive:

1. Copy the content of /dev/hdc to the first new drive using dd_rhelp.

2. Simply replace /dev/hdi with the second new drive and partition it identically to the other drives (it's only one large primary partition of type fd).

3. Assemble & run the raid.

The only uncertainty at this point is whether the raid can be assembled properly. With the defective /dev/hdi, mdadm complained that the raid could not be assembled due to the missing superblock on /dev/hdi. But the new drive will also have no superblock (which is written when the raid is created, right?). So unless I am missing something here, mdadm will also complain about the missing superblock on the new drive. Or will it somehow recognize that it is a new drive and create the superblock itself (using the --add option in mdadm's manage mode)? I just want to be 100% sure that I am not doing anything wrong once the replacement drives are here.

Any further recommendations and ideas are highly appreciated.

----------

## NeddySeagoon

Master One,

You need to bring the raid up in degraded mode, with the partitions from the new drive missing, then raidhotadd them so the redundant data is recreated.

You only fdisk the new drive. raidhotadd can take a long time on a large partition, but you can use the raid while it rebuilds like this.

With only one drive missing, the raid will form if its parts appear to be correct.
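In mdadm terms (raidtools' raidhotadd corresponds to mdadm's --add; device names are this thread's), the sequence would look roughly like this:

```shell
# Command sketch -- degraded start, then rebuild onto the new member:
#
#   # 1. Assemble with the replaced member simply left out:
#   mdadm --assemble --run /dev/md7 /dev/hda1 /dev/hdc1 /dev/hde1 /dev/hdg1 /dev/hdk1
#
#   # 2. Recreate the single type-fd partition on the new drive:
#   fdisk /dev/hdi
#
#   # 3. Hot-add it; the kernel reconstructs parity onto it in the background:
#   mdadm --manage /dev/md7 --add /dev/hdi1
#
#   # 4. Watch the rebuild progress:
#   cat /proc/mdstat
```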

----------

## Master One

Thanks, NeddySeagoon, you are a great help. On my various machines I have one hardware raid-5, two software raid-1s and one software raid-5, but I never ran into any trouble until now, so I am not that confident dealing with the current situation. I read the Linux Software-RAID FAQ, but the following question is still not properly answered:

Does the new replacement drive (for my defective /dev/hdi) have to be partitioned (and set to type 'fd' for Linux raid autodetect) before it gets added to the raid?

The FAQ only says to swap the defective drive and raidhotadd it. You mention fdisking the new drive. I am confused now.

----------

## NeddySeagoon

Master One,

Linux software raid does not raid whole drives; it builds raid devices from partitions.

This allows different raid levels to be hosted on the same drives. For example, my /boot is raid 1 but all my other partitions are raid 0. They are all on the same pair of drives.

When you raidhotadd, you will raidhotadd a partition, not a drive. The partition must be there for raidhotadd to find it, so you need to create it before you can tell raidhotadd about it. 

Silly thought for the day.

You could make a raid5 set from 3 or more partitions on the same drive. It may not be useful, other than for testing, but the kernel won't mind.

----------

## j-m

 *NeddySeagoon wrote:*   

> Master One,
> 
> You need to bring the raid up in degraded mode, with the partitions from the new drive missing, then raidhotadd them, so the redundant data is recreated.
> 
> 

 

It's already been running in degraded mode since the first drive failed.

 *NeddySeagoon wrote:*   

> 
> 
> You only fdisk the new drive. raidhotadd can take a long time on a large partition but you can use the raid while it rebuilds like this.
> 
> With only one drive missing, the raid will form if its parts appear to be correct.

 

No chance this would work w/ two drives failed.

----------

## Master One

Ok, understood now. Silly me, of course it is not possible to add a whole drive, only a partition (I should have had a clearer mind before writing, but to me it was effectively the whole drive, because I have only one large partition on each of those raid5 drives).

j-m, only the first drive (with the missing superblock) is really gone; the second one only seems to have some bad sectors, so I am speculating on the chance that I can save it by using dd_rhelp to copy its content over to the second replacement drive.

I only hope this will work out somehow; I'll let you know the result. For now I have to wait until the two replacement drives arrive (Maxtor is sending them from Hungary; hopefully they arrive within the next few days).

----------

## NeddySeagoon

j-m,

I'm well aware that Master One needs to recover the most recently failed drive to attempt to get the raid to start in degraded mode. If that works, and the data is worth saving, then the oldest failed drive can be replaced by raidhotadd.

I said earlier  *Quote:*   

> 
> 
> Neither can the raid set be reconstructed with the data from the good drives, there is not enough data. You need to recover the data from the most recently failed drive, or as much as you can.

 

----------

## Master One

BTW I'm going to use the mdadm email monitoring function in the future for my softraid-1 & softraid-5, which should prevent such a mess from ever happening again.

But I have no clue how to remotely monitor the hardware raid-5 in my IBM eServer 226 with its ServeRAID 6i U320 SCSI raid controller.

Does anyone have a hint?

----------

## Master One

I have just been informed about a piece of software called "HDD Regenerator", which is intended for the following purpose: *Quote:*   

> Program features 
> 
> Ability to detect physical bad sectors on a hard disk drive surface 
> 
> Ability to repair physical bad sectors (magnetic errors) on a hard disk surface 
> ...

 

Would that software provide a better chance of saving my raid drive with bad sectors (/dev/hdc) than just trying to copy its content using dd_rhelp?

Once again, the objective:

Of the 6 disks in that softraid-5, one is gone (/dev/hdi -> no superblock found), and a second one (/dev/hdc) is marked as "non-fresh" due to bad sectors.

Before the replacement drive for /dev/hdi can be hot-added to the raid-5, the raid has to be started in degraded mode, which is currently not possible due to /dev/hdc being marked as "non-fresh".

So my concern now is that I somehow have to regenerate /dev/hdc, so that the raid can be assembled and run in degraded mode at all.

Would this be the best way of proceeding:

1. Copy content of defective /dev/hdc to a new drive using dd_rhelp

2. Run HDD REGENERATOR on defective /dev/hdc

3. Try to assemble and run the raid using the defective but regenerated /dev/hdc drive

4. If that fails, try to assemble and run the raid using the new /dev/hdc drive (which contains the copy of the content of the defective drive)

Or would it be better to swap steps 3 and 4?

Still missing info:

Is there a way to mark /dev/hdc as "fresh" again once it has been copied to the new drive or regenerated using that HDD Regenerator software, so that the raid can be assembled and started in degraded mode at all?

I mean, let's say /dev/hdc only suffers from a few bad sectors and could successfully be regenerated / copied to the new drive; I assume mdadm will still complain that it is marked "non-fresh", and therefore the raid can't be assembled and started.

I hope someone can provide the missing info. I am really desperate.

(The two replacement drives still have not arrived here, but online tracking confirms they left Maxtor Hungary yesterday.)

----------

## NeddySeagoon

Master One,

"HDD Regenerator" appears to defy the laws of physics. First, software running on your PC plays no part in reading the data from the disc platter: if the drive can't read it, it can't read it for any software. The advertising is misleading too. It's been about 10 years since you could do a low-level format on a hard drive. One read-only platter surface is given over to head-servo and formatting information. There was a brief period when you could overwrite this information, but it rendered the drive unusable, since the head-servo information was lost.

When something appears too good to be true, it usually is.

I've lost the plot with which drive you need to recover - it needs to be the one that failed most recently, regardless of what's wrong with it.

Most filesystems store several copies of the superblock so if the primary one is damaged, it can still be repaired.

----------

## Master One

You are probably right, and dd_rhelp may already be the best bet; I'll see. I just don't want to miss any available chance to save as much as possible of the raid5 content. It would be really sad if all the data were lost just because of a few bad sectors. I am not so much concerned about recovering the content of the bad sectors themselves, but about being able to assemble and run the raid at all.

BTW I am getting a little impatient by now, because my replacement drives should have arrived days ago, but something went wrong in transport (online tracking shows that the two parcels left Maxtor EMEA in Hungary on Tuesday, but for an unknown reason they went to the Netherlands instead of Austria). It's already been a week now that my terabyte file server is down...

----------

## NeddySeagoon

Master One,

If you have had a spin motor bearing failure, you may get data back by operating the failed drive in unusual positions.

If it normally operates on edge, try it on the opposite edge and any other stable position you like.

Sometimes making gravity move the spindle onto an unworn part of the bearing can make the drive read well enough to recover your data. It only needs to work once per faulty sector.

Do not move the drive while it's spinning unless you have recovered everything you expect to get; the gyroscopic forces can do a lot of harm.

----------

## Master One

Good hint, may be worth a try. That drive was positioned normally, so I'll try the recovery with dd_rhelp with the drive positioned upside down (at least it should cause no further harm).

----------

## Master One

Damned, still stuck here. The replacement drives arrived last Monday, but when I tried the latest version of dd_rhelp with dd_rescue, it always stopped after a while with the following error message:

```
dd_rhelp: error: sources add_chunk : invalid argument '0-(info)' (is not correctly formatted as number:number)
```

I already contacted Valentin (dd_rhelp) as well as Kurt (dd_rescue) with the details and the logfile by email. Only Valentin has replied so far, mentioning that the error means there is incoherent info in the log file, but I have not received any further reply since Tuesday...

Does anybody know of an alternative to dd_rhelp+dd_rescue?

The goal is simply to copy the content of the defective drive to the new drive regardless of the defective sectors (so a plain "dd" is a no-go, because it would simply stop at the first unreadable sector).

----------

## NeddySeagoon

Master One,

You can use good old dd and tell it not to abort on errors; however, that's very slow, as it runs retries on bad blocks until it gives up and moves on. I don't know how to control the retries (maybe that's the kernel) without hacking the code.

dd_rhelp gives up on the first failure and tries somewhere else on the disk, so it recovers the maximum amount of data in the minimum time.

If you want to give dd a try, the man page tells you how. You can also use skip and seek to skip blocks in the input and output streams to manually step over bad areas.
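To make those options concrete, here is a toy run of the same dd flags against ordinary files, safe to try anywhere (on the real rescue the input would be the raw device, e.g. if=/dev/hdc, with a larger bs):

```shell
#!/bin/sh
# 12-byte stand-in for a disk, three 4-byte "blocks": AAAA BBBB CCCC
printf 'AAAABBBBCCCC' > source.img

# conv=noerror keeps dd going past read errors instead of aborting;
# sync pads short reads with zeros so output offsets stay aligned with input.
dd if=source.img of=copy.img bs=4 conv=noerror,sync 2>/dev/null

# skip/seek manually step over a bad area: copy only block 2 into place,
# leaving blocks 0-1 of the output untouched (conv=notrunc keeps the rest).
dd if=source.img of=sparse.img bs=4 skip=2 seek=2 count=1 conv=notrunc 2>/dev/null

cat copy.img   # -> AAAABBBBCCCC
```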

----------

## Master One

THIS IS UNBELIEVABLE!!! I REALLY DID IT!!!

Ok, it's not completely my victory, but Valentin's (the creator of dd_rhelp). The problem was that dd_rhelp walked straight through to about 160 GB, but then met the first set of bad sectors and jumped to a point beyond the capacity of the disk. Instead of reporting eof, this led to some other feedback and therefore to an error. The problem could easily be fixed by deleting those sections from the dd_rhelp logfile and setting the eof manually. After that, dd_rhelp kept walking and jumping through that drive (there were two areas with a few bad sectors) and indeed finished (!) after a short while (I left the machine unattended for some hours, and when I came back, it was simply done).

I then reinstalled all the drives, had to fiddle around a little to get the raid to assemble ("mdadm --assemble /dev/md7 -Rf && mdadm --manage /dev/md7 -a /dev/hdi" finally did the job), and after a reconstruction time of about 3 hours all 6 drives were up and running, and it does not seem that any data loss occurred at all (868 GB used of the 1.25TB capacity).

What an amazing experience! I still cannot believe that I could indeed save that softraid-5 with 2 of 6 drives down!

Thanks a lot, NeddySeagoon; it definitely would not have worked out without your help (I wouldn't have come across dd_rhelp without your hint).

What a wonderful day...

----------

## muchtall

Please excuse my resurrection of an ancient thread.

Where did you find the value for eof? I tried some of the values shown in fdisk -l without success. Perhaps I'm not editing the logfile properly?

----------

## Master One

I am sorry, I don't remember how I got the value for eof in my case back then; it's just too long ago. You should contact Valentin (the creator of dd_rhelp).

----------

