# mdadm RAID5 recovery and corruption detection suggestions

## Cyker

Argh!

Had a worst-case scenario with my server; Came home and found it totally unresponsive - Rebooted it and the BIOS was showing that Disk2 of my RAID array had failed.

"Meh" I thought, swapped it out with a spare and booted up the system.

Some of you will probably guess what happened next, but basically the RAID wouldn't assemble.

It turns out if you have a disk failure which also crashes the system, the RAID array becomes both inconsistent and degraded and mdadm will refuse to start it up as there is a possibility of data corruption! (The infamous "cannot start dirty degraded array" issue...)

(For those of you that ever run into this, the correct procedure to get around the dirty degraded array is:

1) Replace failed disk

2) Run mdadm --stop --scan

3) Run mdadm --assemble --run --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

It took me several hours of forum trawling to distill this out of countless overly-verbose and unintelligible posts...

The answer is actually also in the kernel md.txt doc, but it doesn't mention the very important step of running "mdadm --stop --scan" first. If anything is set to auto-mount your RAID, the --assemble command won't work until the array is stopped again, and instead throws up scary and misleading errors about the RAID elements - but only a handful of people mention this very important fact!)
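
Put together as commands, the sequence is (a sketch only: /dev/md0 and /dev/sd[a-d]1 are the device names from my setup, and /dev/sde1 stands in for the replacement disk - double-check yours with mdadm --examine before forcing anything):

```

# 1) Physically replace the failed disk, then stop any half-assembled
#    array that auto-mount/udev may already have grabbed:
mdadm --stop --scan

# 2) Force-assemble and start the dirty, degraded array from the
#    surviving members:
mdadm --assemble --run --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

# 3) Add the replacement disk so the rebuild can begin, then watch it:
mdadm --manage /dev/md0 --add /dev/sde1
cat /proc/mdstat

```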

Now, my main issue is that of the whole RAID5 write-hole thing, which I'm guessing enters into this?

Silent data corruption is bad, so while the RAID is resyncing, I'm running fsck to check it.

fsck successfully recovered the journal and mopped up some orphans, and is now checking the entire 6TB data space as I type this (It typically takes over 8 hours... ;_; )

Are there any other checks I can run to mitigate/discover/fix any data corruption that may have occurred?

My biggest fear is silent, undetectable data corruption creeping into the drive; It makes me wish btrfs and zfs were proven so I could migrate away from ext4 to something with error detection and correction!

Oh, and happy new year!

----------

## eccerr0r

This is exactly the problem with the raid5 write hole... if you partially write, a disk fails, and the power goes off (or system crashes), you can't tell what the correct data should be when doing a rebuild.

Now why did the disk failure crash the system?  You need to make sure the controller doesn't stop responding when a disk connected to it stops working.

I have had disk failures on my raid5 before but the system kept on going... it should...

----------

## Cyker

Yeah, coulda-shoulda-woulda....

It looks like the actual drive electronics broke: when I reset the machine, it took a long time to detect SATA2 (the port the dead drive was connected to) and sometimes detected it as a Hard Disk and sometimes as Not Connected. If it was detected, it would then be reported as a failed disk, and in the BIOS the drive model was there but capacity, geometry, supported transfer speeds etc. were all reported as 0.

I've seen things like this before when testing USB flash drives with faulty controllers in Windows - on plugging in and attempting to access the flash drive, the whole system would lock up almost instantly with maxed kernel CPU (well, you see that in Task Manager just before all input stops being registered and the screen freezes!).

I have also had it when trying to read certain copy-protected CDs and DVDs in my old Plextor drive (It was a 760A, which was EXTREMELY intolerant of out-of-spec discs and would lock up the IDE bus in some kind of wait mode trying to read them!)

In both cases, unplugging the drive/ejecting the disc allowed the system to recover. Oddly it never actually caused Windows to BSOD, just lock until it was unplugged.

I suspect that, in this case, if I had unplugged that drive, the system would have also recovered, but obviously I had no way of knowing which drive was responsible!

I too have had disk failures before, but in those cases it was the actual disk/motor which failed, triggering SMART and many, many errors, and the disk was kicked off the array so I could safely shut down the system to replace it.

This is the first time I've had a disk fail like this on this system and cause it to freeze with the HDD light on solid!

smartd logs don't list anything up to the point of failure so whatever it was it was very sudden. This was an RE4-GP too, enterprise-grade with a 5 year warranty and manufactured pre-disaster!

Journal recovery and fsck fixed a few things so hopefully nothing else was corrupted; Am running a diff between it and the backup to see if I can spot any corruption, but I feel this is mostly an exercise in wasting electricity as anything that was damaged is probably stuff that was in use at the time and thus not backed up!
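
For anyone wanting to do the same compare, plain diff will do it (a sketch; the scratch directories below stand in for the array's mount point and the backup copy):

```

# Compare two trees and list files whose contents differ.
array=$(mktemp -d); backup=$(mktemp -d)
printf 'hello\n' > "$array/a.txt";  printf 'hello\n' > "$backup/a.txt"
printf 'world\n' > "$array/b.txt";  printf 'w0rld\n' > "$backup/b.txt"

# -r recurse into subdirectories, -q only report *whether* files differ
result=$(diff -rq "$array" "$backup" || true)
echo "$result"

rm -rf "$array" "$backup"

```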

This uncertainty is not a good feeling... Roll on btrfs/zfs!

----------

## Cyker

Hmm, it also seems like the write-intent bitmap has disappeared; Is that normal when replacing a disk?!

(Would that have even helped with the 'write hole' problem?)

----------

## eccerr0r

External or Internal write-intent bitmap?

I don't think I'm using one because I never created one.  I suppose I got lucky, my disk failures tend to occur during reads :o

In any case, I really wonder if the benefit of saving that additional metadata outweighs the cost in performance and disk space...

----------

## Cyker

If you cat /proc/mdstat it should tell you if you have one.

My one would have been internal as that was the easiest to set up! 

It doesn't take up much space but apparently has an impact on write performance. As my bottleneck is Samba, which benches an order of magnitude slower than the array even over a gigabit link (seriously, there must be a better network filesystem...), the performance penalty isn't really noticeable - and with an array this big, anything to reduce sync time is welcome!

(That said I always resync and fsck the array after reboots and that can often take a day so it's probably not really saving me any time! At least it lets the array recover faster so it doesn't hold up the boot process so much...)
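
Checking is easy enough; a sketch (the mdstat text below is a made-up sample, and /dev/md0 is just the device name used in this thread):

```

# Parse a /proc/mdstat-style listing for the bitmap line. In real use
# you'd read /proc/mdstat itself instead of this sample text.
mdstat_sample='md0 : active raid5 sdd1[3] sdc1[2] sdb1[1] sda1[0]
      5860145664 blocks level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 3/15 pages [12KB], 65536KB chunk'

if printf '%s\n' "$mdstat_sample" | grep -q 'bitmap:'; then
    status='write-intent bitmap present'
else
    # One can be (re)added to a running array with:
    #   mdadm --grow /dev/md0 --bitmap=internal
    status='no write-intent bitmap'
fi
echo "$status"

```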

----------

## eccerr0r

I only resync when needed, or every month in the wee hours of the morning while I'm sleeping, hoping that exercises my disks enough.  And I guess I don't have a write-intent bitmap.  My RAID5 is 4-disk + 1-hotspare, 500GB, and is always on, and I think it's slow as heck, especially through my VMs.  Not sure why.  Sometimes when the system goes down unexpectedly it decides to resync on its own, then for the next few hours the machine is dreadfully slow...

----------

## Cyker

Yeah, the SW RAID5 puts a lot of I/O and CPU load on the system, and if you're trying to do other high-load things on it (e.g. running a VM!) the performance drop is pretty steep. I know AMD systems were originally much better than Intel systems at this because of HyperTransport vs. GTL, but since Intel started copying the AMD bus approach they're about on par, I think.

My issue is mine is basically a fileserver, but the gigabit card doesn't seem to have any sort of acceleration to off-load some of the work, and because I have a normal network the MTU is stuck at 1500, so at full tilt the poor CPU is getting absolutely hammered by Samba, RAID5 calcs and network packet processing!

I'm surprised there isn't some sort of GPU off-load for the RAID5 parity calculations TBH as that could speed things up a lot!

Then again can you trust something that critical to a GPU...?

----------

## frostschutz

 *Cyker wrote:*   

> I'm surprised there isn't some sort of GPU off-load for the RAID5 parity calculations TBH as that could speed things up a lot!

 

```
[    0.070887] xor: automatically using best checksumming function:
[    0.080279]    avx       : 28824.000 MB/sec
[    0.098292] raid6: sse2x1   10207 MB/s
[    0.115300] raid6: sse2x2   12835 MB/s
[    0.132311] raid6: sse2x4   15000 MB/s
[    0.149322] raid6: avx2x1   19878 MB/s
[    0.166331] raid6: avx2x2   22929 MB/s
[    0.183340] raid6: avx2x4   26777 MB/s
[    0.183424] raid6: using algorithm avx2x4 (26777 MB/s)
[    0.183509] raid6: using avx2x2 recovery algorithm
```

----------

## depontius

 *Cyker wrote:*   

> 
> 
> My biggest fear is of silent undetectable data creeping into the drive; It makes me wish btrfs and zfs were proven so I can migrate away from ext4 to something with error detection and correction!
> 
> 

 

I don't get this.  I read the Ars Technica article too, about how btrfs and zfs "heal" data that might otherwise be silently accumulating bit-rot.

Here's the trouble...I've always been under the impression that hard disks have stream-oriented ECC - in fact, occasionally my smart logs have things like "ECC Recovered" messages in them.  I also thought that the drive electronics kept track of this kind of thing and reallocated sectors, as needed.  I've been under the impression that a hard drive would not allow silent corruption to occur, unless it was some sort of multi-bit error that would creep through the ECC algorithms.  Given that, and given that any Linux redundancy-oriented RAID knows which blocks map together, I would have thought that, for instance, RAID1 could easily heal the bad side with the good side.  It may know nothing about the files, but it certainly knows the blocks.

By the way, I run RAID1 at home, and have had drive electronics fail, too.

----------

## Goverp

 *depontius wrote:*   

> ...
> 
> Ars Technica article too, about how btrfs and zfs "heal" data that might otherwise be silently accumulating bit-rot.
> 
> ...

 

IMHO this sounds suspiciously like marketing hype:

"Even as you read this, your vital bodily fluids are being corrupted ! ! ! 

ONLY DR STRANGELOVE'S CRYSTAL TUNED QUANTUM WATER CAN SAVE YOU."

Does anyone have figures to show how often btrfs or zfs find and fix errors?  Anyway, my consumer-grade PC doesn't even have parity checked memory, let alone ECC.  Without that, who cares about the file system.

----------

## Cyker

Hard disks do have their own ECC but it wouldn't help here; If you write bogus data to a HDD, the HDD's ECC will dutifully make sure that it stays bogus data!

The HDD has no idea whether a given chunk of data is valid or not - It just takes the data chunk, calculates ECC for it and stores it. It couldn't tell whether e.g. a zip file has been corrupted when it was written - It only understands blocks of 512 (Or now 4096) bytes. HDD ECC is mainly designed to stop errors creeping in from the inherently fuzzy nature of current high density magnetic disk storage.

The btrfs/zfs ECC is higher up the chain and will hopefully notice file-level corruption, as it understands what a file is and can calculate its ECC accordingly. In my case, it would help mitigate the RAID5 write hole as, if some RAID stripes were corrupted, the file data would be wrong and the filesystem would notice.
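
The principle is easy to demonstrate in userspace, with an ordinary checksum tool standing in for the filesystem's block checksums (a toy sketch, nothing mdadm- or btrfs-specific):

```

# A stored checksum detects corruption the drive itself can't see,
# because to the drive both versions are perfectly valid sectors.
tmp=$(mktemp -d)
printf 'important data\n' > "$tmp/file"
sha256sum "$tmp/file" > "$tmp/file.sum"

# Simulate silent corruption: one flipped character, happily stored.
printf 'importent data\n' > "$tmp/file"

if sha256sum -c "$tmp/file.sum" >/dev/null 2>&1; then
    verdict='file verified'
else
    verdict='corruption detected'
fi
echo "$verdict"
rm -rf "$tmp"

```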

Frankly, this is something that should have been implemented in file-systems decades ago. (Seriously, how many times have you cursed floppy disks back in the day due to data being corrupted?)

The problem with RAID5 is it isn't designed very well; The only checksum (the parity) is separate from the rest of the data stripes, so if it gets corrupted or is lost due to disk failure then there is NO checksum at all.

If, as in my case, the RAID has lost its parity AND is not shut down cleanly, it has NO IDEA whether the data is valid or not; even tho' it can regenerate stripes and parity data from the remaining disks, it has nothing to compare them against.

Worse, if some data WAS corrupted and the array is rebuilt, this corrupt data is now considered uncorrupted by the RAID, and you have no way of knowing unless you have some other way of detecting such corruption (e.g. diff'ing against a backup, tripwire, filesystem ECC, etc.)
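
That "nothing to compare against" property falls straight out of the XOR arithmetic RAID5 parity is built on (a toy sketch; single bytes stand in for whole stripes):

```

# Toy RAID5 stripe: three data "blocks" plus XOR parity.
d1=$((0xA5)); d2=$((0x3C)); d3=$((0x5A))
parity=$(( d1 ^ d2 ^ d3 ))

# Disk holding d2 dies: XOR of the survivors regenerates it exactly.
rebuilt_ok=$(( d1 ^ d3 ^ parity ))
printf 'rebuilt d2 = 0x%02X\n' "$rebuilt_ok"      # 0x3C, as originally written

# Now silently corrupt d1 and rebuild d2 again: the arithmetic still
# "works", so the array happily produces wrong data with no warning.
d1=$((0xFF))
rebuilt_bad=$(( d1 ^ d3 ^ parity ))
printf 'rebuilt d2 = 0x%02X\n' "$rebuilt_bad"     # plausible-looking, but wrong

```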

With an ECC-capable filesystem, the filesystem itself can check and repair any damage. For instance, I was able to detect some damage in various files I was torrenting - qbittorrent was able to isolate the corrupted blocks and re-download them as the checksums didn't match, and I was also able to verify some RAR archives were still valid as that also supports ECC.

(On a side note, c'mon 7zip, where's our ECC support?? RAR is catching up with its new file format that supports ludicrously large dictionary sizes!)

The rest of my ext4 filesystem did some journal recovery but ultimately, as ext4 has no ECC ability, it doesn't know if any of the files are damaged either...

----------

## depontius

 *Cyker wrote:*   

> Hard disks do have their own ECC but it wouldn't help here; If you write bogus data to a HDD, the HDD's ECC will dutifully make sure that it stays bogus data 

 

I was more thinking of data getting bit-rot while sitting on a drive in a RAID, which is what Ars Technica was talking about.  Assume for the moment that correct data was written onto the drive, then I don't see it silently getting corrupted.  In a relevant fashion, yesterday I looked up RAID scrubbing, and there are indeed mechanisms to carve out some background bandwidth and do so, as well as repair if needed.
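
For Linux software RAID, that scrub mechanism is driven through sysfs (a sketch; md0 is an example device name, and these commands need root and a real array):

```

# Request a full consistency check: reads every stripe and verifies
# parity against data (runs in the background, throttled under load):
echo check > /sys/block/md0/md/sync_action

# When it finishes, a nonzero count here means parity mismatches found:
cat /sys/block/md0/md/mismatch_cnt

# To have md rewrite parity wherever it disagrees with the data:
# echo repair > /sys/block/md0/md/sync_action

```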

I'm not criticizing the desirability of filesystem ECC - I just think they might have been overstating the case of how bad a hard drive is at doing its job.

----------

## Cyker

Ah, get ya, thought we were talking about my RAID issue still; Got the wrong end of the proverbial stick there

The undetectable data corruption I was talking about was specifically the potential 'RAID5 write hole' from my recent crash, not random bit errors (as you say, those are extremely unlikely; you're more likely to suffer data loss from disk damage than from an uncorrectable bit error, I suspect!), which is where the confusion lies, I think.

Ars may have overstated it - they do tend to dramatise things (I do miss sites like Ace's Hardware because of things like that!) - but it's always nice to have filesystem-level ECC as that would help in both cases.

I really need to find a better way to back up this array tho'...

----------

## Ralphred

 *Cyker wrote:*   

> I really need to find a better way to back up this array tho'...

 

Read the rest of the article; as much as on-the-fly consistency checking is appealing, I have a tendency to make things consistent, but wrong. The differential-style backups appeal to the tinkerer in me...

On the subject of RAID5, am I the only one who tries to find drives of varying ages to build them from, so the MTTF of each drive is offset enough to cope with the stress of a failed-drive rebuild?

----------

## frostschutz

 *Ralphred wrote:*   

> On the subject of raid 5, am I the only one who tries to find drives of varying ages to build them from, so the MTtF of each drive is offset enough to cope with the stress of a failed drive rebuild?

 

Same-time failure is possible, it's just unlikely. Disks of varying ages do not really help here. Every disk will see different wear and tear, due to different access patterns (even in a RAID), different temperatures, different vibrations in the case, etc.; so many variables are involved that in the end the failure time is random.

Defective sectors can go undetected for a very long time, and most people do not regularly test their drives. In such a setup, when the first drive fails completely and you rebuild, that rebuild being the first read test you ever did, of course you run into the defective sectors the other drives have had for months. That doesn't mean the drives failed at the same time, it just means you didn't notice.

As for the stress of a rebuild, there is none. A rebuild is a linear read/write; apart from being idle, that's the least stressful operation a HDD can do. And mdadm is designed to throttle rebuild speed when the disks are otherwise in use.

Rebuilds don't stress disks, databases do (if they're busy), and everything else that does random I/O 24/7. There is no reason why a rebuild should be particularly deadly to your disks. If that was the case, you'd see HDD failure in DB servers on a daily basis.

----------

## eccerr0r

Sometimes I wonder about in-filesystem ECC compared with doing RAID5 on a single disk (multiple partitions).  Granted, yes, performance will tank because of seek times, but given that when disks fail, a whole swath of disk usually fails near the written portion, the failed swath will likely be contained within one partition, and your data lives on due to RAID5...

Really, in-filesystem ECC just adds another layer of ECC protection on top of what the hard drive does already.  If it can keep the redundant data away from the real data, there might be value; but yes, as in the single-disk/multiple-partition RAID5, there's a speed penalty for writing the redundant data away from the main data...

Speaking of which, I do wonder how much is lost in the move from 512-byte to 4096-byte sectors, and how hard drive manufacturers are reducing ECC bits to increase space...  It sometimes sounds like cheating, and we will eventually require ECC filesystems.

Then again, the HDD still has to present a zero-bad-block disk to the system...

----------

## depontius

 *frostschutz wrote:*   

> 
> 
> Same time failure is possible, it's just unlikely. Disks of varying ages do not really help here. Every disk will see different wear&tear, due to different access patterns (even in a RAID), different temperature, different vibrations in the case etc. etc., so many variables involved that in the end the failure time is random.

 

A while back I had a friend in systems support who said quite the opposite.  The temperature and vibration environments are common elements that pretty much swamp out any differences caused by access patterns.  Getting same-age drives increases the chance that those drives are of the same build vintage, increasing the likelihood that they will have similar sensitivities and responses to aging.

Skipping the theoretical and getting back to the real world: they dreaded power failures most, followed by planned power-downs, followed by drive failures.  Any time you took a system that had been running 24x7x365 and stopped it, there was a very real chance it wouldn't power back up, and disk drives were a big part of that.  As for drive failures, there was an "enhanced probability" that after replacing the failed drive, a second drive would fail when they powered back up.  Backups, backups, backups.

 *frostschutz wrote:*   

> 
> 
> Defective sectors can go undetected for a very long time, and most people do not regularly test their drives. In such a setup, when the first drive fails completely and you rebuild, that rebuild being the first read test you ever did, of course you run into the defective sectors the other drives have had for months. That doesn't mean the drives failed at the same time, it just means you didn't notice.

 

I just learned about RAID scrubbing this week.  Fortunately both RAID1s that I have running currently are clean.  I plan to tuck this into a weekly cron job when I get around to it.  By the way, I have SMART running a short self-test daily and a long self-test weekly.  (I've had SMART report impending failures before, and been able to plan corrective action.  Once for my mother's computer 600+ miles away, that included travel.)
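
A crontab sketch for that kind of schedule (device names and times are placeholders; the md check line assumes Linux software RAID, and smartmontools' smartd can schedule the self-tests for you instead):

```

# /etc/crontab fragment (illustrative only)

# Weekly RAID scrub, Sunday 03:00
0 3 * * 0   root  echo check > /sys/block/md0/md/sync_action

# Daily short SMART self-test on weekdays, long test on Saturday
0 2 * * 1-5 root  smartctl -t short /dev/sda
0 2 * * 6   root  smartctl -t long  /dev/sda

```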

----------

## frostschutz

 *depontius wrote:*   

> Any time you took a system that had been running 24x7x365 and stopped it, there was a very chance it wouldn't power back up, and disk drives were a big part of that.

 

Sure, it's possible for drives to lose the ability to spin up at some point. The question is, when does that happen? At the same time, really? At random times, but since no one goes as far as sending one disk at a time to standby once a month just to see if it would manage to wake up again, it's an undetected failure in a 24/7/365 server?

 *depontius wrote:*   

> Backups, backups, backups.

 

Yep. You need those either way, lots of ways to lose data with a perfectly working RAID.

 *depontius wrote:*   

> By the way, I have SMART running a short self-test daily and a long self-test weekly.

 

I run a daily selective self-test instead, which tests only a fraction of the disk but covers the entire disk over the course of X days, X depending on the selective region size you choose.

The problem with the long tests is that they take a very long time to complete (since HDDs became so large) - on a busy server it might be > 24 hours. With selective tests you can schedule them for the times your server is most idle instead.
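
With smartctl that looks roughly like this (a sketch; /dev/sda and the LBA span are placeholders, and the commands need a real drive):

```

# Test an explicit span of LBAs:
smartctl -t select,0-99999999 /dev/sda

# Or let smartctl continue from where the last selective test stopped,
# wrapping at the end of the disk - run daily from cron and the whole
# drive gets covered over several days:
smartctl -t select,next /dev/sda

# Review progress and results:
smartctl -l selftest /dev/sda

```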

----------

## frostschutz

While on the topic of bitrot and RAID5: kernel 3.13 has a bug with raid5/6 that makes it oops and freeze - total system halt, unsaved data lost, resync on reboot:

https://lkml.org/lkml/2014/1/19/115

I haven't tried the patch in that discussion yet; hopefully it will be fixed in 3.13.1.

----------

