# Can mdadm say if an array is broken?

## doublehp

When a device is missing from an array, /proc/mdstat puts a "_" in the description. If only one device is missing in RAID5, or one or two in RAID6, the array can be repaired by adding a new volume. When doing "mdadm -a", recovery starts automatically.

When 2 are faulty in RAID5 or three in RAID6, mdadm outputs things the same way; the array cannot be fixed by adding more drives, and the data can never be recovered. After adding more drives, mdadm does NOT start recovery, and just puts the devices in as "spare". In this second case, data are lost forever, and this point is not clear at all ...

neither in /proc/mdstat nor in mdadm -D ...

Did I miss something? Or is mdadm just unable to "measure" this?

I understand that during boot the kernel adds volumes to arrays one by one, so that for a few milliseconds arrays are in degraded mode until all elements of the array are found. I understand that degraded mode is acceptable for a few seconds at boot time, and thus it is an essential and "not so alarming" state for an array. Still, I would like mdadm or the kernel to tell me "right now, at once, the array is not usable, broken, and <<if you don't add blocks pretty fast, and you perform any write on the array, you will lose data forever>>".

----------

## drescherjm

 *Quote:*   

> When 2 are faulty in RAID5 or three in RAID6, mdadm outputs things the same way; the array cannot be fixed by adding more drives, and the data can never be recovered. After adding more drives, mdadm does NOT start recovery, and just puts the devices in as "spare". In this second case, data are lost forever, and this point is not clear at all ...

 

How are you losing so many drives? This is definitely not normal. In 6 years I have not lost a single software RAID 5 or 6 array out of dozens. If your drives are just being kicked out of the array for some reason (like a loose power connector, bad SATA cable, bad SATA controller ...) and the drives are still in working condition (enough to duplicate), you can duplicate the missing drives and force the array to use the out-of-sync member if needed. If the cause of being kicked out of the array is not a bad drive, then there is no need to copy the drive; just force mdadm to assemble. There may be some inconsistencies, but that is better than losing everything.

 *Quote:*   

> I understand that during boot the kernel adds volumes to arrays one by one, so that for a few milliseconds arrays are in degraded mode until all elements of the array are found.

 

This is wrong. The array usually is not started until all members are found. 

 *Quote:*   

> Still, I would like mdadm or the kernel to tell me "right now, at once, the array is not usable, broken, and <<if you don't add blocks pretty fast, and you perform any write on the array, you will lose data forever>>".

 

You can monitor this in several ways. The mdadm daemon can be set up to email you when the array becomes degraded. Or there are other programs that monitor arrays, like Nagios.

Also, if you have these kinds of problems often, add a spare. On more than 10 arrays (over 50 hard drives) that have run 24/7 for the last 3 to 6 years, I have replaced 3 drives in total.

----------

## doublehp

I only used RAID0 and RAID1 in the past; I just bought 4 drives this morning, and I am doing heavy tests to understand how mdadm works and what my computer will tell me when one drive fails, then a second one. I am just simulating failures with mdadm -f -r for now.

Still, if I do -f vol4 -f vol3 -r vol4 -r vol3 ... /proc/mdstat says that everything is normal, while I would expect some "error", warning, or explicit message about possible data loss. IMHO, it should show a graphical difference between "you are losing redundancy", "you have lost redundancy" and "your data are corrupt". mdadm -D is not really better than mdstat.

----------

## drescherjm

 *Quote:*   

> Still, if I do -f vol4 -f vol3 -r vol4 -r vol3 ... /proc/mdstat says that everything is normal.

 

That is not normal. The status (/proc/mdstat) normally tells you as soon as the drive is kicked out of the array. 

It does this by marking missing drives with _, and it can also email you about that.

Although I have not played around with software failing of drives/members in years.
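As a rough illustration of the `_` markers mentioned above, here is a minimal Python sketch that pulls the RAID level, the `[total/up]` counters and the member map out of /proc/mdstat text. The sample text, the regular expressions and the name `parse_mdstat` are assumptions made for this example (the sample is hand-written, not output from this thread), and the exact layout of /proc/mdstat can differ between kernel versions.

```python
import re

# Hand-written illustrative sample (an assumption, not output from this thread):
# a healthy RAID1 and a RAID5 that has lost one of its four members.
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid5 sdd2[3](F) sdc2[2] sdb2[1] sda2[0]
      2930279808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

unused devices: <none>
"""

# "md1 : active raid5 ..." -> array name, state, level
ARRAY_LINE = re.compile(r"^(md\S+) : (\S+)(?: \([^)]*\))? (raid\d+|linear|multipath)\b")
# "[4/3] [UUU_]" -> configured members, working members, member map
STATUS = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return {array name: {"level", "total", "up", "map"}} from mdstat text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            current = header.group(1)
            arrays[current] = {"level": header.group(3)}
            continue
        status = STATUS.search(line)
        if status and current:
            arrays[current].update(
                total=int(status.group(1)),
                up=int(status.group(2)),
                map=status.group(3),
            )
            current = None  # the status line closes this array's entry
    return arrays
```

On a real machine the text would come from `open("/proc/mdstat").read()`. Arrays without a member map (RAID0, for instance) end up with only the `level` key in this sketch.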

----------

## drescherjm

 *Quote:*   

> IMHO, it should show a graphical difference between "you are losing redundancy", "you have lost redundancy" and "your data are corrupt".

 

Now this type of message I have not seen with any program.

However as an admin you should address a single drive failure as soon as you can.

----------

## doublehp

It just says, as you expect, [UU__] ... and *I* have to know that if the array is RAID5 then it's dead, and if it is RAID6 I have to hurry up.

Unless the messages are missing because I removed the members manually? Maybe things are different when the kernel detects faults by itself? Still, before letting me mark a volume as faulty, it should warn me. Because if another volume breaks just before I mark one as faulty ... I may rapidly end up with a broken system.

Mdadm really seems to have absolutely no backup or security against PEBCAK, or human mistakes.

----------

## doublehp

So mdadm should send me email/SMS automatically? Where do I set this?

----------

## drescherjm

 *Quote:*   

> It just says, as you expect, [UU__]

 

This is the same output as for hardware failures. Again, you can have mdadm email you that there is a problem, or have Nagios monitor your array for you and also email you.

If you are looking for this to tell you that having 1 _ in RAID 5, or 2 _ in RAID 6, is one step from data loss, the software does not spell it out that clearly. As an admin you are supposed to know that.

I think webmin will also show you this.

----------

## drescherjm

 *doublehp wrote:*   

> So mdadm should send me email/SMS automatically? Where do I set this?

 

In your mdadm.conf.

Also remember to start the mdadm daemon.
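For reference, the mail address normally goes on a `MAILADDR` line in mdadm.conf, and the daemon is mdadm running in monitor mode. The Python sketch below only builds the command line for that mode; the helper name `monitor_command` and the address `root@localhost` are placeholders invented for this example, and it assumes the stock `--monitor`, `--scan`, `--mail`, `--delay`, `--oneshot`, `--test` and `--daemonise` options of mdadm. Check `man mdadm` on your system before relying on it.

```python
import shlex


def monitor_command(mail_to, delay_seconds=None, one_test_mail=False):
    """Build an argv list for mdadm's monitor mode (nothing is executed here)."""
    argv = ["mdadm", "--monitor", "--scan", "--mail=" + mail_to]
    if delay_seconds is not None:
        # Seconds between two polls of the arrays.
        argv.append("--delay=" + str(delay_seconds))
    if one_test_mail:
        # Check the arrays once, send a test alert for each one, then exit.
        argv += ["--oneshot", "--test"]
    else:
        # Detach and keep watching in the background.
        argv.append("--daemonise")
    return argv


# Placeholder address; print the command instead of running it.
print(shlex.join(monitor_command("root@localhost", one_test_mail=True)))
```

Running the resulting command needs root and a working local mail setup; the one-shot test variant is a cheap way to check that mail actually arrives before trusting the daemon.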

----------

## drescherjm

Also the /proc/mdstat will say degraded on the array that does not have all of its drives.

----------

## doublehp

I never used the conf file; I always let the magic do things. Can't mdadm store the email address directly on the drives? Or make the monitoring daemon do it without declaring arrays in the conf?

I tried several times to declare the drives in the conf, and always ran into trouble.

----------

## drescherjm

There is some good info here:

http://en.gentoo-wiki.com/wiki/RAID/Software

----------

## doublehp

 *drescherjm wrote:*   

> Also the /proc/mdstat will say degraded on the array that does not have all of its drives.

 

No, it did not; not even that. Unless it's a kernel/mdadm version problem. I am ATM using an old system (stable Debian); I will have a better one in two days (stable Gentoo).

----------

## drescherjm

I may be wrong about it saying degraded in the /proc/mdstat. I do know it does send out emails though.

----------

## drescherjm

BTW. Here are examples of the email nagios can send:

```
***** Nagios 2.10 *****

Notification Type: PROBLEM

Service: Linux Raid Status for md1

Host: dev6

Address: dev6.radimg.pitt.edu

State: CRITICAL

Date/Time: Wed Sept 17 15:39:30 EDT 2008

Additional Info:

CRITICAL md1 status=[UUU_].
```

```
***** Nagios 2.10 *****

Notification Type: PROBLEM

Service: Linux Raid Status for md2

Host: dev6

Address: dev6.radimg.pitt.edu

State: WARNING

Date/Time: Thu Sept 18 13:08:20 EDT 2008

Additional Info:

WARNING md2 status=[UUU_], recovery=80.2%, finish=29.5min.

```
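The two mails above suggest a simple mapping: a degraded array with no rebuild running is CRITICAL, and a degraded array that is rebuilding is WARNING. The following Python sketch shows how a check of that kind could produce such lines. It is not the plugin that generated these mails; the function name `check_array` and its parameters are made up for this example, and the mapping is inferred from the two samples only. The numeric codes are the usual Nagios plugin exit codes.

```python
# Usual Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def check_array(name, status_map, recovery=None, finish=None):
    """Return (exit_code, message) for one array.

    status_map is the member map from /proc/mdstat without brackets, e.g. "UUU_".
    recovery and finish are the optional rebuild progress strings, e.g. "80.2%".
    """
    detail = "status=[%s]" % status_map
    if "_" not in status_map:
        code = OK
    elif recovery is not None:
        code = WARNING  # degraded, but a rebuild is already running
        detail += ", recovery=%s, finish=%s" % (recovery, finish)
    else:
        code = CRITICAL  # degraded and nothing is rebuilding
    return code, "%s %s %s." % (LABELS[code], name, detail)
```

A real plugin would print the message and exit with the returned code, for example `code, msg = check_array("md1", "UUU_")` followed by `print(msg)` and `sys.exit(code)`.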

----------

## doublehp

 *drescherjm wrote:*   

> BTW. Here are examples of the email nagios can send:

 

That's nice :) Thanks.

----------

## Monkeh

 *doublehp wrote:*   

> It just says, as you expect, [UU__] ... and *I* have to know that if the array is RAID5 then it's dead, and if it is RAID6 I have to hurry up.

 

I know, it's horrible, you have to understand how it works.

----------

## doublehp

 *Monkeh wrote:*   

>  *doublehp wrote:*   It just says, as you expect, [UU__] ... and *I* have to know that if the array is RAID5 then it's dead, and if it is RAID6 I have to hurry up.
> 
> I know, it's horrible, you have to understand how it works.

 

When I have a full page with half a dozen or a dozen RAID arrays, I will not count the "_" for every single one of them and then check whether that number is acceptable for the RAID type of the array. A very simple sign could make this nicely explicit:

- "-" array is fully operational

- "/" some redundancy is missing

- "!" state is critical, you have no redundancy left

- "?" you lost too many drives, data are not recoverable (unless you can introduce a drive that you are sure is in sync with this array, and the array is RO)

A very simple sign like this, a single character placed just after the [UUUU__] (in /proc/mdstat, with the corresponding full descriptions in mdadm -D), could easily show what I am talking about. As it is, mdadm says "degraded" in every case.

For example, 3 missing drives in a 5-drive RAID1 should just be "/". Just to illustrate: [UU___] /
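The four signs proposed above can be computed from nothing more than the RAID level and the member map. Here is a minimal Python sketch of that idea; `tolerated_failures` and `health_sign` are hypothetical names, not part of mdadm or the kernel, and only the levels discussed in this thread are covered (RAID10 is left out because its tolerance depends on the layout).

```python
def tolerated_failures(level, total):
    """How many members an array of this level can lose and still hold its data."""
    if level == "raid1":
        return total - 1  # every member is a full copy
    if level in ("raid4", "raid5"):
        return 1
    if level == "raid6":
        return 2
    if level in ("raid0", "linear"):
        return 0
    raise ValueError("level not covered by this sketch: %s" % level)


def health_sign(level, status_map):
    """Map a member map such as "UU__" to one of the signs - / ! ?"""
    total = len(status_map)
    missing = status_map.count("_")
    if missing == 0:
        return "-"  # fully operational
    margin = tolerated_failures(level, total) - missing
    if margin > 0:
        return "/"  # some redundancy is missing, some is left
    if margin == 0:
        return "!"  # critical: no redundancy left
    return "?"  # too many members lost
```

With this sketch, `health_sign("raid1", "UU___")` gives "/" as in the example above, while the same [UU__] map gives "?" for RAID5 and "!" for RAID6.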

The mail system proposed by drescherjm seems nice to me.

----------

## drescherjm

Here is some info on where to begin for nagios:

http://www.gentoo.org/doc/en/nagios-guide.xml

----------

