# Emergency mdadm help... [solved]

## gr0x0rd

Came home yesterday, sat down at my desk, went to close a tab in Firefox and poof, the app disappeared. Tried relaunching but got an "Input/Output error". I should have been more diligent at this point, but I decided to simply reboot and check my BIOS. After finding nothing in there (all devices show up), I tried to load Gentoo but ran into a kernel panic: it couldn't find my root partition.

I have 4x 2TB drives in a RAID5 config for my root. My swap is also a RAID5, and (naturally) my boot partition is RAID1. 

After booting with a Gentoo minimal CD, I can see my arrays...

```
livecd ~ # cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md125 : active (auto-read-only) raid5 sdb2[1] sdd2[3] sdc2[2] sda2[0]
      3144192 blocks level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

md126 : inactive sdb3[1](S) sdc3[2](S) sdd3[3](S) sda3[0](S)
      7809597440 blocks

md127 : active (auto-read-only) raid1 sdc1[2] sdb1[1] sdd1[3] sda1[0]
      65472 blocks [4/4] [UUUU]

unused devices: <none>
```

md125 is actually md2, md126 is actually md3, and md127 is actually md1. So it seems something deactivated my root array, and since it's not in an active state, the system can't boot. If I try to activate it...

```
livecd ~ # mdadm -A /dev/md126
mdadm: /dev/md126 not identified in config file.
```

I am unable to do so, as my mdadm.conf is stored within the md126 array (my root filesystem)!

How do I activate this array and resync it so I can boot my system again? I'd start blazing my own trail here, but most of the documentation I've found assumes you have access to your root filesystem, which I don't, and I'm kind of nervous about destroying my data...   :Shocked:  I've resynced an array after losing a drive before, but this is the first time I've run into an array that deactivated while running, and of course all my logs are stored on that same array, so I have no way to investigate what went wrong.

Any sage advice would be much appreciated! Thanks in advance.

----------

## NeddySeagoon

gr0x0rd,

Look at all of the partitions that should be in your root array with mdadm -E /dev/sda3 and so on; that might reveal the problem.
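A quick filter can make those per-partition checks easier to compare side by side. This is only a sketch (the `examine_summary` name is made up), demonstrated on a captured excerpt rather than a live device:

```shell
#!/bin/sh
# examine_summary is a hypothetical helper: it trims `mdadm -E`
# output down to the fields that matter for diagnosis. On the live
# system you would pipe the real command into it:
#   mdadm -E /dev/sda3 | examine_summary
examine_summary() {
    grep -E '^ *(State|Update Time|Events) :'
}

# Demo on a captured excerpt (no root or real devices needed):
examine_summary <<'EOF'
    Update Time : Wed Nov 16 01:03:07 2011
          State : clean
         Events : 544422
EOF
```

Comparing State, Update Time and Events across all four components usually shows at a glance which members fell out of the array and when.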

There may also be useful info in dmesg.

If none of that helps, mdadm -A ...  by hand should produce a failure message with some useful info.

If the assembly fails, mdadm will tell you why.  Do not use the force option with mdadm yet.

----------

## gr0x0rd

Thanks for the reply Neddy.

```
livecd ~ # mdadm -E /dev/sda3
/dev/sda3:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 14bcf749:0f71556c:cb201669:f728008a (local to host livecd)
  Creation Time : Mon Apr 18 18:58:11 2011
     Raid Level : raid5
  Used Dev Size : 1952399360 (1861.95 GiB 1999.26 GB)
     Array Size : 5857198080 (5585.86 GiB 5997.77 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 3
    Update Time : Wed Nov 16 01:03:07 2011
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 2
  Spare Devices : 0
       Checksum : a08926cf - correct
         Events : 544422
         Layout : left-symmetric
     Chunk Size : 512K

      Number   Major   Minor   RaidDevice State
this     0       8        3        0      active sync   /dev/sda3

   0     0       8        3        0      active sync   /dev/sda3
   1     1       0        0        1      faulty removed
   2     2       8       35        2      active sync   /dev/sdc3
   3     3       0        0        3      faulty removed
```

So it looks like two of the drives failed simultaneously, which brought down the array. The cleaning lady was vacuuming at the time... maybe some sort of power surge? Lousy desktop UPS... 

```
livecd ~ # mdadm --assemble /dev/md126 --verbose /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3
mdadm: looking for devices for /dev/md126
mdadm: cannot open device /dev/sda3: Device or resource busy
mdadm: /dev/sda3 has no superblock - assembly aborted
```

As suggested, I won't force it yet. Some relevant dmesg output:

```
scsi 4:0:0:0: Direct-Access     ATA      WDC WD20EARS-00M 51.0 PQ: 0 ANSI: 5
sd 4:0:0:0: [sda] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
sd 4:0:0:0: [sda] Write Protect is off
sd 4:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 4:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
scsi 4:0:1:0: Direct-Access     ATA      WDC WD20EARS-00M 51.0 PQ: 0 ANSI: 5
scsi 5:0:0:0: Direct-Access     ATA      WDC WD20EARS-00M 51.0 PQ: 0 ANSI: 5
sd 5:0:0:0: [sdc] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
sd 5:0:0:0: [sdc] Write Protect is off
sd 5:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
scsi 5:0:1:0: Direct-Access     ATA      WDC WD20EARS-00M 51.0 PQ: 0 ANSI: 5
scsi 7:0:0:0: CD-ROM            HL-DT-ST DVDRAM GH22LS50  TL00 PQ: 0 ANSI: 5
sr0: scsi3-mmc drive: 48x/48x writer dvd-ram cd/rw xa/form2 cdda tray
cdrom: Uniform CD-ROM driver Revision: 3.20
sr 7:0:0:0: Attached scsi CD-ROM sr0
sd 5:0:1:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
 sdc: sdc1 sdc2 sdc3
sd 5:0:1:0: [sdd] Write Protect is off
sd 5:0:1:0: [sdd] Mode Sense: 00 3a 00 00
sd 5:0:1:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 4:0:1:0: [sdb] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
 sda: sda1 sda2 sda3
sd 4:0:1:0: [sdb] Write Protect is off
sd 4:0:1:0: [sdb] Mode Sense: 00 3a 00 00
sd 4:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdb: sdb1 sdb2 sdb3
sd 4:0:0:0: [sda] Attached SCSI disk
sd 4:0:1:0: [sdb] Attached SCSI disk
 sdd: sdd1 sdd2 sdd3
sd 5:0:0:0: [sdc] Attached SCSI disk
sd 5:0:1:0: [sdd] Attached SCSI disk
sd 4:0:0:0: Attached scsi generic sg0 type 0
sd 4:0:1:0: Attached scsi generic sg1 type 0
sd 5:0:0:0: Attached scsi generic sg2 type 0
sd 5:0:1:0: Attached scsi generic sg3 type 0
...
md: bind<sda1>
md: bind<sda3>
md: bind<sdd1>
md: bind<sda2>
md: bind<sdb1>
md: bind<sdc2>
md: bind<sdd3>
md: bind<sdc3>
md: bind<sdb3>
md: bind<sdc1>
bio: create slab <bio-1> at 1
md/raid1:md127: active with 4 out of 4 mirrors
md127: detected capacity change from 0 to 67043328
 md127: unknown partition table
md: bind<sdd2>
md: bind<sdb2>
md/raid:md125: device sdb2 operational as raid disk 1
md/raid:md125: device sdd2 operational as raid disk 3
md/raid:md125: device sdc2 operational as raid disk 2
md/raid:md125: device sda2 operational as raid disk 0
md/raid:md125: allocated 4272kB
md/raid:md125: raid level 5 active with 4 out of 4 devices, algorithm 2
RAID conf printout:
 --- level:5 rd:4 wd:4
 disk 0, o:1, dev:sda2
 disk 1, o:1, dev:sdb2
 disk 2, o:1, dev:sdc2
 disk 3, o:1, dev:sdd2
md125: detected capacity change from 0 to 3219652608
 md125: unknown partition table
```

So all of the drives are being picked up, which is good, but md126 is skipped during the binding process, most likely because of its state.

What could be tying up /dev/sda3? I tried the command leading with sdb3 and got the same result.

*Edit* 

I was able to free the devices by issuing

```
mdadm --stop /dev/md126
```

After this, 

```
livecd ~ # mdadm --assemble /dev/md126 --verbose /dev/sdb3 /dev/sda3 /dev/sdc3 /dev/sdd3
mdadm: looking for devices for /dev/md126
mdadm: /dev/sdb3 is identified as a member of /dev/md126, slot 1.
mdadm: /dev/sda3 is identified as a member of /dev/md126, slot 0.
mdadm: /dev/sdc3 is identified as a member of /dev/md126, slot 2.
mdadm: /dev/sdd3 is identified as a member of /dev/md126, slot 3.
mdadm: added /dev/sdb3 to /dev/md126 as 1
mdadm: added /dev/sdc3 to /dev/md126 as 2
mdadm: added /dev/sdd3 to /dev/md126 as 3
mdadm: added /dev/sda3 to /dev/md126 as 0
mdadm: /dev/md126 assembled from 2 drives - not enough to start the array.
```

Obviously there is something I need to do to bring back the two drives that were marked as failed. From mdadm -E I can see the states of the drives... 

```
drive    state     update time
sda3     clean     Wed Nov 16 01:03:07 2011
sdb3     active    Tue Nov 15 22:33:33 2011
sdc3     clean     Wed Nov 16 01:03:07 2011
sdd3     active    Tue Nov 15 22:33:33 2011
```
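The "which members went stale" question can be read off those timestamps mechanically. A sketch (POSIX shell plus GNU `date`; the device names and timestamps are copied from the table above, not read from live disks):

```shell
#!/bin/sh
# Convert each component's "Update Time" to epoch seconds and sort:
# the members with the older timestamps are the ones the kernel
# kicked out (here sdb3 and sdd3). Requires GNU date for -d.
for entry in \
    "sda3|Wed Nov 16 01:03:07 2011" \
    "sdb3|Tue Nov 15 22:33:33 2011" \
    "sdc3|Wed Nov 16 01:03:07 2011" \
    "sdd3|Tue Nov 15 22:33:33 2011"
do
    dev=${entry%%|*}          # part before the "|"
    ts=${entry#*|}            # part after the "|"
    printf '%s %s\n' "$(date -d "$ts" +%s)" "$dev"
done | sort -n                # oldest (stale) members print first
```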

Should I try to force the assembly...?

Thanks,

----------

## gr0x0rd

The top post when googling "mdadm recover raid5" has this comment at the bottom of the thread:

 *Quote:*   

> Hello , since this seems to be a popular thread in google i thought i'd share my solution to the problem.
> 
> Instead of recreating the array (possibility for errors if you set the wrong chunk size , or wrong devices), I just ran the below command -
> 
> mdadm --assemble /dev/md5 --uuid={UUID-of-raid5-array} --force
> ...

 I think this is my next best option, but I'll wait for a second opinion (or more). How do I find the UUID of the array when my mdadm.conf lives on the array itself? Hopefully I can recover the UUID from one of the drives in the array.
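(For what it's worth, the UUID should be recoverable from any surviving component's superblock via `mdadm -E`, no mdadm.conf needed. A sketch of pulling it out, demonstrated on the excerpt captured earlier in this thread rather than a live device:)

```shell
#!/bin/sh
# Extract the array UUID from `mdadm -E` output. On the live system
# the heredoc would be replaced by:  mdadm -E /dev/sda3
uuid=$(awk -F' : ' '/^ *UUID/ { print $2; exit }' <<'EOF'
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 14bcf749:0f71556c:cb201669:f728008a (local to host livecd)
EOF
)
uuid=${uuid%% *}    # drop the trailing "(local to host ...)" note
echo "$uuid"        # -> 14bcf749:0f71556c:cb201669:f728008a
```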

Nervous...

----------

## NeddySeagoon

gr0x0rd,

Now it's a case of "do you feel lucky"...  if not, get four more drives and image your raid first; you may be glad of the undo facility.

You can attempt to force the raid to assemble with the two active drives and one of the failed ones.  This will update the raid superblocks, if it works.

As one drive will be missing, you won't have any redundant information, so mounting the filesystem read-only with journal replay off would allow you to look around and recover files.

You can't tell if the files are intact without actually looking at them.

You can also run fsck in its don't fix anything mode, to see what it thinks about the filesystem metadata.

If you try to assemble with the other failed drive instead, you may get different answers.

The problem is that the raid set is not consistent. The kernel has spotted that, but it cannot tell you what's wrong or how much data is damaged. 

Read man mdadm for --force ...

```
       -f, --force
              Assemble the array even if the metadata on some devices  appears
              to  be out-of-date.  If mdadm cannot find enough working devices
              to start the array, but can find some devices that are  recorded
              as  having failed, then it will mark those devices as working so
              that the array can be started.  An array which requires  --force
              to be started may contain data corruption.  Use it carefully.
```

Hence look at the combinations of degraded mode and read only.  You don't want anything to be synced until you have had the chance to look around.
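Putting that advice together, the whole recovery pass might look like the sketch below. Everything device-specific here is an assumption from this thread (the md126 name, the sd[abc]3 components, an ext3/ext4 root), and the script deliberately runs in dry-run mode: with `DRY_RUN=1` each step is only printed, never executed.

```shell
#!/bin/sh
# Dry-run sketch of the force-assemble-then-inspect sequence.
# Nothing below touches a disk while DRY_RUN=1; flip it to 0 only
# after imaging the drives.
DRY_RUN=1
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run mdadm --stop /dev/md126
# Force assembly with the two clean members plus ONE of the stale
# ones, as suggested above (the fourth drive stays out):
run mdadm --assemble /dev/md126 --force /dev/sda3 /dev/sdc3 /dev/sdb3
# Read-only mount with journal replay disabled (ext3/ext4 "noload"):
run mount -o ro,noload /dev/md126 /mnt
# Metadata check in report-only mode (after unmounting):
run fsck.ext3 -n /dev/md126
```

The point of the wrapper is that you can review the exact command list before committing; nothing syncs until you decide it should.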

----------

## gr0x0rd

Do I feel lucky?

The cruise control in my truck died on the drive home yesterday, my RAID array died, and then my tenant called to let me know his fridge wasn't working. Given bad things always happen in threes...  :Smile: 

My only remaining question is regarding the UUID. mdadm shows 

```
 UUID : 14bcf749:0f71556c:cb201669:f728008a (local to host livecd)
```

Let me know if this is ok to use, or if I need to somehow recover the UUID from my previous mdadm.conf.

But before I do anything, I'm going to go out and get good and drunk, spend some time with the girlfriend and have a good night's sleep. Thanks for the advice sir.

----------

## NeddySeagoon

gr0x0rd

 *gr0x0rd wrote:*   

>  ... I'm going to go out and get good and drunk, spend some time with the girlfriend and have a good night's sleep.

  That's always a good plan.

Then allow your hangover to clear before you go near the computer.

You can use any UUID you like - mdadm will not attempt to assemble the raid if the UUID on the components and the UUID given on the command line don't match.

The UUID is a permanent feature of the raid set, assigned when it's created.  If you get the UUID wrong, mdadm will complain.

mdadm -E /dev/<component> will show you the UUID of the raid set.

Note the filesystem will have a UUID too; it will be different from the raid set UUID (or it wouldn't be Universally Unique) and is fixed at filesystem creation time.

Whatever you decide to do, think it through first as you may only have one try.

----------

## gr0x0rd

The array reassembled without issue. Held my breath as my system booted... and it came up clean.

I am leaving the system idle for the time being and allowing the array to re-sync. I'll be doing some damage assessment later today after this has completed.

Thanks for all your help Neddy.

Cheers,

----------

## NeddySeagoon

gr0x0rd,

I hope it's all good.

----------

## gr0x0rd

Good news: I've been up and running for 9h 30m without issue. All services running. All databases, documents, precious files etc seem to be just fine. I really didn't have any write operations (other than logs and background services etc) running from the time of the failure onward so...

Any tools to recommend to do any recovery checks? I know I should run fsck but obviously I can't do this while the system is running.  :Smile: 

Other than that... marking thread as solved. And a good day to you, sir!

----------

## NeddySeagoon

gr0x0rd,

fsck probably doesn't do what you think it does, but running it in read-only, don't-fix mode won't do any harm.

fsck looks at the metadata on the filesystem and ensures that the metadata is self-consistent. That's all.

The metadata is the information the filesystem keeps for itself to locate your data.

There are no tools that can tell you whether your own data is good or not.

----------

