# [SOLVED] mdadm RAID1 - replacing a failed drive

## cami

So just one day after I had my RAID set up (see [SOLVED] How to properly boot a custom initramfs?), a disk failed permanently. I've installed an identical replacement, but I cannot figure out how to get the array to actually use it. The whole idea behind using RAID was to make this easy, yet the harder I try, the more questions I find instead of answers.

I initially created a full-disk RAID-1 on two identical disks using Intel Storage Manager (X58 chipset).

```
mdadm --examine /dev/sda

/dev/sda:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.1.00
    Orig Family : 1d385601
         Family : 83eb12c3
     Generation : 00005e60
     Attributes : All supported
           UUID : eb43e025:cff7929e:9af766e7:f2d60015
       Checksum : d65c6715 correct
    MPB Sectors : 2
          Disks : 2
   RAID Devices : 1

  Disk01 Serial : PK1134P6JWDGUW
          State : active
             Id : 00010000
    Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

[Volume0]:
           UUID : 083b0d35:d926f293:ef50839b:4f023f76
     RAID Level : 1 <-- 1
        Members : 2 <-- 2
          Slots : [UU] <-- [__]
    Failed disk : 1
      This Slot : 1 (out-of-sync)
     Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
   Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
  Sector Offset : 0
    Num Stripes : 15261808
     Chunk Size : 64 KiB <-- 64 KiB
       Reserved : 0
  Migrate State : rebuild
      Map State : degraded <-- degraded
     Checkpoint : 0 (512)
    Dirty State : dirty

  Disk00 Serial : 134P6JVNVHW:0:0
          State : active failed
             Id : ffffffff
    Usable Size : 3907022936 (1863.01 GiB 2000.40 GB)
```

The last lines represent the failed disk; it no longer physically exists. The other disk (Disk01 Serial: PK1134P6JWDGUW) is attached as /dev/sda. The new drive is attached as /dev/sdb, but is not used in any way yet.

```
NAME          MAJ:MIN  RM  SIZE RO TYPE  MOUNTPOINT
sda             8:0     0  1,8T  0 disk
└─md_d127     254:8128  0  1,8T  0 raid1
  ├─md_d127p1 254:8129  0 1023M  0 md
  ├─md_d127p2 254:8130  0   31G  0 md    [SWAP]
  └─md_d127p3 254:8131  0  1,8T  0 md    /
sdb             8:16    0  1,8T  0 disk
```

The Intel Storage Manager UI only lets me create or delete arrays, not replace drives. So I have to do this using mdadm somehow. I already ran

```
mdadm --manage /dev/md127 --remove failed
```

It exited without printing anything, so I'm not sure whether it actually did something.

The first thing I do not understand is why there is a separate "container" (md127) and an "array" (md_d127), what each of these is, and when to use which. Most sources on the net have just one "md0". Documentation on containers is very brief.

The second thing I do not understand is the output of /proc/mdstat, mdadm --detail and mdadm --examine. The documentation doesn't explain very well what the differences are or how to interpret the output. As far as I understood, --examine reads a metadata block from the physical drives. I couldn't figure out what --detail and /proc/mdstat do.
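If I understand it right, the `[2/1] [U_]` pair in /proc/mdstat encodes exactly the degraded state: two slots, one populated. A little shell sketch I used to convince myself, parsing the sample text quoted below rather than reading the live /proc/mdstat:

```shell
#!/bin/sh
# Parse the [total/active] counters out of an mdstat-style dump.
# The sample text is the /proc/mdstat output from this thread.
mdstat='md_d127 : active raid1 sda[0]
      1953511424 blocks super external:/md127/0 [2/1] [U_]

md127 : inactive sda[0](S)
      3028 blocks super external:imsm'

# Only the member-array line carries a [total/active] pair; device
# markers like sda[0] contain no slash and are not matched.
counters=$(printf '%s\n' "$mdstat" | grep -o '\[[0-9]*/[0-9]*\]')
wanted=${counters#?}; wanted=${wanted%%/*}   # total slots   -> 2
have=${counters##*/}; have=${have%?}         # active devices -> 1
if [ "$have" -lt "$wanted" ]; then state=degraded; else state=clean; fi
echo "$state ($have of $wanted devices active)"
```

For this sample it prints `degraded (1 of 2 devices active)`, matching the `[U_]` marker.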

```
cat /proc/mdstat

Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
md_d127 : active raid1 sda[0]
      1953511424 blocks super external:/md127/0 [2/1] [U_]

md127 : inactive sda[0](S)
      3028 blocks super external:imsm

unused devices: <none>
```

```
mdadm --detail /dev/md127

/dev/md127:
        Version : imsm
     Raid Level : container
  Total Devices : 1
Working Devices : 1

           UUID : eb43e025:cff7929e:9af766e7:f2d60015
  Member Arrays : /dev/md/Volume0_0

    Number   Major   Minor   RaidDevice
       0       8        0        -        /dev/sda
```

```
mdadm --detail /dev/md_d127

/dev/md_d127:
      Container : /dev/md/imsm0, member 0
     Raid Level : raid1
     Array Size : 1953511424 (1863.01 GiB 2000.40 GB)
  Used Dev Size : 1953511556 (1863.01 GiB 2000.40 GB)
   Raid Devices : 2
  Total Devices : 1

          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 083b0d35:d926f293:ef50839b:4f023f76

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       2       0        0        2      removed
```

I can add and remove /dev/sdb to the container /dev/md127, but that doesn't seem to affect the actual array.

```
cami ~ # cat /proc/mdstat
Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
md_d127 : active raid1 sda[0]
      1953511424 blocks super external:/md127/0 [2/1] [U_]

md127 : inactive sdb[1](S) sda[0](S)
      6056 blocks super external:imsm

unused devices: <none>

cami ~ # lsblk
NAME          MAJ:MIN  RM  SIZE RO TYPE  MOUNTPOINT
sda             8:0     0  1,8T  0 disk
└─md_d127     254:8128  0  1,8T  0 raid1
  ├─md_d127p1 254:8129  0 1023M  0 md
  ├─md_d127p2 254:8130  0   31G  0 md    [SWAP]
  └─md_d127p3 254:8131  0  1,8T  0 md    /
sdb             8:16    0  1,8T  0 disk

cami ~ # mdadm --detail /dev/md127
/dev/md127:
        Version : imsm
     Raid Level : container
  Total Devices : 2
Working Devices : 2

           UUID : eb43e025:cff7929e:9af766e7:f2d60015
  Member Arrays : /dev/md/Volume0_0

    Number   Major   Minor   RaidDevice
       0       8        0        -        /dev/sda
       1       8       16        -        /dev/sdb

cami ~ # mdadm --detail /dev/md_d127
/dev/md_d127:
      Container : /dev/md/imsm0, member 0
     Raid Level : raid1
     Array Size : 1953511424 (1863.01 GiB 2000.40 GB)
  Used Dev Size : 1953511556 (1863.01 GiB 2000.40 GB)
   Raid Devices : 2
  Total Devices : 1

          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 083b0d35:d926f293:ef50839b:4f023f76

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       2       0        0        2      removed

cami ~ # mdadm --examine /dev/sdb
/dev/sdb:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.0.00
    Orig Family : 00000000
         Family : e3724720
     Generation : 00000001
     Attributes : All supported
           UUID : 00000000:00000000:00000000:00000000
       Checksum : 01a96b92 correct
    MPB Sectors : 1
          Disks : 1
   RAID Devices : 0

  Disk00 Serial : PK1134P6JVNVHW
          State : spare
             Id : 03000000
    Usable Size : 3907026958 (1863.02 GiB 2000.40 GB)

    Disk Serial : PK1134P6JVNVHW
          State : spare
             Id : 03000000
    Usable Size : 3907026958 (1863.02 GiB 2000.40 GB)
```
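To make the mismatch explicit: sdb shows up in the container stanza (md127) but not in the member-array stanza (md_d127). A quick sketch checking that against the mdstat text quoted above (sample text, not a live query):

```shell
#!/bin/sh
# Check which mdstat stanza a given disk appears in: the imsm
# container (md127) or the actual member array (md_d127).
# The sample text is the /proc/mdstat dump quoted above.
mdstat='md_d127 : active raid1 sda[0]
      1953511424 blocks super external:/md127/0 [2/1] [U_]

md127 : inactive sdb[1](S) sda[0](S)
      6056 blocks super external:imsm'

disk=sdb
# index() looks for e.g. "sdb[" on the stanza's header line.
in_member=$(printf '%s\n' "$mdstat" | awk -v d="$disk" '$1 == "md_d127" && index($0, d"[") { print "yes" }')
in_container=$(printf '%s\n' "$mdstat" | awk -v d="$disk" '$1 == "md127" && index($0, d"[") { print "yes" }')
echo "member:${in_member:-no} container:${in_container:-no}"
```

For this sample it prints `member:no container:yes`, i.e. the add only reached the container.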

The RAID contains the root filesystem, so stopping and reassembling the array isn't easy, albeit possible (using a boot CD). I had hoped for an easy solution; easy replacement was the idea behind the setup, after all. But so far I haven't found any solution at all that doesn't require recreating the array and losing the data.

*Last edited by cami on Wed Jul 27, 2016 12:05 pm; edited 2 times in total*

----------

## frostschutz

Is there a Windows install on this machine? For Linux it's best to stick to the native metadata format and not use Intel Storage Manager at all.

You don't have to remove the failed disk. Just ignore it.

My guess is that you need `mdadm /dev/md_d127 --add /dev/sdb` but I could be wrong because I don't use imsm format.

----------

## cami

Well I already issued the suggested command without achieving the desired result (see OP for details).

I already noticed imsm might not have been the best choice but now I'm kind of stuck with it.

----------

## frostschutz

You're not really showing that in your post... and you only talk of adding to md127, not md_dangnabbit127.

If that doesn't work, could you show output for file -s and parted print for each disk?

```
for disk in /dev/sda* /dev/sdb* /dev/md* /dev/md*/*
do
    file -sL "$disk"
    parted "$disk" unit s print free
done
```

----------

## cami

Oh sorry, I overlooked that bit. It is not possible to --add to the array directly; mdadm says I should add to the container instead.

I will post the output of the requested commands tomorrow. Note however that it's a full-disk RAID. I also included the output of mdadm --examine for both disks in the OP; maybe that helps for the time being.

----------

## frostschutz

*cami wrote:*

> Note however that it's a full-disk RAID.

There are currently two threads on the linux-raid mailing list by people who destroyed their RAID due to it being a full-disk raid. ( http://www.spinics.net/lists/raid/msg53033.html http://www.spinics.net/lists/raid/msg53046.html )

Their mistake: They partitioned their full disk RAID with GPT, then ran a partitioner on ... the full disk.

Partitioner sees GPT data at either start or end of the disk (GPT keeps a backup at the end), and restores/rebuilds the "missing" GPT on the other end of the disk - and there goes your RAID metadata bye-bye.

I never do full-disk RAID, or full-disk anything for that matter, there's just too many ways for it to go wrong in unexpected ways. Always use a partition table.

My suggestion for you is to bite the bullet and do it over. If your current RAID is still working, you can use sdb to build a new structure from scratch, this time with a traditional disk -> partitions -> md -> filesystem structure.
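Roughly like this, as a dry-run sketch (nothing below touches a disk, run() only prints the plan; device names and sizes are illustrative, adjust before doing anything for real):

```shell
#!/bin/sh
# Dry-run sketch of a disk -> partitions -> md -> filesystem setup.
# run() only echoes each step; remove it to actually execute.
run() { printf '+ %s\n' "$*"; }

disk=/dev/sdb                     # illustrative, double-check yours
plan=$(
  run parted -s "$disk" mklabel gpt
  run parted -s "$disk" mkpart primary 1MiB 100%
  # native 1.2 metadata instead of imsm; "missing" holds the second
  # slot open until the data is migrated off the old disk
  run mdadm --create /dev/md0 --metadata=1.2 --level=1 --raid-devices=2 "${disk}1" missing
  run mkfs.ext4 /dev/md0
)
printf '%s\n' "$plan"
```

With partitions in the way, a stray partitioner sees a partition table where it expects one, instead of rebuilding a "missing" GPT over your metadata.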

----------

## cami

Thanks for your advice. I already noticed the setup choices might not have been the best.

For completeness, I did not do anything fancy with the disks, I only swapped the failed drive. The RAID is still working, only degraded. So this is basically the standard situation RAID is designed for.

I strongly doubt it has anything to do with partitioning, however; I suspect I would have the exact same problem if it were sda1 and sdb1 instead.

So I'm still looking for a proper solution that doesn't mean starting over. If starting over were the only solution, RAID 1 would be pointless and a standard backup would be more efficient. So could we assume it's unrelated to partitioning and pretend the RAID is on an individual partition?

----------

## cami

Update. I was able to solve the problem today, although I still don't understand what happened. Here's what I did:

Booted the system using a Gentoo LiveCD

Noticed that the LiveCD found two containers (imsm0 and imsm1) and one volume (Volume0_0)

```
$ ls /dev/md
Volume0_0 imsm0 imsm1
```

Found that Volume0_0 was using container imsm1

```
mdadm --detail /dev/md/Volume0_0
```

Checked metadata on /dev/sda and /dev/sdb (see OP for outputs)

```
$ mdadm --examine /dev/sda
...
$ mdadm --examine /dev/sdb
...
```

Observed that /dev/sda contained the Intel Storage Manager metadata for my RAID, with the first disk missing and the second disk being /dev/sda itself. (see also OP)

Observed that /dev/sdb contained Intel Storage Manager metadata for a spare without any assigned volume (see also OP)

Assumed that container imsm0 consisted of the spare /dev/sdb

Stopped container /dev/md/imsm0

```
mdadm --manage /dev/md/imsm0 --stop
```

Added /dev/sdb to container /dev/md/imsm1

```
mdadm --manage /dev/md/imsm1 --add /dev/sdb
```

I could hear that this started a rebuild. 

I don't know why this didn't work while the system was running; somehow mdadm must have added /dev/sdb to a new container instead of the one I specified. The LiveCD and my system use different versions of mdadm, so maybe a bug?

```
# mdadm --version # this is the potentially buggy version
mdadm - v3.3.1 - 5th June 2014
```

Checked what was going on

```
# cat /proc/mdstat
Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
md125 : active raid1 sdb[1] sda[0]
      1953511424 blocks super external:/md126/0 [2/1] [_U]
      [==>..................]  recovery = 10.8% (212502848/1953511556) finish=223.3min speed=129900K/sec

md126 : inactive sda[1](S) sdb[0](S)
      6056 blocks super external:imsm

unused devices: <none>
```
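The recovery line already contains everything needed to estimate progress; a small sketch pulling the numbers out of that line (sample text copied from the dump above, not a live read):

```shell
#!/bin/sh
# Extract percentage and ETA from an mdstat recovery line.
line='      [==>..................]  recovery = 10.8% (212502848/1953511556) finish=223.3min speed=129900K/sec'
pct=$(printf '%s\n' "$line" | sed -n 's/.*recovery = \([0-9.]*\)%.*/\1/p')
eta=$(printf '%s\n' "$line" | sed -n 's/.*finish=\([0-9.]*\)min.*/\1/p')
echo "recovery ${pct}% done, about ${eta} min left"
```

For this sample it prints `recovery 10.8% done, about 223.3 min left`.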

Checked disk metadata

```
cami ~ # mdadm --examine /dev/sda
/dev/sda:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.1.00
    Orig Family : 1d385601
         Family : 5a6ea771
     Generation : 00005e95
     Attributes : All supported
           UUID : eb43e025:cff7929e:9af766e7:f2d60015
       Checksum : 60a930ae correct
    MPB Sectors : 2
          Disks : 2
   RAID Devices : 1

  Disk01 Serial : PK1134P6JWDGUW
          State : active
             Id : 00010000
    Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

[Volume0]:
           UUID : 083b0d35:d926f293:ef50839b:4f023f76
     RAID Level : 1 <-- 1
        Members : 2 <-- 2
          Slots : [UU] <-- [_U]
    Failed disk : 0
      This Slot : 1
     Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
   Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
  Sector Offset : 0
    Num Stripes : 15261808
     Chunk Size : 64 KiB <-- 64 KiB
       Reserved : 0
  Migrate State : rebuild
      Map State : normal <-- degraded
     Checkpoint : 787874 (512)
    Dirty State : dirty

  Disk00 Serial : PK1134P6JVNVHW
          State : active
             Id : 00030000
    Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

cami ~ # mdadm --examine /dev/sdb
/dev/sdb:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.1.00
    Orig Family : 1d385601
         Family : 5a6ea771
     Generation : 00005e95
     Attributes : All supported
           UUID : eb43e025:cff7929e:9af766e7:f2d60015
       Checksum : 60a930ae correct
    MPB Sectors : 2
          Disks : 2
   RAID Devices : 1

  Disk00 Serial : PK1134P6JVNVHW
          State : active
             Id : 00030000
    Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

[Volume0]:
           UUID : 083b0d35:d926f293:ef50839b:4f023f76
     RAID Level : 1 <-- 1
        Members : 2 <-- 2
          Slots : [UU] <-- [_U]
    Failed disk : 0
      This Slot : 0 (out-of-sync)
     Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
   Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
  Sector Offset : 0
    Num Stripes : 15261808
     Chunk Size : 64 KiB <-- 64 KiB
       Reserved : 0
  Migrate State : rebuild
      Map State : normal <-- degraded
     Checkpoint : 787874 (512)
    Dirty State : dirty

  Disk01 Serial : PK1134P6JWDGUW
          State : active
             Id : 00010000
    Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)
```

Tested the array by mounting the partitions and accessing some files and directories.

```
# lsblk
NAME          MAJ:MIN  RM  SIZE RO TYPE  MOUNTPOINT
sda             8:0     0  1,8T  0 disk
└─md125       254:8128  0  1,8T  0 raid1
  ├─md125p1   254:8129  0 1023M  0 md
  ├─md125p2   254:8130  0   31G  0 md    [SWAP]
  └─md125p3   254:8131  0  1,8T  0 md    /
sdb             8:16    0  1,8T  0 disk
└─md125       254:8128  0  1,8T  0 raid1
  ├─md125p1   254:8129  0 1023M  0 md
  ├─md125p2   254:8130  0   31G  0 md    [SWAP]
  └─md125p3   254:8131  0  1,8T  0 md    /

# mount /dev/md/Volume0_0p3 /mnt/gentoo # /dev/md/Volume0_0p3 symlinks to /dev/md125p3
# ...
# umount /mnt/gentoo
```

I didn't want to wait for the recovery to finish, so I stopped the array, checked that everything was offline, checked the metadata again, and rebooted.

```
mdadm --manage /dev/md/Volume0_0 --stop
mdadm --manage /dev/md/imsm1 --stop
cat /proc/mdstat # this should say "no devices"
mdadm --examine /dev/sda
mdadm --examine /dev/sdb
reboot
```
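Note that even with everything stopped, /proc/mdstat still prints the Personalities line; "no devices" really means no mdN stanzas are left. A sketch of the check I mean, against sample text rather than the live file:

```shell
#!/bin/sh
# Verify an mdstat dump has no active md stanzas left.
# The sample is what a fully stopped system would show.
mdstat='Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
unused devices: <none>'

if printf '%s\n' "$mdstat" | grep -q '^md'; then
  state=still-active
else
  state=all-stopped
fi
echo "$state"
```

For this sample it prints `all-stopped`, which is what I waited for before rebooting.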

During boot, Intel Storage Manager showed the RAID with both disks attached and state "Rebuild" (i.e. recovery), and the system came up normally.
