# RAID problem after major upgrade

## nw_biohazard

Hi,

I just attempted a major upgrade on a system configured with RAID1 and seem to have uncovered a problem: the two drives are not in sync.  After finally completing an "emerge --update --newuse --deep world" and the associated revdep-rebuild, I rebooted the system. When it came back up, it appears to be running from a disk that has been idle for months.  I have backups and should be able to restore most of the data, but I'd like to recover in place if possible.

I'm in a bit over my head here.  Any clues how to proceed? Any help would be greatly appreciated.

Here is some output

```
localhost ~ # cat /proc/mdstat
Personalities : [raid1]
md126 : active raid1 sdb3[1]
      1003968 blocks [2/1] [_U]

md4 : active raid1 sda4[0]
      151203712 blocks [2/1] [U_]

md1 : active raid1 sdb1[1] sda1[0]
      72192 blocks [2/2] [UU]

md3 : active raid1 sda3[0]
      1003968 blocks [2/1] [U_]

md127 : active raid1 sdb4[1]
      151203712 blocks [2/1] [_U]

unused devices: <none>
```
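For what it's worth, the `[2/1] [_U]` fields are the tell: 2 configured devices, 1 active, and `_` marking the missing mirror half. A quick way to pick out the degraded sets (a hypothetical helper, not from the thread, run here against a saved copy of the output above so it's self-contained; on the live box you'd read /proc/mdstat directly):

```shell
# Save a copy of the mdstat output shown above.
cat > /tmp/mdstat.sample <<'EOF'
Personalities : [raid1]
md126 : active raid1 sdb3[1]
      1003968 blocks [2/1] [_U]

md4 : active raid1 sda4[0]
      151203712 blocks [2/1] [U_]

md1 : active raid1 sdb1[1] sda1[0]
      72192 blocks [2/2] [UU]

md3 : active raid1 sda3[0]
      1003968 blocks [2/1] [U_]

md127 : active raid1 sdb4[1]
      151203712 blocks [2/1] [_U]

unused devices: <none>
EOF

# An "_" in the [..] status string means a missing mirror half.
awk '/^md/ {name=$1} /blocks/ && /_/ {print name " is degraded"}' /tmp/mdstat.sample
```

That flags md126, md4, md3, and md127; only md1 is a healthy `[UU]` pair.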

And 

```
# /etc/fstab: static file system information.
#
# noatime turns off atimes for increased performance (atimes normally aren't
# needed; notail increases performance of ReiserFS (at the expense of storage
# efficiency).  It's safe to drop the noatime options if you want and to
# switch between notail / tail freely.
#
# The root filesystem should have a pass number of either 0 or 1.
# All other filesystems should have a pass number of 0 or greater than 1.
#
# See the manpage fstab(5) for more information.
#
# <fs>                  <mountpoint>    <type>          <opts>          <dump/pass>
# NOTE: If your BOOT partition is ReiserFS, add the notail option to opts.
/dev/md1                /boot           ext2            noauto,noatime  1 2
/dev/md3                /               ext3            noatime         0 1
/dev/sda2               none            swap            sw,pri=1        0 0
/dev/sdb2               none            swap            sw,pri=1        0 0
/dev/vg/usr             /usr            ext3            noatime         1 2
/dev/vg/portage         /usr/portage    ext2            noatime         1 2
/dev/vg/distfiles       /usr/portage/distfiles  ext2    noatime         1 2
/dev/vg/home            /home           ext3            noatime         1 2
/dev/vg/opt             /opt            ext3            noatime         1 2
/dev/vg/tmp             /tmp            ext2            noatime         1 2
/dev/vg/var             /var            ext3            noatime         1 2
/dev/vg/vartmp          /var/tmp        ext2            noatime         1 2
#/dev/cdrom             /mnt/cdrom      auto            noauto,ro       0 0
#/dev/fd0               /mnt/floppy     auto            noauto          0 0
/dev/cdrecorder         /mnt/cdrecorder auto            noauto,user     0 0
/dev/dvd                /mnt/dvd        auto            noauto,user     0 0
# glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
# POSIX shared memory (shm_open, shm_unlink).
# (tmpfs is a dynamically expandable/shrinkable ramdisk, and will
#  use almost no memory if not populated with files)
shm                     /dev/shm        tmpfs           nodev,nosuid,noexec     0 0
```

----------

## BradN

I'm not familiar with mdstat output; can you run mdadm --detail /dev/md* ?

Also what does dmesg show for raid output when the system boots?

And... what's your raid partition arrangement supposed to be?

For example, md0 = /dev/hda2, /dev/hdb2

From your output it kinda looks like your individual raid partitions are getting assembled into separate md devices instead of one each, like they should be?  I'm not sure though, because mdstat isn't really my thing.

----------

## nw_biohazard

```
localhost ~ # mdadm --detail /dev/md*
mdadm: /dev/md does not appear to be an md device
/dev/md1:
        Version : 0.90
  Creation Time : Sun Jul 27 09:54:42 2008
     Raid Level : raid1
     Array Size : 72192 (70.51 MiB 73.92 MB)
  Used Dev Size : 72192 (70.51 MiB 73.92 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Sun Jul 25 15:20:49 2010
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : a6420c7f:0471687c:ed99c6d7:b255b8d8
         Events : 0.4

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1

/dev/md126:
        Version : 0.90
  Creation Time : Sun Jul 27 09:55:06 2008
     Raid Level : raid1
     Array Size : 1003968 (980.60 MiB 1028.06 MB)
  Used Dev Size : 1003968 (980.60 MiB 1028.06 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 126
    Persistence : Superblock is persistent

    Update Time : Sun Jul 25 15:20:49 2010
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : d8dfa9e3:5f473395:95e66932:06bea8e0
         Events : 0.10

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       19        1      active sync   /dev/sdb3

/dev/md127:
        Version : 0.90
  Creation Time : Sun Jul 27 09:55:26 2008
     Raid Level : raid1
     Array Size : 151203712 (144.20 GiB 154.83 GB)
  Used Dev Size : 151203712 (144.20 GiB 154.83 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 127
    Persistence : Superblock is persistent

    Update Time : Sun Jul 25 15:20:49 2010
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 1ad9a949:5ff17671:6dc8d47b:cf21a2a2
         Events : 0.8652

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       20        1      active sync   /dev/sdb4

/dev/md3:
        Version : 0.90
  Creation Time : Sun Jul 27 09:55:06 2008
     Raid Level : raid1
     Array Size : 1003968 (980.60 MiB 1028.06 MB)
  Used Dev Size : 1003968 (980.60 MiB 1028.06 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 3
    Persistence : Superblock is persistent

    Update Time : Sun Jul 25 15:21:04 2010
          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : d8dfa9e3:5f473395:95e66932:06bea8e0
         Events : 0.8137

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       0        0        1      removed

/dev/md4:
        Version : 0.90
  Creation Time : Sun Jul 27 09:55:26 2008
     Raid Level : raid1
     Array Size : 151203712 (144.20 GiB 154.83 GB)
  Used Dev Size : 151203712 (144.20 GiB 154.83 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 4
    Persistence : Superblock is persistent

    Update Time : Sun Jul 25 15:20:50 2010
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 1ad9a949:5ff17671:6dc8d47b:cf21a2a2
         Events : 0.5748126

    Number   Major   Minor   RaidDevice State
       0       8        4        0      active sync   /dev/sda4
       1       0        0        1      removed
```

dmesg

```
localhost ~ # dmesg | grep raid
[    0.997064] md: raid1 personality registered for level 1
[    1.614849] md: If you don't use raid, use raid=noautodetect
[    1.692575] raid1: raid set md127 active with 1 out of 2 mirrors
[    1.694387] raid1: raid set md3 active with 1 out of 2 mirrors
[    1.695744] raid1: raid set md1 active with 2 out of 2 mirrors
[    1.696784] raid1: raid set md4 active with 1 out of 2 mirrors
[    4.894045] raid1: raid set md126 active with 1 out of 2 mirrors
```

I set up RAID as follows:

```
mknod /dev/md1 b 9 1
mknod /dev/md3 b 9 3
mknod /dev/md4 b 9 4
livecd ~# mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
livecd ~# mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
livecd ~# mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sda4 /dev/sdb4
livecd ~# cat /proc/mdstat
mdadm --assemble /dev/md1 /dev/sda1 /dev/sdb1
mdadm --assemble /dev/md3 /dev/sda3 /dev/sdb3
mdadm --assemble /dev/md4 /dev/sda4 /dev/sdb4
nano -w /etc/lvm/lvm.conf
replace filter line with
filter = [ "a|/dev/md[1234]|", "r/.*/"]
vgscan
vgchange -a y
livecd ~# pvcreate /dev/md4
livecd ~# vgcreate vg /dev/md4
lvcreate -L10G -nusr vg
lvcreate -L2G  -nportage vg
lvcreate -L4G  -ndistfiles vg
lvcreate -L50G -nhome vg
lvcreate -L5G  -nopt vg
lvcreate -L4G  -nvar vg
lvcreate -L6G  -nvartmp vg
lvcreate -L2G  -ntmp vg
mke2fs /dev/md1
mke2fs -j /dev/md3
mke2fs -b 4096 -T largefile /dev/vg/distfiles
mke2fs -j /dev/vg/home
mke2fs -j /dev/vg/opt
mke2fs -b 1024 -N 200000 /dev/vg/portage
mke2fs /dev/vg/tmp
mke2fs -j /dev/vg/usr
mke2fs -j /dev/vg/var
mke2fs /dev/vg/vartmp
mkswap /dev/sda2 && mkswap /dev/sdb2
swapon -p 1 /dev/sda2 && swapon -p 1 /dev/sdb2
swapon -v -s
mount /dev/md3 /mnt/gentoo
cd /mnt/gentoo
mkdir boot home usr opt var tmp
mount /dev/md1 /mnt/gentoo/boot
mount /dev/vg/usr /mnt/gentoo/usr
mount /dev/vg/home /mnt/gentoo/home
mount /dev/vg/opt /mnt/gentoo/opt
mount /dev/vg/tmp /mnt/gentoo/tmp
mount /dev/vg/var /mnt/gentoo/var
mkdir usr/portage var/tmp
mount /dev/vg/vartmp /mnt/gentoo/var/tmp
mount /dev/vg/portage /mnt/gentoo/usr/portage
mkdir /usr/portage/distfiles
mount /dev/vg/distfiles /mnt/gentoo/usr/portage/distfiles
chmod 1777 /mnt/gentoo/tmp /mnt/gentoo/var/tmp
```
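As a longer-term guard (my own suggestion, not something from the install docs): pinning the arrays in /etc/mdadm.conf by UUID should stop autodetect from assembling stray halves under new minor numbers again. A sketch, using the UUIDs from the mdadm --detail output above:

```
# /etc/mdadm.conf sketch (UUIDs taken from mdadm --detail above)
DEVICE /dev/sda* /dev/sdb*
ARRAY /dev/md1 UUID=a6420c7f:0471687c:ed99c6d7:b255b8d8
ARRAY /dev/md3 UUID=d8dfa9e3:5f473395:95e66932:06bea8e0
ARRAY /dev/md4 UUID=1ad9a949:5ff17671:6dc8d47b:cf21a2a2
```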

----------

## BradN

Hmm, I've never heard of Linux software raid forgetting which md devices raid partitions belong to, but it seems that's what happened here.

It seems md126 and md127 are what it assigned to the raid partitions it couldn't figure out.

For the two raids that got split this way, I'd recommend finding out which device in each pair has the better copy of your data (if nothing was written after they split, they're probably identical).

Then --fail every device except the ones in your md1 raid, including the ones in md126 and md127.  After failing the devices, remove them all with -r.

Re-create md3 and md4 the same way you did originally, but with only one device each: use the 'missing' keyword in place of the other device.  That lets you control which partition's contents get used.

For example, if /dev/sdb3 is the better copy of the two for md3, do a command like:

# mdadm --create /dev/md3 --level=1 --raid-devices=2 missing /dev/sdb3

Then re-add the other device:

mdadm -a /dev/md3 /dev/sda3

Do the same steps for md4, and hopefully the configuration holds.

BTW, kudos on keeping your setup commands saved somewhere - this is a really smart thing to do with raid, even more so if you use non-standard options.
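Putting those steps together, here's one possible shape for the recovery, written as a dry run so nothing is executed (the run helper only echoes each command). It assumes the sda halves are the copies you want to keep; verify that first. Note I've used mdadm --stop on the stray arrays rather than failing each member individually, since stopping releases the whole array in one go:

```shell
# Dry-run sketch of the recovery, NOT verbatim from the thread.
# Assumes /dev/sda3 and /dev/sda4 hold the data you want to keep.
run() { echo "+ $*"; }   # swap for run() { "$@"; } to execute for real

# 1. Shut down the stray arrays holding the sdb halves.
run mdadm --stop /dev/md126
run mdadm --stop /dev/md127

# 2. Re-create md3/md4 degraded; 'missing' reserves the second slot so
#    only the sda contents are used.  mdadm will notice the old
#    superblocks and ask for confirmation before creating.
run mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sda3 missing
run mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sda4 missing

# 3. Only after checking the data: re-add the sdb halves, which then
#    get overwritten by a resync from sda.
run mdadm /dev/md3 -a /dev/sdb3
run mdadm /dev/md4 -a /dev/sdb4
```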

----------

## nw_biohazard

Ok, thanks.  I'll give that a try. If I can recover the data, then I'll be more confident about playing around and seeing if I can fix it up.

I'll let you know how it goes.

----------

## BradN

Biggest thing to keep in mind: as long as you don't overwrite the data, it won't be lost.  You can create and remove the raids all you want without affecting the data, but once you add the second device, it'll get overwritten with the contents of the first.  So check all your data before you -a the 2nd partition.

----------

## nw_biohazard

I'm having trouble removing them:

```
localhost home # mdadm /dev/md127 -f /dev/sdb4
mdadm: set /dev/sdb4 faulty in /dev/md127
localhost home # mdadm --detail /dev/md127
/dev/md127:
        Version : 0.90
  Creation Time : Sun Jul 27 09:55:26 2008
     Raid Level : raid1
     Array Size : 151203712 (144.20 GiB 154.83 GB)
  Used Dev Size : 151203712 (144.20 GiB 154.83 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 127
    Persistence : Superblock is persistent

    Update Time : Sun Jul 25 18:03:44 2010
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 1ad9a949:5ff17671:6dc8d47b:cf21a2a2
         Events : 0.8866

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       20        1      active sync   /dev/sdb4
localhost home # mdadm /dev/md127 -r /dev/sdb4
mdadm: hot remove failed for /dev/sdb4: Device or resource busy
localhost home # mdadm /dev/md127 -f /dev/sdb4 -r /dev/sdb4
mdadm: set /dev/sdb4 faulty in /dev/md127
mdadm: hot remove failed for /dev/sdb4: Device or resource busy
localhost home #
```

----------

## BradN

I might be a little rusty on my mdadm-fu; make sure lvm is shut down on top of them (and of course make sure any filesystems are unmounted).

If that isn't enough, maybe there's a different command to shut down a md device; check --help.
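In case it helps: notice in the --detail output above that even after -f, sdb4 still shows "active sync" with Failed Devices : 0 — md won't actually fail or remove the last working member of a running array, which would explain the busy errors. Stopping the stray array outright sidesteps the hot-remove path entirely. A dry-run sketch (the helper only echoes; make sure LVM and any mounts on md126/md127 are released first):

```shell
run() { echo "+ $*"; }   # dry run: prints each command instead of executing

# Stop the stray arrays whole rather than failing their last members.
run mdadm --stop /dev/md126
run mdadm --stop /dev/md127
```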

----------

