# Shrunk RAID instead of adding disk; data recovery? [solved]

## gregp01

(See below for original post with details.) Well, this is embarrassing. In a fit of spectacular stupidity, I used mdadm --grow -z instead of mdadm --grow -n. I now have three 4 kB RAID arrays. Do I have any hope of recovering my data? In theory it's all still there, possibly minus some small portion that was corrupted during the change. I'm guessing that if I do --grow -z again back to the old size, though, it will start syncing and immediately trash all my data. Is there any way to get the RAID system to do the right thing, and recognize the full partitions again?
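For anyone finding this thread later: the sequence I *meant* to run was the -n form. A sketch with my device names (sda = an existing member, sdd = the new 2 TB drive) — don't copy it blindly against your own layout:

```
# Copy the partition table from a surviving member to the new disk
sfdisk -d /dev/sda | sfdisk /dev/sdd

# Add the new partition as a spare, then raise the DEVICE COUNT
# (-n / --raid-devices) -- NOT the size (-z / --size, which is what
# I actually typed) -- so it becomes an active mirror and syncs
mdadm --add /dev/md0 /dev/sdd1
mdadm --grow /dev/md0 -n 4

# Once synced, retire the old drive and shrink the device count back
mdadm --fail /dev/md0 /dev/sda1
mdadm --remove /dev/md0 /dev/sda1
mdadm --grow /dev/md0 -n 3
```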

I'm going to start by copying the full contents of one of the 1 TB drives onto the 2 TB drive, so that I have a copy I'm not worried about trashing. Any help would be appreciated beyond the ability of words to express.
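(Concretely, the copy is just a raw block clone — drive letters here are from my box, so check yours with fdisk -l first:)

```
# Clone one surviving 1 TB member wholesale onto the 2 TB drive.
# conv=noerror,sync keeps going past read errors, padding bad
# blocks with zeros instead of aborting.
dd if=/dev/sdc of=/dev/sdd bs=1M conv=noerror,sync
```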

As an aside, why in the world would mdadm allow you to shrink an active, in-use array without even a warning?

Original post:

Executive summary: LVM refuses to recognize my physical volumes. I can see the LVM label right there at the start of the disk (cat /dev/foo), but I get the extremely unhelpful error message in the topic when I try to pvdisplay /dev/foo.

The details:

I have three 1 TB drives, with four RAID-1 partitions each. The first RAID-1 is the system disk, without LVM, and the other three are joined together with LVM. I was attempting to swap out one of the 1 TB drives for a 2 TB drive, by duplicating the partition table (sfdisk -d /dev/sda | sfdisk /dev/sdd), growing the RAIDs (mdadm --grow /dev/md0 -z 4  ;  mdadm --add /dev/md0 /dev/sdd1) and then --fail and --remove'ing all the RAID partitions on the old 1 TB drive.

I still haven't the slightest clue what went wrong. I successfully copied the partition table, and the --grow commands executed without error. Yet, immediately after I tried to add the additional RAID partition to the system RAID, every command came back with "command not found". "Oh crap, did I somehow hose my / filesystem?"

Being completely unable to do anything further, I rebooted the system and hoped for the best. Of course, I got the dreaded kernel panic, unable to mount root. I pulled out my Gentoo install CD and booted off that, and was able to see that all the RAID devices were intact and synced. I started them (mdadm --assemble ...) without any problems.

And that's where I am now. My system RAID seems to have been zeroed out somehow (how is that even possible?), but I can see the LVM labels on my three data RAIDs. Why doesn't pvdisplay recognize them? Even pvdisplay -d -v doesn't provide any extra information at all.

Super extra thanks,

Greg

*Last edited by gregp01 on Thu Feb 18, 2010 4:15 am; edited 2 times in total*

----------

## NeddySeagoon

gregp01,

You may be able to get at your data by using mount with the raw drive (no partition numbers) and an alternate filesystem superblock.

As you say, there may be some damage where the new raid superblock was written.

The format of the command is 

```
mount -o ro,offset=32256,sb=131072 -t ext3 /dev/sda /mnt/<someplace>
```

This says to do a read-only mount of the filesystem starting at byte 32256 from the start of sda, using the filesystem superblock at block 131072.

That's the same as mounting /dev/sda1 with its first alternate superblock, if the filesystem uses a 4k block size.
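The arithmetic, if you need to adapt the numbers to your own disk (mount's sb= option counts in 1k units no matter what block size the filesystem really uses; the 63-sector partition start and 4k block size below are assumptions — check yours with fdisk -lu and tune2fs -l):

```
# Byte offset of the filesystem: partition 1 starts at sector 63
# on old-style DOS partition tables (an assumption - check fdisk -lu)
SECTOR = 512
offset_bytes = 63 * SECTOR

# ext3 keeps its first backup superblock at filesystem block 32768.
# mount's sb= option counts in 1 kB units, so convert:
fs_block_size = 4096          # assumption - check tune2fs -l
sb_option = 32768 * fs_block_size // 1024

print(offset_bytes, sb_option)  # 32256 131072
```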

Practicing read only is harmless. Read 

```
man mount
```

With the raid superblock in there too, the offset may not be correct - I've never needed to do this on a raid volume.

----------

## gregp01

NeddySeagoon, thank you! I left my dd copy going overnight along with an fsck.reiserfs on the system RAID copy (since it finished first), and to my complete and utter shock, no corruption was found on the system fs.

I've just now reassembled (degraded) my data RAID on the copy and resized it back to the proper size. LVM found itself without any problems, and I'm currently running fsck.ext3 -n. It found unspecified errors immediately, but at least it wasn't so bad that it couldn't even find a place to start. Maybe the new superblock mostly wiped out the journal instead of the important data structures? Anyway, I'm going to see what I end up with, and most likely use your suggestion to mount it and copy everything to a fresh fs instance, since I wouldn't trust it even if the fsck claimed to fix everything.

----------

## gregp01

My fsck.ext3 on my data LVM volume also, shockingly, found no serious issues (sort of). I first tried -n without using a backup superblock:

```
# fsck.ext3 -n lvm-raid0-750gb/raid0-750gb-all
Warning: skipping journal recovery because doing a read-only filesystem check.
lvm-raid0-750gb/raid0-750gb-all contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free inodes count wrong (88198990, counted=88198991).
Fix? no

lvm-raid0-750gb/raid0-750gb-all: ********** WARNING: Filesystem still has errors **********
lvm-raid0-750gb/raid0-750gb-all: 143538/88342528 files (1.9% non-contiguous), 157311984/176655360 blocks
```

But when I used one of the backup superblocks, it found thousands of "groups" with incorrect free inode counts:

```
# fsck.ext3 -n -b 229376 lvm-raid0-750gb/raid0-750gb-all
ext3 recovery flag is clear, but journal has data.
Recovery flag not set in backup superblock, so running journal anyway.
Clear journal? no

lvm-raid0-750gb/raid0-750gb-all was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #1 (29650, counted=0).
Fix? no

Free blocks count wrong for group #2 (32254, counted=0).
Fix? no

Free blocks count wrong for group #3 (31229, counted=6).
Fix? no

...

Free inodes count wrong for group #5376 (16384, counted=16382).
Fix? no

Free inodes count wrong (88342517, counted=88198991).
Fix? no

lvm-raid0-750gb/raid0-750gb-all: ********** WARNING: Filesystem still has errors **********
lvm-raid0-750gb/raid0-750gb-all: 11/88342528 files (25209.1% non-contiguous), 2822746/176655360 blocks
```

I'm guessing that this is because the backup superblocks aren't updated very often? And of course the >100% non-contiguous stat is rather concerning, but nowhere near as much as the suggestion that I only had 11 files on the volume.
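(For reference, the -b values come from the standard sparse_super backup locations: block groups 1 and the powers of 3, 5, and 7. Assuming 4 kB blocks — which matches my fs, since 229376 = 7 × 32768 — the candidates are easy to compute:)

```
# ext3 with sparse_super keeps backup superblocks in groups 1 and
# the powers of 3, 5, and 7. With 4 kB blocks a group spans 32768
# blocks, so the backup in group g sits at filesystem block g*32768.
def backup_superblocks(n_groups, blocks_per_group=32768):
    groups = {1}
    for base in (3, 5, 7):
        p = base
        while p < n_groups:
            groups.add(p)
            p *= base
    return sorted(g * blocks_per_group for g in groups)

# My fs has 176655360 blocks, i.e. 5392 groups:
print(backup_superblocks(5392)[:4])  # [32768, 98304, 163840, 229376]
```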

I ran another -n fsck with a different backup superblock, with essentially the same result (thousands of wrong free inode counts, very few files found). So, I decided to use the primary superblock for the rw fsck run:

```
# fsck.ext3 -p lvm-raid0-750gb/raid0-750gb-all
lvm-raid0-750gb/raid0-750gb-all: recovering journal
lvm-raid0-750gb/raid0-750gb-all: clean, 143537/88342528 files, 157311984/176655360 blocks
```

Obviously, I didn't realize that -p meant "only check the journal", so I ran again with -y -f:

```
# fsck.ext3 -y -f lvm-raid0-750gb/raid0-750gb-all
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
lvm-raid0-750gb/raid0-750gb-all: 143537/88342528 files (1.9% non-contiguous), 157311984/176655360 blocks
```

Hooray! To my utter astonishment, it seems as if I somehow haven't lost any data at all. Is it possible that the RAID superblock goes at the end of the device, instead of the end of the allocated array space? That seems a bit strange, but even if it isn't the case, I can (probably) live with having a couple of random files corrupted. I'm still going to start with a fresh fs instance, though, and copy everything over. Hopefully everything won't go all pear-shaped during the copy...

----------

## gregp01

As I suspected, mdadm --grow -z max with more than one device does start a resync. Oh well, at least it's just a time sink now instead of complete data loss (I've got my hopefully-fixed copy on the new drive, plus the untouched accidentally-shrunk original 3rd drive).

After doing some googling (which, sigh, I obviously should have done first), it sounds like the old raidtools had --dangerous-no-resync which would have done what I wanted, but that the mdadm equivalent, --assume-clean, only works with --build, which creates an array without superblocks.
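If anyone is curious, I believe the --build form would look something like this — completely untested by me, and dangerous precisely because there are no superblocks to sanity-check anything (device names are just examples):

```
# Assemble a superblock-less 3-way RAID-1 from existing members,
# skipping the initial resync. Level, device count, and member order
# are all taken on faith here, so any mistake silently corrupts data.
mdadm --build /dev/md9 --level=1 --raid-devices=3 --assume-clean \
    /dev/sda5 /dev/sdb5 /dev/sdc5
```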

----------

## NeddySeagoon

gregp01,

Good luck recovering your data. There is no automatic way to ensure that your files contain the data you expect.

fsck only checks that the metadata is self consistent. It says nothing about the user data on the filesystem.

In short - you will have to inspect every file yourself.

----------

## gregp01

Thanks again for the help. I know I can't be sure about the data in my files, but I was quite surprised that fsck didn't find any damage to the filesystem itself - I would've thought that the RAID superblock for an 8 kB array would trash the very part of the original array where the primary superblock, inode tables, etc. were stored. Either I got ridiculously lucky, or the RAID superblock goes at the end of the device instead of the end of the used array space. (Or my assumption of the ext3 layout is wrong, which may actually be the most likely possibility...)
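If I've understood the old 0.90 metadata format correctly (a big if), the superblock sits in the last 64-128 kB of the *device*, 64 kB aligned, regardless of what -z says — which would explain why shrinking the array never touched my filesystem's data area. A quick sanity check of my understanding:

```
# 0.90 superblock location: 64 kB aligned, within the last 64-128 kB
# of the underlying DEVICE -- independent of the array's -z size.
def md090_superblock_offset(device_bytes):
    chunk = 64 * 1024
    return (device_bytes & ~(chunk - 1)) - chunk

# Example: a 1 TB (10^12-byte) partition
off = md090_superblock_offset(10**12)
print(10**12 - off)  # 69632 bytes from the end, i.e. between 64 and 128 kB
```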

As for the files, so far so good. My system was able to boot itself, nothing has crashed horribly yet, and all the data files I've used so far have had no detectable corruption. Plus, this finally provided the motivation to stop being a lazy moron and set up an actual backup solution :-)   I used to use rsnapshot for a subset of my files, but even that was just saving to the same filesystem-on-LVM-on-RAID as the rest of my data, and I never got around to fixing it when it stopped working for no obvious reason like 3 years ago. So, yeah, I guess this was a net win?
