# [SOLVED] drive was removed from my RAID 5 array; is it dead?

## mhelvens

Hello all!

My /home dir consists of a RAID 5 array with three 1.5TB disks. Yesterday I did an `emerge --update --deep world`. Today upon reboot, /home didn't mount.

So, I did an `mdadm --assemble --scan` and got the message:

```
mdadm: /dev/md127 has been started with 2 drives (out of 3)
```

`mdadm --detail /dev/md127` now shows one of the drives as 'removed':

```
mhelvens-pc mhelvens # mdadm --detail /dev/md127

/dev/md127:

        Version : 1.2

  Creation Time : Thu Oct 20 19:41:06 2011

     Raid Level : raid5

     Array Size : 2930272256 (2794.53 GiB 3000.60 GB)

  Used Dev Size : 1465136128 (1397.26 GiB 1500.30 GB)

   Raid Devices : 3

  Total Devices : 2

    Persistence : Superblock is persistent

    Update Time : Mon Oct 29 14:06:06 2012

          State : clean, degraded 

 Active Devices : 2

Working Devices : 2

 Failed Devices : 0

  Spare Devices : 0

         Layout : left-symmetric

     Chunk Size : 512K

           Name : michiel-pc:0

           UUID : 82da8dc5:42efff78:bcce5cab:0baa4591

         Events : 24760

    Number   Major   Minor   RaidDevice State

       3       8       17        0      active sync   /dev/sdb1

       1       8       33        1      active sync   /dev/sdc1

       2       0        0        2      removed
```

If I press on and tell `mdadm` specifically about /dev/sdd1, I get something like:

```
mdadm: /dev/md/michiel-pc:0_0 assembled from 1 drive - not enough to start the array
```

Also `mdadm /dev/md127 --re-add /dev/sdd1` doesn't work:

```
mdadm: --re-add for /dev/sdd1 to /dev/md127 is not possible
```

Of course, it occurred to me that the drive may be dead (that's why I have RAID5, after all). But it seems like too much of a coincidence that this happened after a long overdue world update where I... didn't pay particular attention to the messages afterwards.

How can I be sure?

Thanks in advance!

*Last edited by mhelvens on Tue Oct 30, 2012 7:53 pm; edited 1 time in total*

----------

## DaggyStyle

Far from being an expert, but can you see anything on that drive? Partition tables? SMART status?

Also, if I'm not mistaken, getting the redundancy feature of RAID 5 working (i.e. lose one drive, data still intact) requires 4 drives; running RAID 5 on three drives is RAID 5 without redundancy.

----------

## mhelvens

 *DaggyStyle wrote:*   

> far from being an expert but can you see anything on that drive? partition tables? smart status?

 

Looks like it. In fdisk I can still see this info (looks fine):

```
Disk /dev/sdd: 1500.3 GB, 1500301910016 bytes, 2930277168 sectors

Units = sectors of 1 * 512 = 512 bytes

Sector size (logical/physical): 512 bytes / 512 bytes

I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk identifier: 0xa7a50b94

   Device Boot      Start         End      Blocks   Id  System

/dev/sdd1            2048  2930277167  1465137560   fd  Linux raid autodetect
```

Is there any other specific info I should look up?

 *DaggyStyle wrote:*   

> also if I'm not mistaken, in order to get redundancy feature working in raid5 (e.g. loose one drive, data still intact) requires 4 drives, running raid5 on three drives is raid 5 without redundancy.

 

No, that's not true. RAID 5 uses one drive's worth of capacity for redundancy whenever you have three or more drives in total. Right now the array is running fine on two drives, without redundancy.

----------

## DaggyStyle

 *mhelvens wrote:*   

>  *DaggyStyle wrote:*   far from being an expert but can you see anything on that drive? partition tables? smart status? 
> 
> Looks like. In fdisk I can still see this info (looks fine):
> 
> ```
> ...

 

ok, it seems that my IT guy is wrong.

I think it may be worthwhile to check the superblock on that drive and see if it matches the others' superblocks. Also, did you upgrade your kernel before it happened?

In addition, try running smartctl on the drive to get some data and see whether the drive is in a pre-fail state.
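Something like this should do it (device name guessed from your output; run as root):

```
smartctl -H /dev/sdd           # overall health self-assessment: PASSED / FAILED
smartctl -A /dev/sdd           # vendor attribute table (reallocated sectors etc.)
smartctl -t short /dev/sdd     # kick off a short self-test (takes a couple of minutes)
smartctl -l selftest /dev/sdd  # read the self-test log afterwards
```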

----------

## mhelvens

 *DaggyStyle wrote:*   

> I think it maybe worthwhile to check the superblock on that drive and see if it matches the other's superblock,

 

I'm not sure how to do that.

 *DaggyStyle wrote:*   

> also, did you upgraded your kernel before it happened?

 

Nope.

 *DaggyStyle wrote:*   

> in addtion, try to run smartctl on the drive and get some data and see if the drive is not pre fail.

 

Brilliant! Never used that. Anyway, the drive in question PASSED with flying colours. No errors reported, etc.

Seems the drive is not dying. Just don't know how to add it back to the array.

Should I try --add? I didn't want to try that yet, as I assumed it would add the disk as a new drive and trigger a complete resync.

----------

## DaggyStyle

 *mhelvens wrote:*   

>  *DaggyStyle wrote:*   I think it maybe worthwhile to check the superblock on that drive and see if it matches the other's superblock, 
> 
> I'm not sure how to do that.
> 
>  *DaggyStyle wrote:*   also, did you upgraded your kernel before it happened? 
> ...

 

There are specific entries to watch in smartctl's output; search the forum.

Why didn't you try to add it again?

As for the superblock, I assume dd and md5sum are the way.
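Actually, mdadm itself can dump the superblock, which is probably simpler than hashing raw sectors with dd. A sketch (device names taken from the output earlier in the thread):

```
# Keep just the superblock fields worth comparing across members; the member
# whose Events counter lags behind is the one that fell out of the array.
sb_summary() { grep -E 'Array UUID|Update Time|Events'; }
# Usage (as root):
#   mdadm --examine /dev/sdb1 | sb_summary
#   mdadm --examine /dev/sdc1 | sb_summary
#   mdadm --examine /dev/sdd1 | sb_summary
```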

----------

## mhelvens

 *DaggyStyle wrote:*   

> there are specific entries to watch in smartctl{s output, search the forum.

 

I looked at all the info. Everything looking good. Health status: PASSED. Everything completed without errors. I ran a short self-test: no errors.

 *DaggyStyle wrote:*   

> why didn't you tried to add it again?

 

Because the drive was already in the array before, I assumed it would be a quick fix. When I --add, it takes quite a while to complete.

I now guess that a 'quick' fix (like --re-add) couldn't work because there were writes while the array was mounted with only two drives, so the third drive had become inconsistent? Just guessing.

So I used --add anyway. It's now recovering; 1150 minutes to go. ;) I assume it will go faster once I stop using the array, but right now I have no choice. Work to complete.
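In case it helps anyone else: the progress shows up in /proc/mdstat, and the md speed-limit sysctls (values in KiB/s; exact defaults may differ per kernel) let you trade latency for rebuild time while the array is in use:

```
# Watch the recovery progress (percentage, ETA, current speed):
cat /proc/mdstat

# Rebuild speed limits, in KiB/s. While the array is busy, md throttles the
# rebuild down toward speed_limit_min; raising it speeds up the resync at the
# cost of slower normal I/O.
cat /proc/sys/dev/raid/speed_limit_min   # typically 1000 by default
cat /proc/sys/dev/raid/speed_limit_max   # typically 200000 by default
echo 50000 > /proc/sys/dev/raid/speed_limit_min
```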

Thanks! I'll report back when it completes.

----------

## energyman76b

ok, only read the start and not all of the rest.

I hope you didn't do anything stupid in the mean time.

First of all: 

Most of the time when a drive is not added to an array, nothing serious happened: the driver wasn't done initializing the hardware, or similar. Nothing bad. Just timing.

smartctl is a good call. Please have smartd run. Always. Especially with raid devices.

Check dmesg. No errors? 

Then continue:

First things first: log out as user. Unmount /home. The less you do on that FS, the smaller the chance that the resync will run into problems.

Second: stop the array

`mdadm -S /dev/md127` (or whatever it's called)

Third: start the array

`mdadm -A /dev/md127`

and report back.

I almost never experience problems with kernel assembled arrays. But the 'new' superblock 1.2 arrays that have to be assembled by mdraid during boot are a different story....

Edit: read the rest now.

Ok, --add/--re-add might or might not work... restarting is easier...

the next time your array is degraded and not mounted, don't mount it. Fix it first. If it is degraded and mounted... well...

----------

## mhelvens

 *energyman76b wrote:*   

> ok, only read the start and not all of the rest.
> 
> I hope you didn't do anything stupid in the mean time.

 

Maybe I did. But if so, it appears to have gone well enough.

Please read on and help me find out if everything is fine now?

 *energyman76b wrote:*   

> smartctl is a good call. Please have smartd run. Always. Especially with raid devices.

 

Thanks for the tip! I'll find out more about smartd. Until now I've been running the mdadm daemon, which is also supposed to warn me if anything goes wrong. Would that be redundant?

 *energyman76b wrote:*   

> Check dmesg. No errors? 
> 
> Then continue:
> 
> First things first: log out as user. Unmout /home. The less you do on that FS the smaller the chance that the resync will run into problems.
> ...

 

Did all that (sort of). I reported the outcome in my first post; I just neglected to mention some of the steps (unmount, stop array, etc.). The exception is that I started right away with `--assemble --scan`.

Anyway, now that you've read the rest... So I did an --add. It was recovering the array all night, and it seems to have gone ok. Here's `mdadm --detail /dev/md127`:

```
mhelvens-pc / # mdadm --detail /dev/md127 

/dev/md127:

        Version : 1.2

  Creation Time : Thu Oct 20 19:41:06 2011

     Raid Level : raid5

     Array Size : 2930272256 (2794.53 GiB 3000.60 GB)

  Used Dev Size : 1465136128 (1397.26 GiB 1500.30 GB)

   Raid Devices : 3

  Total Devices : 3

    Persistence : Superblock is persistent

    Update Time : Tue Oct 30 18:48:08 2012

          State : clean 

 Active Devices : 3

Working Devices : 3

 Failed Devices : 0

  Spare Devices : 0

         Layout : left-symmetric

     Chunk Size : 512K

           Name : michiel-pc:0

           UUID : 82da8dc5:42efff78:bcce5cab:0baa4591

         Events : 56187

    Number   Major   Minor   RaidDevice State

       3       8       17        0      active sync   /dev/sdb1

       1       8       33        1      active sync   /dev/sdc1

       4       8       49        2      active sync   /dev/sdd1
```

As you can see, it looks fine. The only thing is: the 'Number' column now shows '4' for sdd1. As far as mdadm is concerned, I suppose, nr. 2 died and I put nr. 4 in its place.

Anyway, can you recommend any final tests to make sure everything is OK?

 *energyman76b wrote:*   

> the next time your array is degraded and not mounted, don't mount it. Fix it first. If it is degraded and mounted... well...

 

I'll remember that!

Thanks!

*Last edited by mhelvens on Tue Oct 30, 2012 6:00 pm; edited 2 times in total*

----------

## Jaglover

I had a similar issue with my RAID 0: looking at /proc/mdstat I found two broken arrays (md0 and md127) instead of one working array. After fiddling with /etc/mdadm.conf it started working.

----------

## energyman76b

mdadm will only scream when a disk is dead.

smartd can warn you so you might be able to act before the disk is dead.
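A minimal /etc/smartd.conf sketch (device names, test schedule, and mail target here are just examples, not a drop-in config):

```
# Monitor all attributes (-a), enable automatic offline testing (-o on) and
# attribute autosave (-S on), run a short self-test every day at 02:00 and a
# long one every Saturday at 03:00, and mail warnings to root.
/dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/sdc -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/sdd -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
```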

It also runs self tests - if you configure it that way - which also help to find worrisome developments. 

For example this:

```
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       2
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       1
```

happens. If it does not change over weeks or months, nothing to worry about. But if it goes up quickly... get a new disk ASAP.

```
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
```

Or those two. One or two? Might happen. Constant growth? Time for a backup.
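To pull out just those counters, something like this works (assuming the standard `smartctl -A` column layout, where the last column is the raw value):

```
# Keep only the attribute IDs worth tracking (196/197/199/200/223)
# and print ID, attribute name, and raw value.
watch_attrs() { awk '$1 ~ /^(196|197|199|200|223)$/ {print $1, $2, $NF}'; }
# Usage (as root):
#   smartctl -A /dev/sdd | watch_attrs
```

Run it now and again and compare; it's the trend that matters, not a one-off value.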

----------

## mhelvens

Ok. Thanks!

----------

## Jaglover

And the solution was?

----------

## mhelvens

 *Jaglover wrote:*   

> And the solution was?

 

I described the 'solution' in my earlier post. I used `mdadm --add` to add the drive back into the array. It had to completely resync, but is working fine now.

This was not the ideal solution. I could have possibly let it 'catch back up' with the array, but I had already mounted it, and I didn't know how.
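From what I've read since, the 'catch back up' path needs a write-intent bitmap: with one enabled, md records which stripes changed while a member was missing, so a `--re-add` only resyncs the dirty stripes instead of the whole disk. Apparently it can be added to an existing array:

```
# Add an internal write-intent bitmap to the running array, so that a member
# which drops out briefly can be --re-add'ed with only a partial resync.
mdadm --grow --bitmap=internal /dev/md127
```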

I marked the topic as [SOLVED] because my problem is now gone. Is this not common practice?

Cheers!

----------

