# Raid 1 sync problem after replacing failed drive

## kilburna

Hi

I have a RAID 1 setup in a small server, shown below. After replacing the failed drive, re-adding the partitions to md0 and md1 works and their sync completes. But when I run mdadm /dev/md2 --add /dev/sdb3, watching cat /proc/mdstat shows the sync getting up to 92% and then silently exiting. Every time I restart the server the sync starts again and silently fails at 92%.

I also tried to execute echo check >> /sys/block/md2/md/sync_action to check for bad blocks, but when I execute this it does not seem to do anything, as seen from cat /proc/mdstat.

Any pointers on how to proceed?

```
EBox ~ # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
      6143936 blocks [2/2] [UU]

md2 : active raid1 sdb3[2](S) sda3[0]
      306393344 blocks [2/1] [U_]

md0 : active raid1 sdb1[1] sda1[0]
      32704 blocks [2/2] [UU]

unused devices: <none>
```

```
EBox ~ # mdadm --detail /dev/md2
/dev/md2:
        Version : 0.90
  Creation Time : Thu Oct 18 07:51:26 2012
     Raid Level : raid1
     Array Size : 306393344 (292.20 GiB 313.75 GB)
  Used Dev Size : 306393344 (292.20 GiB 313.75 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Thu Feb 11 18:20:21 2016
          State : clean, degraded
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

           UUID : f1348aae:8e9ffe0f:cb201669:f728008a
         Events : 0.42206763

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       -       0        0        1      removed
       2       8       19        -      spare   /dev/sdb3
```

Regards

kilburna

----------

## kilburna

Ran dmesg and noticed this:

```
[ 4179.275418] ata1.00: failed command: READ FPDMA QUEUED
[ 4179.275424] ata1.00: cmd 60/00:78:80:08:d6/05:00:22:00:00/40 tag 15 ncq 655360 in
                        res 41/40:00:b2:09:d6/00:05:22:00:00/00 Emask 0x409 (media error) <F>
[ 4179.275435] ata1.00: status: { DRDY ERR }
[ 4179.275436] ata1.00: error: { UNC }
[ 4179.275438] ata1.00: failed command: READ FPDMA QUEUED
[ 4179.275441] ata1.00: cmd 60/00:80:80:0d:d6/05:00:22:00:00/40 tag 16 ncq 655360 in
                        res 41/04:00:b2:09:d6/00:00:22:00:00/00 Emask 0x1 (device error)
[ 4179.275443] ata1.00: status: { DRDY ERR }
[ 4179.275444] ata1.00: error: { ABRT }
[ 4179.275445] ata1.00: failed command: READ FPDMA QUEUED
[ 4179.275448] ata1.00: cmd 60/80:88:80:12:d6/04:00:22:00:00/40 tag 17 ncq 589824 in
                        res 41/04:00:b2:09:d6/00:00:22:00:00/00 Emask 0x1 (device error)
[ 4179.275450] ata1.00: status: { DRDY ERR }
[ 4179.275451] ata1.00: error: { ABRT }
[ 4179.275452] ata1.00: failed command: READ FPDMA QUEUED
[ 4179.275455] ata1.00: cmd 60/80:90:00:17:d6/00:00:22:00:00/40 tag 18 ncq 65536 in
                        res 41/04:00:b2:09:d6/00:00:22:00:00/00 Emask 0x1 (device error)
[ 4179.275456] ata1.00: status: { DRDY ERR }
[ 4179.275457] ata1.00: error: { ABRT }
[ 4179.275459] ata1.00: failed command: READ FPDMA QUEUED
[ 4179.275461] ata1.00: cmd 60/80:98:80:17:d6/00:00:22:00:00/40 tag 19 ncq 65536 in
                        res 41/04:00:b2:09:d6/00:00:22:00:00/00 Emask 0x1 (device error)
[ 4179.275463] ata1.00: status: { DRDY ERR }
[ 4179.275464] ata1.00: error: { ABRT }
[ 4179.301951] ata1.00: configured for UDMA/133
[ 4179.301977] sd 0:0:0:0: [sda] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4179.301982] sd 0:0:0:0: [sda] tag#15 Sense Key : Medium Error [current] [descriptor]
[ 4179.301986] sd 0:0:0:0: [sda] tag#15 Add. Sense: Unrecovered read error - auto reallocate failed
[ 4179.301997] sd 0:0:0:0: [sda] tag#15 CDB: Read(10) 28 00 22 d6 08 80 00 05 00 00
[ 4179.301999] blk_update_request: I/O error, dev sda, sector 584452530
[ 4179.302011] ata1: EH complete
[ 4181.875511] ata1.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0
[ 4181.875515] ata1.00: irq_stat 0x40000008
[ 4181.875518] ata1.00: failed command: READ FPDMA QUEUED
[ 4181.875524] ata1.00: cmd 60/08:58:b0:09:d6/00:00:22:00:00/40 tag 11 ncq 4096 in
                        res 41/40:08:b2:09:d6/00:00:22:00:00/00 Emask 0x409 (media error) <F>
[ 4181.875527] ata1.00: status: { DRDY ERR }
[ 4181.875529] ata1.00: error: { UNC }
[ 4181.879253] ata1.00: configured for UDMA/133
[ 4181.879267] sd 0:0:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4181.879271] sd 0:0:0:0: [sda] tag#11 Sense Key : Medium Error [current] [descriptor]
[ 4181.879275] sd 0:0:0:0: [sda] tag#11 Add. Sense: Unrecovered read error - auto reallocate failed
[ 4181.879291] sd 0:0:0:0: [sda] tag#11 CDB: Read(10) 28 00 22 d6 09 b0 00 00 08 00
[ 4181.879293] blk_update_request: I/O error, dev sda, sector 584452530
[ 4181.879303] ata1: EH complete
[ 4181.879333] md/raid1:md2: sda: unrecoverable I/O read error for block 572096896
[ 4181.879359] md: md2: recovery interrupted.
```

This is probably why the sync is not completing: sda3 has an I/O read error. Are I/O errors something that echo check >> /sys/block/md2/md/sync_action should take care of?

Is there a way to force sdb3 to be added to the array?

----------

## salahx

Unfortunately it appears that md2 only has one "leg", /dev/sda3, as the other leg is a spare. So the only leg you have is bad. I recommend you back up /dev/sda3 first. If you just have a small handful of bad sectors, you can repair them with hdparm. dmesg will give you the sector numbers that are bad, so you can pass those to --write-sector and --read-sector:

```
hdparm --write-sector 584452530 /dev/sda
hdparm --read-sector 584452530 /dev/sda
```

This should force the drive to reallocate the sector. Repeat for each bad sector. Note that --write-sector is a destructive operation, and hdparm will require confirmation that this is really what you want. Then repeat the sync; if you run into more bad sectors, repeat the above procedure with the new sector numbers until the sync completes.
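For a handful of sectors, the kernel log itself can drive the repair. A rough sketch of the loop (the grep pattern and device name come from the dmesg output in this thread; --yes-i-know-what-i-am-doing is hdparm's required confirmation flag for destructive operations, so only run this on sectors you have confirmed are unreadable):

```shell
# Pull the failing sector numbers out of the kernel log, then force a
# reallocation of each one. WARNING: --write-sector zeroes the sector,
# destroying whatever data was stored there.
sectors=$(dmesg | awk '/blk_update_request: I\/O error, dev sda/ {print $NF}' | sort -un)
for s in $sectors; do
    hdparm --yes-i-know-what-i-am-doing --write-sector "$s" /dev/sda
    hdparm --read-sector "$s" /dev/sda > /dev/null && echo "sector $s reads OK"
done
```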

You can use smartctl to see how far gone the drive is, but you should probably replace /dev/sda as soon as possible.

----------

## NeddySeagoon

kilburna,

Your good drive has errors, as salahx says.

```
[ 4181.879275] sd 0:0:0:0: [sda] tag#11 Add. Sense: Unrecovered read error - auto reallocate failed
```

Unmount the 'good' drive and use ddrescue to image the good drive onto the replacement drive.

You must run ddrescue with a log file saved on a volume that is neither the source nor destination of the rescue.

The log is used by you to see what is happening and by ddrescue to resume or retry.

ddrescue tries very hard to get one last read out of bad sectors.  If it works, it will recover your unreadable data.

If not, the data is gone.

Once ddrescue has done its stuff, replace the source drive.
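The shape of those ddrescue invocations, as a sketch only (device names and the map-file path are assumptions, not from this thread: /dev/sdc stands in for the replacement drive and /mnt/usb for a third volume holding the map file):

```shell
# First pass: copy everything readable, skipping over bad areas quickly.
# -f is required to let ddrescue overwrite a block device destination.
ddrescue -f /dev/sda /dev/sdc /mnt/usb/sda.map
# Second pass: re-read only the areas marked bad in the map file,
# retrying each up to 3 times.
ddrescue -f -r3 /dev/sda /dev/sdc /mnt/usb/sda.map
```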

The output of smartctl -a /dev/... would be good.  emerge smartmontools if you need to.

----------

## kilburna

Thanks salahx and NeddySeagoon for your assistance. I followed salahx's advice. There were 3 bad sectors on /dev/sda, but eventually the raid synced with /dev/sdb. After that I replaced /dev/sda. All fine.

As a small stat to report, I have had to replace 6 Seagate drives in the last 5 years, whereas I have not yet replaced any of the Hitachi drives used in other servers over the same period.

Would btrfs RAID have fared any better than ext4 + mdadm RAID?

Thanks again

----------

## NeddySeagoon

kilburna,

You had hardware problems. Nothing can fix that. That's what backups are for.

Running a repair every month may provide an early warning of problems.

Keep an eye on the reallocated sector count.
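A minimal sketch of both suggestions, assuming md arrays and smartmontools installed (the cron.monthly path is a common convention, not something from this thread; "check" is a read-only scrub, while "repair" also rewrites mismatched blocks):

```shell
# /etc/cron.monthly/md-check (hypothetical): scrub every md array.
for f in /sys/block/md*/md/sync_action; do
    echo check > "$f"
done

# The reallocated sector count is SMART attribute 5; its raw value is
# the last field of that row in `smartctl -A` output.
smartctl -A /dev/sda | awk '$1 == 5 {print $NF}'
```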

I had two WD Greens in a 5 spindle raid5 set fail within 15 min of one another  :(

I've had one Hitachi fail, all over the last 6 years.

Actually, the Hitachi still works but it has a large dead spot 10G from the start.

I use it for an image of my work laptop, which I boot in VirtualBox, so it's quite expendable.

----------

