# Can't resync RAID1 because of errors - how to fix them?

## JonW

I'm trying to add a disk to a RAID1 volume (previously installed with only one drive), but it won't complete the resync because of errors on the 'first' disk. Smartctl's been telling me about these errors for a little while (hence adding the new disk). I get a whole lot of errors like this before the resync terminates and restarts:-

```
Oct 29 19:37:15 server end_request: I/O error, dev hda, sector 24647620
Oct 29 19:37:15 server hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Oct 29 19:37:15 server hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=24647620, high=1, low=7870404, sector=24647620
Oct 29 19:37:15 server ide: failed opcode was: unknown
Oct 29 19:37:15 server end_request: I/O error, dev hda, sector 24647620
Oct 29 19:37:15 server raid1: hda: unrecoverable I/O read error for block 23555200
Oct 29 19:37:16 server md: md2: sync done.
```

The original disk is /dev/hda. The new one's /dev/hdd.

Smartctl -a /dev/hda shows this:

```
server log # smartctl -a /dev/hda
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG SP1614N

<snip>

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   051    Pre-fail  Always       -       4
  3 Spin_Up_Time            0x0007   065   055   000    Pre-fail  Always       -       5952
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       597
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0024   091   088   000    Old_age   Offline      -       9299
  9 Power_On_Half_Minutes   0x0032   096   096   000    Old_age   Always       -       20883h+31m
 10 Spin_Retry_Count        0x0013   253   253   049    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       449
194 Temperature_Celsius     0x0022   157   085   000    Old_age   Always       -       27
195 Hardware_ECC_Recovered  0x000a   100   100   000    Old_age   Always       -       277318123
196 Reallocated_Event_Count 0x0012   099   099   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0033   253   253   010    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0031   099   099   010    Pre-fail  Offline      -       3
199 UDMA_CRC_Error_Count    0x000b   100   100   051    Pre-fail  Always       -       0
200 Multi_Zone_Error_Rate   0x000b   100   100   051    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x000b   100   100   051    Pre-fail  Always       -       0

<snip>

Error 2 occurred at disk power-on lifetime: 12272 hours (511 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 01 00 00 a0  Error: ICRC, ABRT 1 sectors at LBA = 0x00000001 = 1

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 01 01 00 00 a0 00      00:00:17.500  READ DMA
  ef 03 45 01 10 c2 a0 00      00:00:17.500  SET FEATURES [Set transfer mode]
  ef 03 0c 01 10 c2 a0 00      00:00:17.500  SET FEATURES [Set transfer mode]
  c6 da 10 01 4f c2 a0 00      00:00:17.500  SET MULTIPLE MODE
  b0 da 00 01 4f c2 a0 00      00:00:17.438  SMART RETURN STATUS

<snip>

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       00%     20847         24647620

<snip>
```

I've followed the instructions at http://smartmontools.sourceforge.net/BadBlockHowTo.txt to fix the bad blocks. The block with the error is on /dev/hda3, which is a member of /dev/md2. This is my root volume, running ext3. When I get as far as running debugfs to find the file with the bad block, it fails to find the inode:-

```
server log # debugfs
debugfs 1.39 (29-May-2006)
debugfs:  open /dev/md2
debugfs:  icheck 2944400
Block   Inode number
2944400 <block not found>
```

and so I can't work out which file has the problem.
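For reference, the block number passed to `icheck` can be cross-checked against the kernel log with a little shell arithmetic, following the formula from the BadBlockHowTo, b = (L - S) * 512 / B. This is only a sketch under assumptions: the 4096-byte filesystem block size is not shown anywhere in the thread (`tune2fs -l /dev/md2` would confirm it), the raid1 "block" is taken to be a 512-byte sector offset into hda3, and the partition start is inferred from the two log lines rather than read from `fdisk -lu /dev/hda`.

```shell
#!/bin/sh
# Values copied from the logs quoted above.
lba=24647620          # failing sector on hda (kernel log and SMART self-test log)
md_offset=23555200    # "block" from the raid1 message; ASSUMED to be a
                      # 512-byte sector offset into hda3
sector_size=512
fs_block_size=4096    # ASSUMPTION: ext3 block size, not shown in the thread

# Inferred first sector of hda3 (compare against `fdisk -lu /dev/hda`).
part_start=$(( lba - md_offset ))

# Filesystem block containing the failing sector: b = (L - S) * 512 / B.
fs_block=$(( (lba - part_start) * sector_size / fs_block_size ))

echo "inferred partition start sector: $part_start"
echo "filesystem block:                $fs_block"
```

Under those assumptions the result matches the 2944400 used above, so the block number itself looks right. A `<block not found>` from `icheck` means no inode claims that block; debugfs's `testb 2944400` would show whether the block is in use at all.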

Is there any other way to fix the disk error?

Cheers,

Jon

----------

## Rad

Since it's raid 1 you ultimately want, you could start by copying the partition or drive from hda to hdd with dd-rescue (or ddrescue), to prevent losing data in case the problem with hda escalates. A file-by-file copy may also work.

You can then recreate your raid1 array on md2 from that copy, which will probably be the simplest thing to do. Once you have a non-degraded array, linux md raid should be able to work around bad blocks, although you may want to heed this advice and periodically check your array.
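As for periodically checking the array: on kernels that expose md's sysfs interface, a read check of every block can be requested by writing `check` to the array's `sync_action` file. A minimal sketch, assuming that interface is present on this system; `md_scrub` is a hypothetical helper name, and its optional second argument exists only so the function can be pointed at a scratch directory instead of the real `/sys/block`.

```shell
#!/bin/sh
# md_scrub NAME [SYSFS_ROOT]: ask md to read-check every block of array NAME.
# SYSFS_ROOT defaults to /sys/block (the override is for testing only).
md_scrub() {
    md_dir="${2:-/sys/block}/$1/md"
    if [ ! -w "$md_dir/sync_action" ]; then
        echo "md_scrub: cannot write $md_dir/sync_action" >&2
        return 1
    fi
    echo check > "$md_dir/sync_action"
}

# Usage (as root):   md_scrub md2
# Progress:          cat /proc/mdstat
# Afterwards:        cat /sys/block/md2/md/mismatch_cnt
```

Run from cron once a week or month, this should turn up unreadable sectors while the other mirror half is still healthy, rather than during a rebuild.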

----------

## JonW

Thanks for the tip.

I ran the following to stop the rebuild process:-

```
mdadm --manage /dev/md2 --fail /dev/hdd3
```

And then the following to copy /dev/hda3 to /dev/hdd3:

```
server ~ # ddrescue /dev/hda3 /dev/hdd3
Press Ctrl-C to interrupt
rescued:    30005 MB,  errsize:   65536 B,  current rate:        0 B/s
   ipos:    12060 MB,   errors:     130,    average rate:   30095 kB/s
   opos:    12060 MB
```

How do I rebuild the array from hdd3? Should I just reboot using hdd?

Cheers,

Jon

----------

## Rad

The idea was to copy the partition, remove the current md2 array altogether, initialize a new (degraded) one with only hdd3 as a member, and then try to add hda3 to that.
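Those steps could look roughly like the sketch below. It is not a tested recipe for this machine: it assumes the work is done from rescue media (md2 is the root filesystem, so it cannot be stopped while mounted) and that the original array used 0.90 metadata, whose superblock sits at the end of the device so the copied filesystem still starts at offset 0. The `run` and `rebuild_md2` names are hypothetical helpers, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Prints the commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    echo "+ $*"
    [ "$DRY_RUN" = "1" ] || "$@"
}

rebuild_md2() {
    # 1. Stop the existing array.
    run mdadm --stop /dev/md2
    # 2. Create a new degraded RAID1 holding only the copy; the word
    #    "missing" reserves the second slot.
    #    ASSUMPTION: 0.90 metadata, matching the original array.
    run mdadm --create /dev/md2 --level=1 --raid-devices=2 \
        --metadata=0.90 /dev/hdd3 missing
    # 3. Add the old partition; md then resyncs it from hdd3.
    run mdadm --manage /dev/md2 --add /dev/hda3
}

rebuild_md2
```

Note that the resync in step 3 reads from hdd3 and writes to hda3, the opposite direction from the one that was failing. mdadm will probably ask for confirmation in step 2 because hdd3 already carries a copied superblock and filesystem, and the new array gets a new UUID, so anything that refers to the old one (mdadm.conf, for example) may need updating.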

Oh, and hda is probably very much overheated, about to fail, or both. Just thought I'd mention it in case there's backup-worthy data on the other partitions that hasn't been copied yet...

----------

