# RAID - multiple problems

## kevintshaver

I have a 7-disk RAID-5 that had been running properly for a long time. A few days ago I ran into problems. First, one of the disks is failing. The disks are sda, sdb, sdc, sdd, sdf, sdg, sdh.

```
localhost ~ # smartctl --all /dev/sdg

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x002f   157   157   051    Pre-fail  Always       -       7628

```

Why does smartctl say "PASSED" when the Raw_Read_Error_Rate is so high? In any case, all six other disks are at 0 for read errors.

Here are some parts of the messages log from when the failures happened.

```
Nov  4 13:19:49 localhost kernel: [136240.350504] ata1.00: SATA link down (SStatus 0 SControl 0)

Nov  4 13:19:54 localhost kernel: [136245.350074] ata1.00: hard resetting link

Nov  4 13:19:55 localhost kernel: [136246.357074] ata1.00: failed to resume link (SControl 0)

Nov  4 13:19:55 localhost kernel: [136246.357481] ata1.00: SATA link down (SStatus 0 SControl 0)

Nov  4 13:19:55 localhost kernel: [136246.357669] ata1.00: limiting SATA link speed to 1.5 Gbps

Nov  4 13:20:00 localhost kernel: [136251.357023] ata1.00: hard resetting link

Nov  4 13:20:01 localhost cron[8236]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )

Nov  4 13:20:01 localhost kernel: [136252.364086] ata1.00: failed to resume link (SControl 0)

Nov  4 13:20:01 localhost kernel: [136252.364513] ata1.00: SATA link down (SStatus 0 SControl 0)

Nov  4 13:20:01 localhost kernel: [136252.364657] ata1.00: disabled

Nov  4 13:20:01 localhost kernel: [136252.364817] ata1: EH complete

Nov  4 13:20:01 localhost kernel: [136252.364829] sd 0:0:0:0: rejecting I/O to offline device

Nov  4 13:20:01 localhost kernel: [136252.364968] sd 0:0:0:0: [sda] Unhandled error code

Nov  4 13:20:01 localhost kernel: [136252.364974] sd 0:0:0:0: [sda]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK

Nov  4 13:20:01 localhost kernel: [136252.364983] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 84 09 7b 3f 00 00 08 00

Nov  4 13:20:01 localhost kernel: [136252.365026] end_request: I/O error, dev sda, sector 2215213887

Nov  4 13:20:01 localhost kernel: [136252.365039] md/raid:md127: Disk failure on sda1, disabling device.

Nov  4 13:20:01 localhost kernel: [136252.365043] md/raid:md127: Operation continuing on 6 devices.

Nov  4 13:20:01 localhost kernel: [136252.365078] sd 0:0:0:0: rejecting I/O to offline device

Nov  4 13:20:01 localhost kernel: [136252.365122] ata1.00: detaching (SCSI 0:0:0:0)

Nov  4 13:20:01 localhost kernel: [136252.379216] sd 0:0:0:0: [sda] Stopping disk

Nov  4 13:20:01 localhost kernel: [136252.736064] sd 0:0:0:0: [sda] START_STOP FAILED

Nov  4 13:20:01 localhost kernel: [136252.736072] sd 0:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

Nov  4 13:20:02 localhost kernel: [136253.900227] RAID conf printout:

Nov  4 13:20:02 localhost kernel: [136253.900235]  --- level:5 rd:7 wd:6

Nov  4 13:20:02 localhost kernel: [136253.900241]  disk 0, o:0, dev:sda1

Nov  4 13:20:02 localhost kernel: [136253.900246]  disk 1, o:1, dev:sdb1

Nov  4 13:20:02 localhost kernel: [136253.900252]  disk 2, o:1, dev:sdc1

Nov  4 13:20:02 localhost kernel: [136253.900257]  disk 3, o:1, dev:sdd1

Nov  4 13:20:02 localhost kernel: [136253.900262]  disk 4, o:1, dev:sdf

Nov  4 13:20:02 localhost kernel: [136253.900267]  disk 5, o:1, dev:sdg

Nov  4 13:20:02 localhost kernel: [136253.900272]  disk 6, o:1, dev:sdh

Nov  4 13:20:02 localhost kernel: [136253.905016] RAID conf printout:

Nov  4 13:20:02 localhost kernel: [136253.905023]  --- level:5 rd:7 wd:6

Nov  4 13:20:02 localhost kernel: [136253.905028]  disk 1, o:1, dev:sdb1

Nov  4 13:20:02 localhost kernel: [136253.905034]  disk 2, o:1, dev:sdc1

Nov  4 13:20:02 localhost kernel: [136253.905039]  disk 3, o:1, dev:sdd1

Nov  4 13:20:02 localhost kernel: [136253.905043]  disk 4, o:1, dev:sdf

Nov  4 13:20:02 localhost kernel: [136253.905048]  disk 5, o:1, dev:sdg

Nov  4 13:20:02 localhost kernel: [136253.905053]  disk 6, o:1, dev:sdh

Nov  4 13:20:02 localhost mdadm[4131]: Fail event detected on md device /dev/md127, component device /dev/sda1

Nov  4 13:20:02 localhost kernel: [136253.928992] md: unbind<sda1>

Nov  4 13:20:02 localhost kernel: [136253.936052] md: export_rdev(sda1)

Nov  4 13:20:04 localhost kernel: [136256.065984] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0xf

Nov  4 13:20:04 localhost kernel: [136256.065994] ata1.00: SError: { PHYRdyChg CommWake DevExch }

Nov  4 13:20:04 localhost kernel: [136256.066193] ata1.00: limiting SATA link speed to 1.5 Gbps

Nov  4 13:20:04 localhost kernel: [136256.066208] ata1.00: hard resetting link

Nov  4 13:20:05 localhost kernel: [136256.944371] ata1.00: SATA link up 3.0 Gbps (SStatus 123 SControl 310)

Nov  4 13:20:05 localhost kernel: [136256.944832] ata1.00: ATA-7: WDC WD20EARS-00MVWB0, 09570115, max UDMA/133

Nov  4 13:20:05 localhost kernel: [136256.944840] ata1.00: 3907029168 sectors, multi 1: LBA48 

Nov  4 13:20:05 localhost kernel: [136256.945314] ata1.00: configured for UDMA/133

Nov  4 13:20:05 localhost kernel: [136256.945460] ata1: EH complete

Nov  4 13:20:05 localhost kernel: [136256.945738] scsi 0:0:0:0: Direct-Access     ATA      WDC WD20EARS-00M 0957 PQ: 0 ANSI: 5

Nov  4 13:20:05 localhost kernel: [136256.946416] sd 0:0:0:0: [sda] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)

Nov  4 13:20:05 localhost kernel: [136256.946477] sd 0:0:0:0: Attached scsi generic sg0 type 0

Nov  4 13:20:05 localhost kernel: [136256.946585] sd 0:0:0:0: [sda] Write Protect is off

Nov  4 13:20:05 localhost kernel: [136256.946593] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00

Nov  4 13:20:05 localhost kernel: [136256.946661] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA

Nov  4 13:20:05 localhost kernel: [136256.972140]  sda: sda1

Nov  4 13:20:05 localhost kernel: [136256.972764] sd 0:0:0:0: [sda] Attached SCSI disk

Nov  4 13:36:12 localhost kernel: [137223.988215] ata2.01: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x0

Nov  4 13:36:12 localhost kernel: [137223.988226] ata2.01: failed command: READ FPDMA QUEUED

Nov  4 13:36:12 localhost kernel: [137223.988241] ata2.01: cmd 60/00:08:00:f7:0d/01:00:84:00:00/40 tag 1 ncq 131072 in

Nov  4 13:36:12 localhost kernel: [137223.988244]          res 41/40:00:00:f7:0d/00:00:84:00:00/40 Emask 0x409 (media error) <F>

Nov  4 13:36:12 localhost kernel: [137223.988251] ata2.01: status: { DRDY ERR }

Nov  4 13:36:12 localhost kernel: [137223.988256] ata2.01: error: { UNC }

Nov  4 13:36:12 localhost kernel: [137223.997843] ata2.01: configured for UDMA/133

Nov  4 13:36:12 localhost kernel: [137223.997980] ata2: EH complete

Nov  4 13:36:15 localhost kernel: [137226.880616] ata2.01: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0

Nov  4 13:36:15 localhost kernel: [137226.880626] ata2.01: failed command: READ FPDMA QUEUED

Nov  4 13:36:15 localhost kernel: [137226.880641] ata2.01: cmd 60/00:00:00:f7:0d/01:00:84:00:00/40 tag 0 ncq 131072 in

Nov  4 13:36:15 localhost kernel: [137226.880644]          res 41/40:00:00:f7:0d/00:00:84:00:00/40 Emask 0x409 (media error) <F>

Nov  4 13:36:15 localhost kernel: [137226.880651] ata2.01: status: { DRDY ERR }

Nov  4 13:36:15 localhost kernel: [137226.880657] ata2.01: error: { UNC }

Nov  4 13:36:15 localhost kernel: [137226.890139] ata2.01: configured for UDMA/133

Nov  4 13:36:15 localhost kernel: [137226.890277] ata2: EH complete

Nov  4 13:36:18 localhost kernel: [137229.773030] ata2.01: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0

Nov  4 13:36:18 localhost kernel: [137229.773040] ata2.01: failed command: READ FPDMA QUEUED

Nov  4 13:36:18 localhost kernel: [137229.773055] ata2.01: cmd 60/00:00:00:f7:0d/01:00:84:00:00/40 tag 0 ncq 131072 in

Nov  4 13:36:18 localhost kernel: [137229.773058]          res 41/40:00:00:f7:0d/00:00:84:00:00/40 Emask 0x409 (media error) <F>

Nov  4 13:36:18 localhost kernel: [137229.773065] ata2.01: status: { DRDY ERR }

Nov  4 13:36:18 localhost kernel: [137229.773070] ata2.01: error: { UNC }

Nov  4 13:36:18 localhost kernel: [137229.782439] ata2.01: configured for UDMA/133

Nov  4 13:36:18 localhost kernel: [137229.782575] ata2: EH complete

Nov  4 13:36:21 localhost kernel: [137232.665420] ata2.01: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0

Nov  4 13:36:21 localhost kernel: [137232.665430] ata2.01: failed command: READ FPDMA QUEUED

Nov  4 13:36:21 localhost kernel: [137232.665445] ata2.01: cmd 60/00:00:00:f7:0d/01:00:84:00:00/40 tag 0 ncq 131072 in

Nov  4 13:36:21 localhost kernel: [137232.665448]          res 41/40:00:00:f7:0d/00:00:84:00:00/40 Emask 0x409 (media error) <F>

Nov  4 13:36:21 localhost kernel: [137232.665455] ata2.01: status: { DRDY ERR }

Nov  4 13:36:21 localhost kernel: [137232.665460] ata2.01: error: { UNC }

Nov  4 13:36:21 localhost kernel: [137232.674736] ata2.01: configured for UDMA/133

Nov  4 13:36:21 localhost kernel: [137232.674873] ata2: EH complete

Nov  4 13:36:24 localhost kernel: [137235.557836] ata2.01: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0

Nov  4 13:36:24 localhost kernel: [137235.557845] ata2.01: failed command: READ FPDMA QUEUED

Nov  4 13:36:24 localhost kernel: [137235.557860] ata2.01: cmd 60/00:00:00:f7:0d/01:00:84:00:00/40 tag 0 ncq 131072 in

Nov  4 13:36:24 localhost kernel: [137235.557864]          res 41/40:00:00:f7:0d/00:00:84:00:00/40 Emask 0x409 (media error) <F>

Nov  4 13:36:24 localhost kernel: [137235.557871] ata2.01: status: { DRDY ERR }

Nov  4 13:36:24 localhost kernel: [137235.557876] ata2.01: error: { UNC }

Nov  4 13:36:24 localhost kernel: [137235.568037] ata2.01: configured for UDMA/133

Nov  4 13:36:24 localhost kernel: [137235.568174] ata2: EH complete

Nov  4 13:36:27 localhost kernel: [137238.450250] ata2.01: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0

Nov  4 13:36:27 localhost kernel: [137238.450260] ata2.01: failed command: READ FPDMA QUEUED

Nov  4 13:36:27 localhost kernel: [137238.450275] ata2.01: cmd 60/00:00:00:f7:0d/01:00:84:00:00/40 tag 0 ncq 131072 in

Nov  4 13:36:27 localhost kernel: [137238.450278]          res 41/40:00:00:f7:0d/00:00:84:00:00/40 Emask 0x409 (media error) <F>

Nov  4 13:36:27 localhost kernel: [137238.450285] ata2.01: status: { DRDY ERR }

Nov  4 13:36:27 localhost kernel: [137238.450290] ata2.01: error: { UNC }

Nov  4 13:36:27 localhost kernel: [137238.460333] ata2.01: configured for UDMA/133

Nov  4 13:36:27 localhost kernel: [137238.460482] sd 1:1:0:0: [sdg] Unhandled sense code

Nov  4 13:36:27 localhost kernel: [137238.460489] sd 1:1:0:0: [sdg]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

Nov  4 13:36:27 localhost kernel: [137238.460497] sd 1:1:0:0: [sdg]  Sense Key : Medium Error [current] [descriptor]

Nov  4 13:36:27 localhost kernel: [137238.460506] Descriptor sense data with sense descriptors (in hex):

Nov  4 13:36:27 localhost kernel: [137238.460511]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 

Nov  4 13:36:27 localhost kernel: [137238.460530]         84 0d f7 00 

Nov  4 13:36:27 localhost kernel: [137238.460539] sd 1:1:0:0: [sdg]  Add. Sense: Unrecovered read error - auto reallocate failed

Nov  4 13:36:27 localhost kernel: [137238.460550] sd 1:1:0:0: [sdg] CDB: Read(10): 28 00 84 0d f7 00 00 01 00 00

Nov  4 13:36:27 localhost kernel: [137238.460568] end_request: I/O error, dev sdg, sector 2215507712

Nov  4 13:36:27 localhost kernel: [137238.460576] md/raid:md127: read error not correctable (sector 2215507712 on sdg).

Nov  4 13:36:27 localhost kernel: [137238.460585] md/raid:md127: Disk failure on sdg, disabling device.

Nov  4 13:36:27 localhost kernel: [137238.460588] md/raid:md127: Operation continuing on 5 devices.

Nov  4 13:36:27 localhost kernel: [137238.460608] md/raid:md127: read error not correctable (sector 2215507720 on sdg).

Nov  4 13:36:27 localhost kernel: [137238.460617] md/raid:md127: read error not correctable (sector 2215507728 on sdg).

Nov  4 13:36:27 localhost kernel: [137238.460626] md/raid:md127: read error not correctable (sector 2215507736 on sdg).

Nov  4 13:36:27 localhost kernel: [137238.460636] md/raid:md127: read error not correctable (sector 2215507744 on sdg).

Nov  4 13:36:27 localhost kernel: [137238.460645] md/raid:md127: read error not correctable (sector 2215507752 on sdg).

Nov  4 13:36:27 localhost kernel: [137238.460654] md/raid:md127: read error not correctable (sector 2215507760 on sdg).

Nov  4 13:36:27 localhost kernel: [137238.460663] md/raid:md127: read error not correctable (sector 2215507768 on sdg).

Nov  4 13:36:27 localhost kernel: [137238.460672] md/raid:md127: read error not correctable (sector 2215507776 on sdg).

Nov  4 13:36:27 localhost kernel: [137238.460734] ata2: EH complete

Nov  4 13:36:28 localhost kernel: [137239.325822] RAID conf printout:

Nov  4 13:36:28 localhost kernel: [137239.325832]  --- level:5 rd:7 wd:5

Nov  4 13:36:28 localhost kernel: [137239.325841]  disk 1, o:1, dev:sdb1

Nov  4 13:36:28 localhost kernel: [137239.325849]  disk 2, o:1, dev:sdc1

Nov  4 13:36:28 localhost kernel: [137239.325856]  disk 3, o:1, dev:sdd1

Nov  4 13:36:28 localhost kernel: [137239.325863]  disk 4, o:1, dev:sdf

Nov  4 13:36:28 localhost kernel: [137239.325870]  disk 5, o:0, dev:sdg

Nov  4 13:36:28 localhost kernel: [137239.325877]  disk 6, o:1, dev:sdh

Nov  4 13:36:28 localhost kernel: [137239.331041] RAID conf printout:

Nov  4 13:36:28 localhost kernel: [137239.331050]  --- level:5 rd:7 wd:5

Nov  4 13:36:28 localhost kernel: [137239.331059]  disk 1, o:1, dev:sdb1

Nov  4 13:36:28 localhost kernel: [137239.331067]  disk 2, o:1, dev:sdc1

Nov  4 13:36:28 localhost kernel: [137239.331074]  disk 3, o:1, dev:sdd1

Nov  4 13:36:28 localhost kernel: [137239.331082]  disk 4, o:1, dev:sdf

Nov  4 13:36:28 localhost kernel: [137239.331089]  disk 6, o:1, dev:sdh

Nov  4 13:36:28 localhost mdadm[4131]: Fail event detected on md device /dev/md127, component device /dev/sdg

Nov  4 13:36:28 localhost kernel: [137239.351134] JBD: Detected IO errors while flushing file data on dm-2

Nov  4 13:36:29 localhost kernel: [137240.264783] Aborting journal on device dm-2.

Nov  4 13:36:29 localhost kernel: [137240.264829] EXT3-fs (dm-2): error: ext3_journal_start_sb: Detected aborted journal

Nov  4 13:36:29 localhost kernel: [137240.264841] EXT3-fs (dm-2): error: remounting filesystem read-only

Nov  4 13:36:29 localhost kernel: [137240.264860] ------------[ cut here ]------------

Nov  4 13:36:29 localhost kernel: [137240.264882] WARNING: at fs/ext3/inode.c:1571 ext3_ordered_writepage+0x3e/0x178()

Nov  4 13:36:29 localhost kernel: [137240.264890] Hardware name: To Be Filled By O.E.M.

Nov  4 13:36:29 localhost kernel: [137240.264896] Modules linked in:

Nov  4 13:36:29 localhost kernel: [137240.264908] Pid: 7478, comm: flush-253:2 Tainted: G        W   2.6.39-gentoo-r3 #1

Nov  4 13:36:29 localhost kernel: [137240.264915] Call Trace:

Nov  4 13:36:29 localhost kernel: [137240.264930]  [<c1030c99>] warn_slowpath_common+0x65/0x7a

Nov  4 13:36:29 localhost kernel: [137240.264941]  [<c1101185>] ? ext3_ordered_writepage+0x3e/0x178

Nov  4 13:36:29 localhost kernel: [137240.264952]  [<c1030cbd>] warn_slowpath_null+0xf/0x13

Nov  4 13:36:29 localhost kernel: [137240.264963]  [<c1101185>] ext3_ordered_writepage+0x3e/0x178

Nov  4 13:36:29 localhost kernel: [137240.264975]  [<c1094e74>] __writepage+0xb/0x23

Nov  4 13:36:29 localhost kernel: [137240.264985]  [<c1095ebd>] write_cache_pages+0x1b2/0x27b

Nov  4 13:36:29 localhost kernel: [137240.264996]  [<c1094e69>] ? set_page_dirty+0x59/0x59

Nov  4 13:36:29 localhost kernel: [137240.265038]  [<c1095fb8>] generic_writepages+0x32/0x46

Nov  4 13:36:29 localhost kernel: [137240.265048]  [<c1095fea>] do_writepages+0x1e/0x24

Nov  4 13:36:29 localhost kernel: [137240.265058]  [<c10d07ea>] writeback_single_inode+0xa6/0x184

Nov  4 13:36:29 localhost kernel: [137240.265068]  [<c10d0cfd>] writeback_sb_inodes+0xa3/0x102

Nov  4 13:36:29 localhost kernel: [137240.265078]  [<c10d11cf>] writeback_inodes_wb+0xcd/0xe3

Nov  4 13:36:29 localhost kernel: [137240.265090]  [<c10d138f>] wb_writeback+0x1aa/0x273

Nov  4 13:36:29 localhost kernel: [137240.265102]  [<c10d15d9>] wb_do_writeback+0x181/0x199

Nov  4 13:36:29 localhost kernel: [137240.265113]  [<c10d168b>] bdi_writeback_thread+0x9a/0x19f

Nov  4 13:36:29 localhost kernel: [137240.265125]  [<c10d15f1>] ? wb_do_writeback+0x199/0x199

Nov  4 13:36:29 localhost kernel: [137240.265133]  [<c10464d1>] kthread+0x62/0x67

Nov  4 13:36:29 localhost kernel: [137240.265143]  [<c104646f>] ? kthread_worker_fn+0x10a/0x10a

Nov  4 13:36:29 localhost kernel: [137240.265154]  [<c15eb2b6>] kernel_thread_helper+0x6/0xd

Nov  4 13:36:29 localhost kernel: [137240.265161] ---[ end trace 0a03d8e813c4f9b8 ]---

Nov  4 13:36:29 localhost kernel: [137240.266385] JBD: I/O error detected when updating journal superblock for dm-2.

Nov  4 13:36:32 localhost kernel: [137243.461088] quiet_error: 32269 callbacks suppressed

Nov  4 13:36:32 localhost kernel: [137243.461096] Buffer I/O error on device dm-2, logical block 1399516329

Nov  4 13:36:32 localhost kernel: [137243.461102] lost page write due to I/O error on dm-2

Nov  4 13:36:32 localhost kernel: [137243.461255] Buffer I/O error on device dm-2, logical block 1399516330

Nov  4 13:36:32 localhost kernel: [137243.461260] lost page write due to I/O error on dm-2

Nov  4 13:36:32 localhost kernel: [137243.461430] Buffer I/O error on device dm-2, logical block 1399516331

Nov  4 13:36:32 localhost kernel: [137243.461435] lost page write due to I/O error on dm-2

Nov  4 13:36:32 localhost kernel: [137243.461602] Buffer I/O error on device dm-2, logical block 1399516332

Nov  4 13:36:32 localhost kernel: [137243.461607] lost page write due to I/O error on dm-2

Nov  4 13:36:32 localhost kernel: [137243.461776] Buffer I/O error on device dm-2, logical block 1399516333

Nov  4 13:36:32 localhost kernel: [137243.461781] lost page write due to I/O error on dm-2

Nov  4 13:36:32 localhost kernel: [137243.461951] Buffer I/O error on device dm-2, logical block 1399516334

Nov  4 13:36:32 localhost kernel: [137243.461956] lost page write due to I/O error on dm-2

Nov  4 13:36:32 localhost kernel: [137243.462165] Buffer I/O error on device dm-2, logical block 1399516335

Nov  4 13:36:32 localhost kernel: [137243.462172] lost page write due to I/O error on dm-2

Nov  4 13:36:32 localhost kernel: [137243.462333] Buffer I/O error on device dm-2, logical block 1399516336

Nov  4 13:36:32 localhost kernel: [137243.462341] lost page write due to I/O error on dm-2

Nov  4 13:36:32 localhost kernel: [137243.462505] Buffer I/O error on device dm-2, logical block 1399516337

Nov  4 13:36:32 localhost kernel: [137243.462513] lost page write due to I/O error on dm-2

Nov  4 13:36:32 localhost kernel: [137243.462678] Buffer I/O error on device dm-2, logical block 1399516338

Nov  4 13:36:32 localhost kernel: [137243.462684] lost page write due to I/O error on dm-2

Nov  4 13:36:32 localhost kernel: [137243.906990] JBD: Detected IO errors while flushing file data on dm-2

Nov  4 13:36:32 localhost kernel: [137243.907055] __journal_remove_journal_head: freeing b_committed_dat
```

So it looks to me like sda failed and then sdg failed a few minutes later. I don't think there is anything physically wrong with the sda disk itself; I haven't seen any error messages about it.

Now, on every reboot I see this in dmesg:

```
Nov  5 17:33:56 localhost kernel: [   18.537780] md: kicking non-fresh sda1 from array!

Nov  5 17:33:56 localhost kernel: [   18.537915] md: unbind<sda1>
```

mdstat looks like this:

```
localhost ~ # cat /proc/mdstat

Personalities : [raid0] [raid1] [raid6] [raid5] [raid4]

md127 : inactive sdc1[2](S) sdb1[1](S) sdd1[3](S) sda1[0](S) sdg[5](S) sdh[6](S) sdf[4](S)

      13674591232 blocks

unused devices: <none>
```

Also:

```
localhost ~ # mdadm --assemble --scan

mdadm: /dev/md/127_0 assembled from 5 drives - not enough to start the array.

mdadm: No arrays found in config file or automatically

```
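One step often suggested for this situation (a sketch only - not something tried in this thread; the commands are echoed rather than executed, and the member list assumes sda1 is the stale drive to leave out):

```shell
# Stop the inactive array, then retry assembly with --force, which lets
# mdadm bump the most recently kicked member (sdg) back to "fresh" instead
# of refusing to start with only 5 of 7 drives.
echo mdadm --stop /dev/md127
echo mdadm --assemble --force /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf /dev/sdg /dev/sdh
```

With --force, mdadm rewrites the chosen member's event count, so data written after that member dropped out may be inconsistent; a read-only filesystem check should come next.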

I'm thinking I would like to fail and re-add sda1, because I think it got kicked out by some software bug or other issue but the hardware is OK. I'm concerned about doing this since the array was probably operating degraded for about 15 minutes until sdg got booted out. The timing can also be seen in the mdadm --examine output for the drives. Do I need to fail and re-add sda1 or sdg first?
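For reference, the event counts can be compared like this (a sketch; it assumes mdadm's usual `--examine` output format and the device names above):

```shell
# Print the md "Events" counter for each member.  The member with the
# lowest count (expected: sda1) has been out of the array the longest.
events_of() {
  mdadm --examine "$1" 2>/dev/null | awk -F: '/Events/ { gsub(/ /, "", $2); print $2 }'
}
for d in /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf /dev/sdg /dev/sdh; do
  echo "$d events=$(events_of "$d")"
done
```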

Would doing a forced create with all drives except sdg (marked missing) be a good option, or will I screw everything up because sda was knocked out first?

I was copying one large file onto the array during this time and I don't care if I lose only that one file.

I have a new spare drive of the same model that I can use to replace the sdg drive, but I think I need to do some re-adding first. Any help is much appreciated.

----------

## eccerr0r

The raw value in SMART data doesn't really mean much; it's manufacturer-dependent. The only thing the user can really read is how close the "VALUE" and "WORST" fields are to the "THRESH" (threshold) field.

The "SMART health" result, I suspect, is also read from the drive itself. If the drive thinks it's OK, then smartctl will also report OK.
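A minimal illustration of that comparison, assuming smartctl's usual attribute-table layout (normalized VALUE in column 4, THRESH in column 6): the drive only flags an attribute when VALUE drops to THRESH, no matter how large the raw count gets.

```shell
# Sample attribute line from the smartctl output above; an attribute only
# "fails" when the normalized VALUE (col 4) reaches THRESH (col 6).
line='  1 Raw_Read_Error_Rate     0x002f   157   157   051    Pre-fail  Always       -       7628'
echo "$line" | awk '{ if ($4 + 0 <= $6 + 0) print $2 " FAILING"; else print $2 " ok despite raw value " $NF }'
# prints: Raw_Read_Error_Rate ok despite raw value 7628
```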

I had a recent issue with my RAID as well, where one of the disks' data cables got unplugged. This caused a read failure... well, not a real read failure, since the cable just got unplugged... Luckily the intent of the RAID - reducing unexpected downtime - held up: the machine stayed up despite having a disk dislodged.

Not sure if this has anything to do with the failure you're seeing but the SMART data still isn't always conclusive.

Unfortunately I've never had to force-assemble RAIDs with inconsistent timestamps... I always back up my RAID to another medium just in case. You might be able to hack the timestamps on the disks to trick md into thinking the disk it kicked is still fresh...

----------

## kevintshaver

Thanks for the SMART data info.

As for the RAID, I re-created the array using 6 of the 7 drives, leaving out only the one that was kicked out first:

```
localhost ~ # mdadm --create --verbose /dev/md0 --level=5 --chunk=128 --raid-devices=7 missing /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde /dev/sdf /dev/sdg
```

This seems to work.

```
mdadm: size set to 1953510784K

Continue creating array? y

mdadm: Defaulting to version 1.2 metadata

mdadm: array /dev/md0 started.

localhost ~ # cat /proc/mdstat

Personalities : [raid0] [raid1] [raid6] [raid5] [raid4]

md0 : active raid5 sdg[6] sdf[5] sde[4] sdd1[3] sdc1[2] sdb1[1]

      11721064704 blocks super 1.2 level 5, 128k chunk, algorithm 2 [7/6] [_UUUUUU]

unused devices: <none>

```

Now I add the seventh drive back in by running:

```
localhost ~ # mdadm /dev/md0 -a /dev/sda1
```

It starts syncing but fails because sdf is going bad - it's the drive with the high read error count. Any suggestions on how I can get around this? I need it to keep retrying when it hits an error instead of failing out the whole process.

```
[ 1449.885054] ata2.01: status: { DRDY ERR }

[ 1449.885062] ata2.01: error: { UNC }

[ 1449.894743] ata2.01: configured for UDMA/133

[ 1449.894917] sd 1:1:0:0: [sdf] Unhandled sense code

[ 1449.894923] sd 1:1:0:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

[ 1449.894931] sd 1:1:0:0: [sdf]  Sense Key : Medium Error [current] [descriptor]

[ 1449.894940] Descriptor sense data with sense descriptors (in hex):

[ 1449.894945]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00

[ 1449.894964]         00 2e 02 00

[ 1449.894973] sd 1:1:0:0: [sdf]  Add. Sense: Unrecovered read error - auto reallocate failed

[ 1449.894983] sd 1:1:0:0: [sdf] CDB: Read(10): 28 00 00 2e 00 d8 00 03 f8 00

[ 1449.895001] end_request: I/O error, dev sdf, sector 3015168

[ 1449.895023] md/raid:md0: read error not correctable (sector 3015168 on sdf).

[ 1449.895031] md/raid:md0: Disk failure on sdf, disabling device.

[ 1449.895034] md/raid:md0: Operation continuing on 5 devices.

[ 1449.895054] md/raid:md0: read error not correctable (sector 3015176 on sdf).

[ 1449.895068] md/raid:md0: read error not correctable (sector 3015184 on sdf).

[ 1449.895079] md/raid:md0: read error not correctable (sector 3015192 on sdf).

[ 1449.895091] md/raid:md0: read error not correctable (sector 3015200 on sdf).

[ 1449.895105] md/raid:md0: read error not correctable (sector 3015208 on sdf).

[ 1449.895117] md/raid:md0: read error not correctable (sector 3015216 on sdf).

[ 1449.895129] md/raid:md0: read error not correctable (sector 3015224 on sdf).

[ 1449.895141] md/raid:md0: read error not correctable (sector 3015232 on sdf).

[ 1449.895151] md/raid:md0: read error not correctable (sector 3015240 on sdf).

[ 1449.895272] ata2: EH complete

[ 1450.779065] md: md0: recovery done.

```

```
localhost ~ # cat /proc/mdstat

Personalities : [raid0] [raid1] [raid6] [raid5] [raid4]

md0 : active raid5 sda1[7](S) sdg[6] sdf[5](F) sde[4] sdd1[3] sdc1[2] sdb1[1]

      11721064704 blocks super 1.2 level 5, 128k chunk, algorithm 2 [7/5] [_UUUU_U]

unused devices: <none>

```

----------

## NeddySeagoon

kevintshaver,

Write zeros from /dev/zero to the whole drive first. This will force sector remapping, so the read errors should vanish.

If you still get read errors, the drive is scrap. That would indicate the drive has no more spare sectors left to remap failed sectors to.
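As a command sketch (destructive - `/dev/sdX` is a placeholder for the failing drive, and the commands are echoed here rather than executed):

```shell
DISK=/dev/sdX   # placeholder: substitute the failing drive
# Zeroing the whole drive forces the firmware to remap weak/pending sectors:
echo dd if=/dev/zero of="$DISK" bs=1M conv=fsync
# Afterwards, re-check SMART; a non-zero Current_Pending_Sector count or
# fresh read errors would mean the spare-sector pool is exhausted:
echo smartctl -A "$DISK"
```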

----------

## kevintshaver

NeddySeagoon, I will give this a try, but first I need to try to recover the array.

Option 1: get it to resync upon re-adding sda without failing out due to read errors on sdf

Option 2: try to recreate with 6 out of 7 drives with sda included and sdf marked missing. sda is a good drive with no read errors, bad sectors, etc. The problem here is that sda was kicked out first while data was being written to the array. Does anyone know if this option would result in a completely unusable array or if only the data being written after sda was kicked out and before sdf failed would be unrecoverable?

Thanks.

----------

## HeissFuss

 *Quote:*   

> Option 1: get it to resync upon re-adding sda without failing out due to read errors on sdf
> 
> Option 2: try to recreate with 6 out of 7 drives with sda included and sdf marked missing. sda is a good drive with no read errors, bad sectors, etc. The problem here is that sda was kicked out first while data was being written to the array. Does anyone know if this option would result in a completely unusable array or if only the data being written after sda was kicked out and before sdf failed would be unrecoverable? 

 

I don't think option 2 is really valid any more; I doubt it will assemble with an active drive that was only partially synced. As NeddySeagoon mentioned, if there aren't too many bad sectors you could try writing zeros to them. Since you actually want to save the data, though, you'd want to overwrite only the bad sectors, not the entire disk (make a list via badblocks, then write them via hdparm).
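That badblocks-then-hdparm approach might look like this (a sketch; `/dev/sdX` is a placeholder, the sector numbers are samples from the log above, and the destructive commands are echoed rather than executed):

```shell
DISK=/dev/sdX   # placeholder for the failing drive
# Read-only scan listing unreadable 512-byte sectors (save output to a file):
echo badblocks -sv -b 512 "$DISK"
# Overwrite just the listed sectors so the drive remaps them; hdparm
# insists on its safety flag for this destructive operation.
printf '%s\n' 3015168 3015176 > bad-sectors.txt   # sample sectors
while read -r sec; do
  echo hdparm --yes-i-know-what-i-am-doing --write-sector "$sec" "$DISK"
done < bad-sectors.txt
rm -f bad-sectors.txt
```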

Option 1: I don't know of any way in the current md version to get it to skip errors (I assume those sectors won't read properly no matter how many times you try.)

Here's another option though:

Use dd-rescue or similar to copy sdf to sda (after making a dd image of sda). This will allow you to end up with zeros on sda in place of the blocks that were unrecoverable on sdf. Assuming there aren't too many bad sectors on sdf, you'll probably have some files with chunks of data missing. You would have been better off trying to force-assemble with sda, but it's probably too late for that now.
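With GNU ddrescue, that might look like the following (a sketch; the mapfile name is an assumption and the commands are echoed rather than executed):

```shell
# Copy everything readable from the failing sdf onto sda, recording
# progress in a mapfile so the run can be stopped and resumed:
echo ddrescue -f /dev/sdf /dev/sda rescue.map
# ddrescue leaves unrecovered areas untouched on the target; its fill
# mode can then write zeros over exactly the bad-sector areas:
echo ddrescue --fill-mode=- -f /dev/zero /dev/sda rescue.map
```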

----------

## NeddySeagoon

kevintshaver,

If you do some dirty hack to trick a broken raid into assembling with less data than it really needs, you will make some of the files on the raid set corrupt.

The raid can't tell that it's been tricked, so when you add the redundant drive back, it will sync assuming that all the other (non-redundant) data is correct. Of course, we know it's not, as you will probably write zeros to the damaged blocks to force remapping.

If you are really, really lucky, the read errors are happening in disk space that has not been allocated by the filesystem, so your data will be OK.

If it's in space allocated to a file, then that file - and possibly all the files that share that raid chunk - will be corrupt. If a directory is affected, so are all the files in that directory and its subdirectories.

If the problem is with the filesystem inode table ...

I suspect that the faulty blocks are in use for something, or the raid would not be trying to read them in the first place and dropping drives out of the array as a result of the error(s).

Forcing the array to assemble with a missing drive is harmless - it can't rebuild. Attempting to mount the filesystem read-only is also harmless as long as you also prevent journal recovery, so that there are no writes at all.

```
mount -o ro
```

 still permits the journal to be replayed, so read man mount to find the other option you need.
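For ext3 the option being hinted at is most likely `noload`, which skips journal recovery entirely (a sketch; the mount point is an assumption and the command is echoed rather than executed):

```shell
# Read-only AND no journal replay, so nothing at all is written to the
# degraded array:
echo mount -t ext3 -o ro,noload /dev/md0 /mnt/recovery
```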

----------

## kevintshaver

NeddySeagoon,

Thanks. That answers my main question. Right now I'm actually running ddrescue per HeissFuss's recommendation, and then I will rebuild the array using the ddrescued copy.

It's taking forever - I think I'm at 10+ days now. It rescued about 1995 GB pretty quickly and has been really struggling over the last 5 GB. I'm thinking about stopping it, doing something to the bad disk (banging it, changing orientation, etc.), and then trying again to see if I can get it to finish that 5 GB. It doesn't look like it's making much progress anymore; it gets a successful read maybe every 12 hours or so.

I have another question now. I think there is no reason to write zeros to the ddrescued copy because it does not really have any bad sectors (brand new disk). Is that correct, or do I need to write zeros even on the "good" copy?

----------

## NeddySeagoon

kevintshaver,

That last 5 GB is all bad. No need to stop it, just change the orientation while it runs.

ddrescue keeps cycling round the unrecovered blocks. It will only terminate on success.
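If this is GNU ddrescue, it can also be stopped and resumed safely as long as the same mapfile is reused, and told to keep retrying only the bad areas (a sketch; the mapfile name is an assumption and the command is echoed rather than executed):

```shell
# Resume from the mapfile, retry bad areas indefinitely (-r -1), and read
# in the reverse direction (-R), which sometimes recovers marginal sectors:
echo ddrescue -f -r -1 -R /dev/sdf /dev/sda rescue.map
```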

----------

