# RAID 10: all drives marked as spare

## jpp_

Hi, sorry for my bad English; I need some expert help.

Everything was working fine, and then from one moment to the next the md0 device stopped working.

The array consists of 8 x 4 TB drives:

```
cat /proc/mdstat 

Personalities : [linear] [raid0] [raid1] [raid10] 

md0 : inactive sdi1[6](S) sde1[2](S) sdf1[3](S) sdg1[5](S) sdj1[7](S) sdh1[4](S) sdc1[0](S) sdd1[1](S)

      31254904832 blocks super 1.2

       

unused devices: <none>

```

I ran

mdadm --stop /dev/md0

and

mdadm -A /dev/md0 --verbose

and I got:

```
mdadm: looking for devices for /dev/md0

mdadm: No super block found on /dev/sdj (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/sdj

mdadm: No super block found on /dev/sdi (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/sdi

mdadm: No super block found on /dev/sdh (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/sdh

mdadm: No super block found on /dev/sdg (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/sdg

mdadm: No super block found on /dev/sdf (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/sdf

mdadm: No super block found on /dev/sde (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/sde

mdadm: No super block found on /dev/sdd (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/sdd

mdadm: No super block found on /dev/sdc (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/sdc

mdadm: No super block found on /dev/sdb4 (Expected magic a92b4efc, got 0fc02366)

mdadm: no RAID superblock on /dev/sdb4

mdadm: No super block found on /dev/sdb3 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/sdb3

mdadm: No super block found on /dev/sdb2 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/sdb2

mdadm: No super block found on /dev/sdb1 (Expected magic a92b4efc, got 0fc02366)

mdadm: no RAID superblock on /dev/sdb1

mdadm: No super block found on /dev/sdb (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/sdb

mdadm: No super block found on /dev/sda2 (Expected magic a92b4efc, got 00000439)

mdadm: no RAID superblock on /dev/sda2

mdadm: No super block found on /dev/sda1 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/sda1

mdadm: No super block found on /dev/sda (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/sda

mdadm: No super block found on /dev/ram15 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram15

mdadm: No super block found on /dev/ram14 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram14

mdadm: No super block found on /dev/ram13 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram13

mdadm: No super block found on /dev/ram12 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram12

mdadm: No super block found on /dev/ram11 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram11

mdadm: No super block found on /dev/ram10 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram10

mdadm: No super block found on /dev/ram9 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram9

mdadm: No super block found on /dev/ram8 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram8

mdadm: No super block found on /dev/ram7 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram7

mdadm: No super block found on /dev/ram6 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram6

mdadm: No super block found on /dev/ram5 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram5

mdadm: No super block found on /dev/ram4 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram4

mdadm: No super block found on /dev/ram3 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram3

mdadm: No super block found on /dev/ram2 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram2

mdadm: No super block found on /dev/ram1 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram1

mdadm: No super block found on /dev/ram0 (Expected magic a92b4efc, got 00000000)

mdadm: no RAID superblock on /dev/ram0

mdadm: /dev/sdj1 is identified as a member of /dev/md0, slot 7.

mdadm: /dev/sdi1 is identified as a member of /dev/md0, slot 6.

mdadm: /dev/sdh1 is identified as a member of /dev/md0, slot 4.

mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 5.

mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 3.

mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 2.

mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 1.

mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 0.

mdadm: added /dev/sdd1 to /dev/md0 as 1

mdadm: added /dev/sde1 to /dev/md0 as 2

mdadm: added /dev/sdf1 to /dev/md0 as 3

mdadm: added /dev/sdh1 to /dev/md0 as 4 (possibly out of date)

mdadm: added /dev/sdg1 to /dev/md0 as 5 (possibly out of date)

mdadm: added /dev/sdi1 to /dev/md0 as 6

mdadm: added /dev/sdj1 to /dev/md0 as 7

mdadm: added /dev/sdc1 to /dev/md0 as 0

mdadm: /dev/md0 assembled from 6 drives - not enough to start the array.
```

cat /etc/mdadm.conf

```
DEVICE partitions

ARRAY /dev/md0 metadata=1.2 name=SAMSARA:0 UUID=01c86d33:50403b16:40e0da06:a7e3510a
```

I ran smartctl -H on all the members and it returned PASSED.

mdadm --examine /dev/sd[cdefghij][1]

```
/dev/sdc1:

          Magic : a92b4efc

        Version : 1.2

    Feature Map : 0x1

     Array UUID : 01c86d33:50403b16:40e0da06:a7e3510a

           Name : SAMSARA:0  (local to host SAMSARA)

  Creation Time : Sat Mar  4 20:42:41 2017

     Raid Level : raid10

   Raid Devices : 8

 Avail Dev Size : 7813726208 (3725.88 GiB 4000.63 GB)

     Array Size : 15627452416 (14903.50 GiB 16002.51 GB)

    Data Offset : 262144 sectors

   Super Offset : 8 sectors

   Unused Space : before=262056 sectors, after=0 sectors

          State : clean

    Device UUID : ed2cd418:9ad99b32:f96d201a:1ecb5483

Internal Bitmap : 8 sectors from superblock

    Update Time : Tue Apr  4 20:01:58 2017

  Bad Block Log : 512 entries available at offset 72 sectors

       Checksum : cf6eac39 - correct

         Events : 81935

         Layout : far=2

     Chunk Size : 512K

   Device Role : Active device 0

   Array State : AAAA.AAA ('A' == active, '.' == missing, 'R' == replacing)

/dev/sdd1:

          Magic : a92b4efc

        Version : 1.2

    Feature Map : 0x1

     Array UUID : 01c86d33:50403b16:40e0da06:a7e3510a

           Name : SAMSARA:0  (local to host SAMSARA)

  Creation Time : Sat Mar  4 20:42:41 2017

     Raid Level : raid10

   Raid Devices : 8

 Avail Dev Size : 7813726208 (3725.88 GiB 4000.63 GB)

     Array Size : 15627452416 (14903.50 GiB 16002.51 GB)

    Data Offset : 262144 sectors

   Super Offset : 8 sectors

   Unused Space : before=262056 sectors, after=0 sectors

          State : clean

    Device UUID : 9c1d4e4f:9b4f461b:e911528b:5c20dd61

Internal Bitmap : 8 sectors from superblock

    Update Time : Tue Apr  4 20:01:58 2017

  Bad Block Log : 512 entries available at offset 72 sectors

       Checksum : 3e4d0c19 - correct

         Events : 81935

         Layout : far=2

     Chunk Size : 512K

   Device Role : Active device 1

   Array State : AAAA.AAA ('A' == active, '.' == missing, 'R' == replacing)

/dev/sde1:

          Magic : a92b4efc

        Version : 1.2

    Feature Map : 0x1

     Array UUID : 01c86d33:50403b16:40e0da06:a7e3510a

           Name : SAMSARA:0  (local to host SAMSARA)

  Creation Time : Sat Mar  4 20:42:41 2017

     Raid Level : raid10

   Raid Devices : 8

 Avail Dev Size : 7813726208 (3725.88 GiB 4000.63 GB)

     Array Size : 15627452416 (14903.50 GiB 16002.51 GB)

    Data Offset : 262144 sectors

   Super Offset : 8 sectors

   Unused Space : before=262056 sectors, after=0 sectors

          State : clean

    Device UUID : b149ff1d:5734f435:1310609a:bdcde3b3

Internal Bitmap : 8 sectors from superblock

    Update Time : Tue Apr  4 20:01:58 2017

  Bad Block Log : 512 entries available at offset 72 sectors

       Checksum : 88c0c876 - correct

         Events : 81935

         Layout : far=2

     Chunk Size : 512K

   Device Role : Active device 2

   Array State : AAAA.AAA ('A' == active, '.' == missing, 'R' == replacing)

/dev/sdf1:

          Magic : a92b4efc

        Version : 1.2

    Feature Map : 0x1

     Array UUID : 01c86d33:50403b16:40e0da06:a7e3510a

           Name : SAMSARA:0  (local to host SAMSARA)

  Creation Time : Sat Mar  4 20:42:41 2017

     Raid Level : raid10

   Raid Devices : 8

 Avail Dev Size : 7813726208 (3725.88 GiB 4000.63 GB)

     Array Size : 15627452416 (14903.50 GiB 16002.51 GB)

    Data Offset : 262144 sectors

   Super Offset : 8 sectors

   Unused Space : before=262056 sectors, after=0 sectors

          State : clean

    Device UUID : 4859a760:210baa65:3d511039:47210e1e

Internal Bitmap : 8 sectors from superblock

    Update Time : Tue Apr  4 20:01:58 2017

  Bad Block Log : 512 entries available at offset 72 sectors

       Checksum : 3f9438c - correct

         Events : 81935

         Layout : far=2

     Chunk Size : 512K

   Device Role : Active device 3

   Array State : AAAA.AAA ('A' == active, '.' == missing, 'R' == replacing)

/dev/sdg1:

          Magic : a92b4efc

        Version : 1.2

    Feature Map : 0x1

     Array UUID : 01c86d33:50403b16:40e0da06:a7e3510a

           Name : SAMSARA:0  (local to host SAMSARA)

  Creation Time : Sat Mar  4 20:42:41 2017

     Raid Level : raid10

   Raid Devices : 8

 Avail Dev Size : 7813726208 (3725.88 GiB 4000.63 GB)

     Array Size : 15627452416 (14903.50 GiB 16002.51 GB)

    Data Offset : 262144 sectors

   Super Offset : 8 sectors

   Unused Space : before=262056 sectors, after=0 sectors

          State : active

    Device UUID : 7491a021:b72102d8:af4afc18:13e6992f

Internal Bitmap : 8 sectors from superblock

    Update Time : Tue Apr  4 19:31:16 2017

  Bad Block Log : 512 entries available at offset 72 sectors

       Checksum : 28c2460c - correct

         Events : 81087

         Layout : far=2

     Chunk Size : 512K

   Device Role : Active device 5

   Array State : AAAA.AAA ('A' == active, '.' == missing, 'R' == replacing)

/dev/sdh1:

          Magic : a92b4efc

        Version : 1.2

    Feature Map : 0x1

     Array UUID : 01c86d33:50403b16:40e0da06:a7e3510a

           Name : SAMSARA:0  (local to host SAMSARA)

  Creation Time : Sat Mar  4 20:42:41 2017

     Raid Level : raid10

   Raid Devices : 8

 Avail Dev Size : 7813726208 (3725.88 GiB 4000.63 GB)

     Array Size : 15627452416 (14903.50 GiB 16002.51 GB)

    Data Offset : 262144 sectors

   Super Offset : 8 sectors

   Unused Space : before=262056 sectors, after=0 sectors

          State : active

    Device UUID : 5de7059a:efd220fe:5892cabb:f5113b83

Internal Bitmap : 8 sectors from superblock

    Update Time : Sat Mar 25 22:18:51 2017

  Bad Block Log : 512 entries available at offset 72 sectors

       Checksum : bda6d89a - correct

         Events : 23636

         Layout : far=2

     Chunk Size : 512K

   Device Role : Active device 4

   Array State : AAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)

/dev/sdi1:

          Magic : a92b4efc

        Version : 1.2

    Feature Map : 0x1

     Array UUID : 01c86d33:50403b16:40e0da06:a7e3510a

           Name : SAMSARA:0  (local to host SAMSARA)

  Creation Time : Sat Mar  4 20:42:41 2017

     Raid Level : raid10

   Raid Devices : 8

 Avail Dev Size : 7813726208 (3725.88 GiB 4000.63 GB)

     Array Size : 15627452416 (14903.50 GiB 16002.51 GB)

    Data Offset : 262144 sectors

   Super Offset : 8 sectors

   Unused Space : before=262056 sectors, after=0 sectors

          State : clean

    Device UUID : 2f3185f1:7f28054e:c7ca2e6c:1951893d

Internal Bitmap : 8 sectors from superblock

    Update Time : Tue Apr  4 20:01:58 2017

  Bad Block Log : 512 entries available at offset 72 sectors

       Checksum : cfcbe230 - correct

         Events : 81935

         Layout : far=2

     Chunk Size : 512K

   Device Role : Active device 6

   Array State : AAAA.AAA ('A' == active, '.' == missing, 'R' == replacing)

/dev/sdj1:

          Magic : a92b4efc

        Version : 1.2

    Feature Map : 0x1

     Array UUID : 01c86d33:50403b16:40e0da06:a7e3510a

           Name : SAMSARA:0  (local to host SAMSARA)

  Creation Time : Sat Mar  4 20:42:41 2017

     Raid Level : raid10

   Raid Devices : 8

 Avail Dev Size : 7813726208 (3725.88 GiB 4000.63 GB)

     Array Size : 15627452416 (14903.50 GiB 16002.51 GB)

    Data Offset : 262144 sectors

   Super Offset : 8 sectors

   Unused Space : before=262056 sectors, after=0 sectors

          State : clean

    Device UUID : ec9b9519:32963113:36468058:c86e42b0

Internal Bitmap : 8 sectors from superblock

    Update Time : Tue Apr  4 20:01:58 2017

  Bad Block Log : 512 entries available at offset 72 sectors

       Checksum : 1c1353bf - correct

         Events : 81935

         Layout : far=2

     Chunk Size : 512K

   Device Role : Active device 7

   Array State : AAAA.AAA ('A' == active, '.' == missing, 'R' == replacing)

```

A side note:

The command I originally used to create the array was:

```
mdadm --create --verbose --level=10 --metadata=1.2 --chunk=512 --raid-devices=8 --layout=f2 /dev/md0 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1  /dev/sdh1  /dev/sdi1 /dev/sdj1
```

And to format it:

```
mkfs.ext4 -v -L DATOS -m 0.01 -b 4096 -E stride=128,stripe-width=1024 /dev/md0
```
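For reference, those stride/stripe-width values follow from the chunk and block sizes. A quick sanity check of the arithmetic (assuming the common rule that RAID10 with 2 copies has disks/copies data-bearing spindles):

```
# stride = chunk size / filesystem block size
# stripe-width = stride * data-bearing disks (disks / copies for RAID10)
chunk_kb=512; block_kb=4; disks=8; copies=2
stride=$((chunk_kb / block_kb))
stripe_width=$((stride * disks / copies))
echo "stride=$stride stripe-width=$stripe_width"
```

By that rule stripe-width comes out as 512 rather than the 1024 I used; 1024 corresponds to counting all 8 spindles instead.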

I have mdraid loaded at boot, but it always shows a red *; I believe that's because the kernel autodetects the arrays before the mdraid init script runs, so the array is already active by then.

How can I reactivate the array?

Update:

mdadm --stop /dev/md0

```
mdadm: stopped /dev/md0
```

mdadm --assemble /dev/md0 /dev/sd[cdefghij]1

```
mdadm: /dev/sdc1 is busy - skipping

mdadm: /dev/sdd1 is busy - skipping

mdadm: /dev/sde1 is busy - skipping

mdadm: /dev/sdf1 is busy - skipping

mdadm: /dev/sdg1 is busy - skipping

mdadm: /dev/sdh1 is busy - skipping

mdadm: /dev/sdi1 is busy - skipping

mdadm: /dev/sdj1 is busy - skipping
```

I think they were busy because I had parted open.

Then I tried again, in the order shown by --examine:

mdadm --assemble /dev/md0 /dev/sd[cdefhgij]1

```
mdadm: /dev/md0 assembled from 6 drives - not enough to start the array.
```

Thanks very much.

*Last edited by jpp_ on Wed Apr 05, 2017 12:06 am; edited 1 time in total*

----------

## NeddySeagoon

juampii,

The problem elements are

```
/dev/sdh1:

...

    Update Time : Sat Mar 25 22:18:51 2017 

         Events : 23636

/dev/sdg1:

...

    Update Time : Tue Apr  4 19:31:16 2017

         Events : 81087 

```

For reference, a current element is

```
/dev/sdc1: 

    Update Time : Tue Apr  4 20:01:58 2017

          Events : 81935 
```

Your raid has been running in degraded mode since Sat Mar 25 22:18:51 2017.

Then on Tue Apr  4 19:31:16 2017 its partner element went offline so both mirrors were lost.

```
mdadm: added /dev/sdh1 to /dev/md0 as 4 (possibly out of date)

mdadm: added /dev/sdg1 to /dev/md0 as 5 (possibly out of date) 
```
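Sorting the members by Events makes the stale elements stand out. A sketch over the Events values quoted from your --examine output (on the live system you would feed it `mdadm --examine /dev/sd[c-j]1` instead):

```
# Events values copied from the --examine output above; stale members sort first.
examine_sample='/dev/sdc1 Events : 81935
/dev/sdg1 Events : 81087
/dev/sdh1 Events : 23636
/dev/sdj1 Events : 81935'
echo "$examine_sample" | sort -k4 -n
```

/dev/sdh1 and /dev/sdg1 come out on top, exactly the two that mdadm flagged as possibly out of date.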

Kernel raid autodetect only works for --metadata=0.9 arrays.

The right course of action is to restore from your backups. 

I guess you can't do that or you would not be posting, so what do you hope to achieve?

What I/O errors appeared in dmesg when /dev/sdh1 and /dev/sdg1 were kicked out of the array?

----------

## frostschutz

 *juampii wrote:*   

> 
> 
> I run smartctl -H for all members and it return PASSED
> 
> 

 

That unfortunately doesn't mean anything. smartctl -H almost always says PASSED even for completely broken disks.

You have to look at the full `smartctl -a` output. Does it show any Reallocated, Pending, or Offline Uncorrectable sectors?
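The raw values (last column) are what matter; on a healthy disk they are all 0. A sketch, using stand-in lines in place of real `smartctl -a /dev/sdX` output:

```
# Non-zero raw values in these attributes mean a dying disk, even if -H says PASSED.
smart_sample='  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0'
echo "$smart_sample" | awk '$10 > 0 {print $2 "=" $10; bad=1} END {exit bad}' \
  && echo "attributes clean"
```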

smartmontools/smartd should be set up to run self-tests regularly on your disks and notify you by email when errors occur.

mdadm's monitor mode should be set up to notify you by email when disks get kicked out of your RAID.

Do you have MAILADDR in your mdadm.conf? It doesn't look like it, so travel back in time and add it. And make sure the mdadm monitor is running.
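For example (the MAILADDR value here is a placeholder; your existing ARRAY line stays as it is):

```
DEVICE partitions
ARRAY /dev/md0 metadata=1.2 name=SAMSARA:0 UUID=01c86d33:50403b16:40e0da06:a7e3510a
MAILADDR you@example.com
```

Then start monitor mode at boot, e.g. `mdadm --monitor --scan --daemonise`.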

For data recovery, the Linux raid wiki suggests creating overlays: https://raid.wiki.kernel.org/index.php/Recovering_a_damaged_RAID#Making_the_harddisks_read-only_using_an_overlay_file

Those overlays let you run experiments (force assemble, create, fsck, ...) without any real writes hitting the disks.

This works as long as the disks themselves are okay. If you have bad sectors, you should ddrescue to a fresh disk first; the overlay does not handle read errors, and there's a risk the disk will die completely.

You have one completely outdated disk and one slightly outdated disk, so the first experiment you run could be a force assemble that leaves out the completely outdated one:

```

mdadm --stop /dev/md*

mdadm --assemble --force /dev/md0 /dev/mapper/overlay_sd[cdefgij]1 # sdh is missing on purpose here

```

----------

## jpp_

 *NeddySeagoon wrote:*   

> juampii,
> 
> The right course of action is to restore from your backups. 
> 
> I guess you can't do that or you would not be posting, so what do you hope to achieve?
> ...

 

Hi Neddy and frostschutz, thanks for answering.

I have an outdated backup with most of the files. I'm trying to restore the array to a safe state; I think the disks are okay because they are new, but I don't know, so I'm going to follow frostschutz's suggestions.

Should I post the output of dmesg? Here it is:

https://pastebin.com/T9mh6nRA

smartctl -a /dev/sdh

```
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.10.5-gentoo] (local build)

Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===

Device Model:     WDC WD4004FZWX-00GBGB0

Serial Number:    N8GEY3KY

LU WWN Device Id: 5 000cca 244c6561e

Firmware Version: 81.H0A81

User Capacity:    4,000,787,030,016 bytes [4.00 TB]

Sector Sizes:     512 bytes logical, 4096 bytes physical

Rotation Rate:    7200 rpm

Form Factor:      3.5 inches

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4

SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)

Local Time is:    Wed Apr  5 10:08:07 2017 -03

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status:  (0x80)   Offline data collection activity

               was never started.

               Auto Offline Data Collection: Enabled.

Self-test execution status:      (   0)   The previous self-test routine completed

               without error or no self-test has ever 

               been run.

Total time to complete Offline 

data collection:       (  113) seconds.

Offline data collection

capabilities:           (0x5b) SMART execute Offline immediate.

               Auto Offline data collection on/off support.

               Suspend Offline collection upon new

               command.

               Offline surface scan supported.

               Self-test supported.

               No Conveyance Self-test supported.

               Selective Self-test supported.

SMART capabilities:            (0x0003)   Saves SMART data before entering

               power-saving mode.

               Supports SMART auto save timer.

Error logging capability:        (0x01)   Error logging supported.

               General Purpose Logging supported.

Short self-test routine 

recommended polling time:     (   2) minutes.

Extended self-test routine

recommended polling time:     ( 571) minutes.

SCT capabilities:           (0x0035)   SCT Status supported.

               SCT Feature Control supported.

               SCT Data Table supported.

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0

  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       108

  3 Spin_Up_Time            0x0007   137   137   024    Pre-fail  Always       -       398 (Average 378)

  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       187

  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0

  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0

  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18

  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       614

 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       187

192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       197

193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       197

194 Temperature_Celsius     0x0002   181   181   000    Old_age   Always       -       33 (Min/Max 23/46)

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0

197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       1

SMART Error Log Version: 1

ATA Error Count: 1

   CR = Command Register [HEX]

   FR = Features Register [HEX]

   SC = Sector Count Register [HEX]

   SN = Sector Number Register [HEX]

   CL = Cylinder Low Register [HEX]

   CH = Cylinder High Register [HEX]

   DH = Device/Head Register [HEX]

   DC = Device Command Register [HEX]

   ER = Error register [HEX]

   ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 469 hours (19 days + 13 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  61 28 e8 00 58 b6 40 08      02:57:54.802  WRITE FPDMA QUEUED

  61 00 18 00 84 94 40 08      02:57:54.795  WRITE FPDMA QUEUED

  61 20 10 00 54 94 40 08      02:57:54.795  WRITE FPDMA QUEUED

  61 50 08 00 84 b6 40 08      02:57:54.795  WRITE FPDMA QUEUED

  61 00 00 00 78 b6 40 08      02:57:54.795  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

```

smartctl -a /dev/sdg

```
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.10.5-gentoo] (local build)

Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===

Device Model:     WDC WD4004FZWX-00GBGB0

Serial Number:    K4H8V66B

LU WWN Device Id: 5 000cca 25dd21cd3

Firmware Version: 81.H0A81

User Capacity:    4,000,787,030,016 bytes [4.00 TB]

Sector Sizes:     512 bytes logical, 4096 bytes physical

Rotation Rate:    7200 rpm

Form Factor:      3.5 inches

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4

SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)

Local Time is:    Wed Apr  5 10:09:50 2017 -03

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status:  (0x80)   Offline data collection activity

               was never started.

               Auto Offline Data Collection: Enabled.

Self-test execution status:      (   0)   The previous self-test routine completed

               without error or no self-test has ever 

               been run.

Total time to complete Offline 

data collection:       (  113) seconds.

Offline data collection

capabilities:           (0x5b) SMART execute Offline immediate.

               Auto Offline data collection on/off support.

               Suspend Offline collection upon new

               command.

               Offline surface scan supported.

               Self-test supported.

               No Conveyance Self-test supported.

               Selective Self-test supported.

SMART capabilities:            (0x0003)   Saves SMART data before entering

               power-saving mode.

               Supports SMART auto save timer.

Error logging capability:        (0x01)   Error logging supported.

               General Purpose Logging supported.

Short self-test routine 

recommended polling time:     (   2) minutes.

Extended self-test routine

recommended polling time:     ( 571) minutes.

SCT capabilities:           (0x0035)   SCT Status supported.

               SCT Feature Control supported.

               SCT Data Table supported.

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0

  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       108

  3 Spin_Up_Time            0x0007   148   148   024    Pre-fail  Always       -       390 (Average 330)

  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       114

  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0

  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0

  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18

  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       480

 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       114

192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       123

193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       123

194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 24/43)

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0

197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       1

SMART Error Log Version: 1

ATA Error Count: 1

   CR = Command Register [HEX]

   FR = Features Register [HEX]

   SC = Sector Count Register [HEX]

   SN = Sector Number Register [HEX]

   CL = Cylinder Low Register [HEX]

   CH = Cylinder High Register [HEX]

   DH = Device/Head Register [HEX]

   DC = Device Command Register [HEX]

   ER = Error register [HEX]

   ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 472 hours (19 days + 16 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  61 00 c8 00 6c 1b 40 08      07:45:03.131  WRITE FPDMA QUEUED

  61 00 e0 00 90 11 40 08      07:45:03.130  WRITE FPDMA QUEUED

  61 00 d8 00 90 ef 40 08      07:45:03.130  WRITE FPDMA QUEUED

  61 00 d0 00 6c 3d 40 08      07:45:03.130  WRITE FPDMA QUEUED

  ea 00 00 00 00 00 a0 08      07:45:03.128  FLUSH CACHE EXT

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

```

----------

## jpp_

I'm trying to create the overlays, but I have some doubts. Any advice would be appreciated.

I ran:

# UUID=$(mdadm -E /dev/sdd1|perl -ne '/Array UUID : (\S+)/ and print $1')

# echo $UUID

01c86d33:50403b16:40e0da06:a7e3510a

DEVICES=$(cat /proc/partitions | parallel --tagstring {5} --colsep ' +' mdadm -E /dev/{5} |grep $UUID | parallel --colsep '\t' echo /dev/{1})

```
   mdadm: cannot open /dev/: Invalid argument

   mdadm: cannot open /dev/: Invalid argument

ram0   mdadm: No md superblock detected on /dev/ram0.

ram1   mdadm: No md superblock detected on /dev/ram1.

ram2   mdadm: No md superblock detected on /dev/ram2.

ram3   mdadm: No md superblock detected on /dev/ram3.

ram4   mdadm: No md superblock detected on /dev/ram4.

ram5   mdadm: No md superblock detected on /dev/ram5.

ram6   mdadm: No md superblock detected on /dev/ram6.

ram7   mdadm: No md superblock detected on /dev/ram7.

ram8   mdadm: No md superblock detected on /dev/ram8.

ram9   mdadm: No md superblock detected on /dev/ram9.

ram10   mdadm: No md superblock detected on /dev/ram10.

ram11   mdadm: No md superblock detected on /dev/ram11.

ram12   mdadm: No md superblock detected on /dev/ram12.

ram13   mdadm: No md superblock detected on /dev/ram13.

ram14   mdadm: No md superblock detected on /dev/ram14.

ram15   mdadm: No md superblock detected on /dev/ram15.

sda2   mdadm: No md superblock detected on /dev/sda2.

sdb3   mdadm: No md superblock detected on /dev/sdb3.

sdl1   mdadm: No md superblock detected on /dev/sdl1.

```

# echo $DEVICES

```
/dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1
```

I do not have enough space for 8 x 4 TB of overlays, but the wiki says:

 *Quote:*   

> (usually 1% of the harddisk capacity is sufficient)

 

I mounted a 3000 GB ext4 filesystem at /mnt/3TB.

So now the part i dont do yet, waiting for some advice before running the commands.

I need to:

1) Create loop devices

```
parallel 'test -e /dev/loop{#} || mknod -m 660 /dev/loop{#} b 7 {#}' ::: $DEVICES
```

2) create an overlay file for each device

```
parallel truncate -s300G overlay-{/mnt/3TB/} ::: $DEVICES
```

3) Setup the loop-device and the overlay device

```
parallel 'size=$(blockdev --getsize {}); loop=$(losetup -f --show -- overlay-{/mnt/3TB/}); echo 0 $size snapshot {} $loop P 8 | dmsetup create {/}' ::: $DEVICES
```
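For reference, the line piped into dmsetup in step 3 is a device-mapper snapshot table. Its fields, annotated here with a hypothetical size and loop device (not output from this system), are:

```
# echo 0 $size snapshot {} $loop P 8   produces a table line like:
#
#   0 7813971633 snapshot /dev/sdc1 /dev/loop0 P 8
#   |  |         |        |         |          | `- chunk size in sectors
#   |  |         |        |         |          `- P = persistent COW store
#   |  |         |        |         `- copy-on-write device (the overlay)
#   |  |         |        `- origin device (the real RAID member)
#   |  |         `- dm target type
#   |  `- length of the mapping in 512-byte sectors (blockdev --getsize)
#   `- start sector
```

All writes go to the COW device; the origin is only read.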

4) I suppose after this, the overlay devices are going to be in /dev/mapper/*

Running this

```
 $ OVERLAYS=$(parallel echo /dev/mapper/{/} ::: $DEVICES)

 $ echo $OVERLAYS 
```

is going to tell me the mapper devices I should use,

and I can check the disk usage with dmsetup status.

After this, what should I run?

This?

```
mdadm --stop /dev/md* 

mdadm --assemble --force /dev/md0 /dev/mapper/overlay_sd[cdefgij]1
```

Replacing overlay_sd with the real names I get in step 4?

Thanks

----------

## NeddySeagoon

juampii,

Here are the important numbers. Both drives appear to be good:

```
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0

197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
```

When Current_Pending_Sector is non-zero, the drive knows about sectors it can no longer read. Such a drive is scrap, but smartctl -H will still say PASSED.
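A quick way to keep an eye on just those attributes is to filter the smartctl -A table. The filter below is shown running against sample attribute lines (hypothetical values) so it works without a real disk; on a live system you would feed it from smartctl -A /dev/sdX:

```shell
# Keep only the attributes that reveal unreadable/remapped sectors.
smart_filter() {
    grep -E 'Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector'
}

# Demonstration with sample lines; real use: smartctl -A /dev/sdX | smart_filter
printf '%s\n' \
  '  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0' \
  '  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12345' \
  '197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0' \
  | smart_filter
```

Any non-zero RAW_VALUE in the filtered output is worth investigating.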

Your 

```
Error: ICRC, ABRT at LBA = 0x00000000 = 0
```

is not one I've seen but both drives have it.

There is no point in getting your raid set online without fixing that.  It appears to occur in your raid set about once every two weeks, so that's not usable.

Google suggests that it may be BIOS or interface related.  

Please post your lspci and explain how each drive is connected to the system.

dmesg may be useful - please put it onto a pastebin.

----------

## jpp_

dmesg

https://pastebin.com/T9mh6nRA

lspci

```
00:00.0 Host bridge: Intel Corporation Xeon E5/Core i7 DMI2 (rev 07)

00:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1a (rev 07)

00:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 2a (rev 07)

00:03.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 3a in PCI Express Mode (rev 07)

00:03.2 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 3c (rev 07)

00:05.0 System peripheral: Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management (rev 07)

00:05.2 System peripheral: Intel Corporation Xeon E5/Core i7 Control Status and Global Errors (rev 07)

00:05.4 PIC: Intel Corporation Xeon E5/Core i7 I/O APIC (rev 07)

00:11.0 PCI bridge: Intel Corporation C600/X79 series chipset PCI Express Virtual Root Port (rev 06)

00:16.0 Communication controller: Intel Corporation C600/X79 series chipset MEI Controller #1 (rev 05)

00:19.0 Ethernet controller: Intel Corporation 82579V Gigabit Network Connection (rev 06)

00:1a.0 USB controller: Intel Corporation C600/X79 series chipset USB2 Enhanced Host Controller #2 (rev 06)

00:1c.0 PCI bridge: Intel Corporation C600/X79 series chipset PCI Express Root Port 1 (rev b6)

00:1c.1 PCI bridge: Intel Corporation C600/X79 series chipset PCI Express Root Port 2 (rev b6)

00:1c.2 PCI bridge: Intel Corporation C600/X79 series chipset PCI Express Root Port 3 (rev b6)

00:1c.3 PCI bridge: Intel Corporation C600/X79 series chipset PCI Express Root Port 4 (rev b6)

00:1c.4 PCI bridge: Intel Corporation C600/X79 series chipset PCI Express Root Port 5 (rev b6)

00:1c.5 PCI bridge: Intel Corporation C600/X79 series chipset PCI Express Root Port 6 (rev b6)

00:1c.7 PCI bridge: Intel Corporation C600/X79 series chipset PCI Express Root Port 8 (rev b6)

00:1d.0 USB controller: Intel Corporation C600/X79 series chipset USB2 Enhanced Host Controller #1 (rev 06)

00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a6)

00:1f.0 ISA bridge: Intel Corporation C600/X79 series chipset LPC Controller (rev 06)

00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 06)

00:1f.3 SMBus: Intel Corporation C600/X79 series chipset SMBus Host Controller (rev 06)

01:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750 Ti] (rev a2)

01:00.1 Audio device: NVIDIA Corporation Device 0fbc (rev a1)

02:00.0 PCI bridge: PLX Technology, Inc. PEX8112 x1 Lane PCI Express-to-PCI Bridge (rev aa)

03:04.0 Multimedia audio controller: C-Media Electronics Inc CMI8788 [Oxygen HD Audio]

04:00.0 VGA compatible controller: NVIDIA Corporation GM204 [GeForce GTX 980] (rev a1)

04:00.1 Audio device: NVIDIA Corporation GM204 High Definition Audio Controller (rev a1)

05:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9123 PCIe SATA 6.0 Gb/s controller (rev 11)

05:00.1 IDE interface: Marvell Technology Group Ltd. 88SE912x IDE Controller (rev 11)

08:00.0 USB controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller

09:00.0 USB controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller

0a:00.0 USB controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller

0b:00.0 USB controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller

0c:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 01)

0d:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 01)

ff:08.0 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link 0 (rev 07)

ff:08.3 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 0 (rev 07)

ff:08.4 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 0 (rev 07)

ff:09.0 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link 1 (rev 07)

ff:09.3 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 1 (rev 07)

ff:09.4 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 1 (rev 07)

ff:0a.0 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 0 (rev 07)

ff:0a.1 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 1 (rev 07)

ff:0a.2 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 2 (rev 07)

ff:0a.3 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 3 (rev 07)

ff:0b.0 System peripheral: Intel Corporation Xeon E5/Core i7 Interrupt Control Registers (rev 07)

ff:0b.3 System peripheral: Intel Corporation Xeon E5/Core i7 Semaphore and Scratchpad Configuration Registers (rev 07)

ff:0c.0 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)

ff:0c.1 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)

ff:0c.2 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)

ff:0c.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 0 (rev 07)

ff:0c.7 System peripheral: Intel Corporation Xeon E5/Core i7 System Address Decoder (rev 07)

ff:0d.0 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)

ff:0d.1 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)

ff:0d.2 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)

ff:0d.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 1 (rev 07)

ff:0e.0 System peripheral: Intel Corporation Xeon E5/Core i7 Processor Home Agent (rev 07)

ff:0e.1 Performance counters: Intel Corporation Xeon E5/Core i7 Processor Home Agent Performance Monitoring (rev 07)

ff:0f.0 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Registers (rev 07)

ff:0f.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller RAS Registers (rev 07)

ff:0f.2 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 0 (rev 07)

ff:0f.3 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 1 (rev 07)

ff:0f.4 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 2 (rev 07)

ff:0f.5 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 3 (rev 07)

ff:0f.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 4 (rev 07)

ff:10.0 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 0 (rev 07)

ff:10.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 1 (rev 07)

ff:10.2 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 0 (rev 07)

ff:10.3 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 1 (rev 07)

ff:10.4 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 2 (rev 07)

ff:10.5 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 3 (rev 07)

ff:10.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 2 (rev 07)

ff:10.7 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 3 (rev 07)

ff:11.0 System peripheral: Intel Corporation Xeon E5/Core i7 DDRIO (rev 07)

ff:13.0 System peripheral: Intel Corporation Xeon E5/Core i7 R2PCIe (rev 07)

ff:13.1 Performance counters: Intel Corporation Xeon E5/Core i7 Ring to PCI Express Performance Monitor (rev 07)

ff:13.4 Performance counters: Intel Corporation Xeon E5/Core i7 QuickPath Interconnect Agent Ring Registers (rev 07)

ff:13.5 Performance counters: Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 0 Performance Monitor (rev 07)

ff:13.6 System peripheral: Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 1 Performance Monitor (rev 07)
```

My mainboard has 8 sata ports. (Model is Asus rampage IV extreme)

1 - SSD with gentoo

2 - SSD with Windows for a VM

3 / 8 - 4TB Raid Members

and I have a PCIe SATA card with two 4 TB RAID member drives attached to it; maybe the two failing disks are the ones attached to it?

And I use the vfio-pci kernel driver to pass a VGA card through to the VM.

Maybe the error is related to an overclock I applied to the BCLK (it affects the PCIe ports); it is now reverted to the stable 4.5 GHz I've been using for years. Or maybe it's the SATA card.

----------

## frostschutz

If you scroll down on the wiki a little you should find two overlay create/remove functions that should be more convenient to use. (As long as you know what's happening.)

The mknod part should not be necessary on any modern udev system, loop devices appear as needed.

overlay-{/mnt/3TB/} is not correct; you should cd /mnt/3TB in the first place, or write /mnt/3TB/overlay-{/} ({/} is replaced by the basename of each item in the ::: parameter list, i.e. sdc1, sdd1, and so on). Read a bit of the parallel manpage and its examples to understand the special meaning of these {} placeholders.
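To illustrate (plain bash instead of parallel, with hypothetical device names), {/} is just the basename of each argument:

```shell
# What parallel's {/} placeholder expands to, written as a plain loop.
# /dev/sdc1 and /dev/sdd1 stand in for the real RAID member partitions.
for dev in /dev/sdc1 /dev/sdd1; do
    base=${dev##*/}                    # {/} -> basename, e.g. sdc1
    echo "/mnt/3TB/overlay-${base}"    # the overlay file path to create
done
# prints:
#   /mnt/3TB/overlay-sdc1
#   /mnt/3TB/overlay-sdd1
```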

----------

## jpp_

frostschutz, thanks for your time

So I omit step 1?

then 

```
cd /mnt/3TB
```

```
parallel truncate -s300G overlay-{/} ::: $DEVICES
```

```
parallel 'size=$(blockdev --getsize {}); loop=$(losetup -f --show -- overlay-{/}); echo 0 $size snapshot {} $loop P 8 | dmsetup create {/}' ::: $DEVICES
```

and then proceed trying to rebuild with the mapper devices. If everything goes fine, I should set the limits to 0, then stop the RAID, and then it is "safe" to run it with the real devices?

I do not really understand the overlay manipulation functions; are they a script I should run?

My knowledge is very limited.

Regards

----------

## jpp_

Neddy

Both devices are the ones connected to the PCIe SATA card (a cheap one).

Maybe it is related to the overclock of the BCLK bus (from 100 to 106, I believe).

And yesterday when it failed, I was in the Windows VM doing some 100% CPU task with MS Excel, when suddenly the music (mpd) started changing songs and some downloads said "no access to disk", and then I found I had no array.

A while ago, I think at 25, I saw some resync activity, but I'm so new to this: I looked at /proc/mdstat and believed everything was OK, just a resync because of a bad shutdown (my PC hangs when I'm trying to pass some devices through to the VM, and also because of the nvidia drivers going from a kernel with SMP to no-SMP), so I used SysRq to sync, stop and reboot, and everything seemed fine. Also, after that I disconnected the 8 drives until I finished the testing. I didn't pay close attention to the output of mdstat; surely one drive was already marked as spare at that time.

Update:

Also, I checked all the cables, and moved the cards around to "empty" the ports as follows:

Before: 

Port 1 x16_1: GTX750

Port 2 x8_2a: Empty

Port 3 x8_2b: Sound card

Port 4 x16/8_3: GTX980

Port 5 x1_1: Empty

Port 6 x8_4: Sata Card

Now:

Port 1 x16_1: GTX750

Port 2 x8_2a: GTX980

Port 3 x8_2b: Empty

Port 4 x16/8_3: Empty

Port 5 x1_1: Sound Card

Port 6 x8_4: Sata Card

So the SATA card is now "alone". But I do not think this was the problem; I think it is related to the BCLK overclock.

How should I proceed now, Neddy?

Do the overlay and try to assemble? Or assemble directly?

Update 2: I found this:

http://unix.stackexchange.com/questions/244419/marvell-88se9128-9123-sata-card-weird-behaviour-opensuse

Maybe disabling NCQ solves my problem?

Thanks

*Last edited by jpp_ on Wed Apr 05, 2017 5:45 pm; edited 1 time in total*

----------

## NeddySeagoon

juampii,

dmesg indicates that you have 18 SATA HDD connection points:

```
[    0.658333] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps 0x3f impl SATA mode

[    0.684311] ahci 0000:05:00.0: AHCI 0001.0200 32 slots 8 ports 6 Gbps 0xff impl SATA mode

[    0.685754] ahci 0000:0c:00.0: AHCI 0001.0200 32 slots 2 ports 6 Gbps 0x3 impl SATA mode

[    0.686354] ahci 0000:0d:00.0: AHCI 0001.0200 32 slots 2 ports 6 Gbps 0x3 impl SATA mode
```

lspci adds more information, in the same PCI bus number order:

```
00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 06)

05:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9123 PCIe SATA 6.0 Gb/s controller (rev 11)

0c:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 01)

0d:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 01) 
```

The Marvell 88SE9123 is probably only present on your motherboard to provide a PATA interface; the SATA ports, if they are really there, may not be wired, so you can't use them.

That leaves the two ASM1062 SATA ports on your plug-in card.  I hope that's at least a 4-lane PCIe card.

Following through some more: scsi host0..5, ata1..6 are on the Intel controller.

scsi host6..13,  ata7..14 are on the  Marvell  88SE9123 controller.

scsi host14 and 15,  ata15 and 16 are on the ASMedia at bus  0c:00.0

scsi host16 and 17,  ata17 and 18 are on the ASMedia at bus  0d:00.0

We see that ata3..6 have WDC WD4004FZWX attached.  That's the Intel controller.

ata7 and ata8 on the Marvell  88SE9123 have WD4004FZWX attached

and ata15 and ata16 on the ASMedia at bus 0c:00.0 also have WD4004FZWX drives attached

That's your eight 4TB HDDs.

The sd xy:0:0:0 numbers in dmesg correspond to the scsi host numbers above. They don't always, but you don't have any PATA drives.

Therefore /dev/sd[cdef] are attached to the Intel controller.

/dev/sd[gh] are on the Marvell controller

/dev/sd[ij] are on the ASMedia at bus 0c:00.0

It appears that both your problem drives are on the Marvell Controller. That points to an issue with the controller and reinforces the view from the smart data that the drives are OK.

Abandon the Marvell SATA Controller before you do much more.

----------

## jpp_

I can boot the OS from the Marvell controller and attach the 8 RAID members to the onboard controllers. If I do that,

what should I do afterwards? Is it safe to use it without RAID, or should I avoid using it?

This is until I get a new card (not Marvell).

Thanks.

----------

## NeddySeagoon

juampii,

It's not clear to me which is the plug-in controller, the Marvell or the ASMedia.

You have 8 SATA ports on the motherboard.  The Intel chip only provides 6.

As the Marvell controller also provides an IDE interface,

```
05:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9123 PCIe SATA 6.0 Gb/s controller (rev 11)

05:00.1 IDE interface: Marvell Technology Group Ltd. 88SE912x IDE Controller (rev 11) 
```

I suspect that it's two of the onboard SATA ports that need to not be used.

Unplug your plug-in SATA card and check what goes away when you run lspci.

Either 

```
05:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9123 PCIe SATA 6.0 Gb/s controller (rev 11)

05:00.1 IDE interface: Marvell Technology Group Ltd. 88SE912x IDE Controller (rev 11) 
```

or 

```
0c:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 01)

0d:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 01) 
```

I suspect your plug-in card also has USB3 ports, so

```
08:00.0 USB controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller

09:00.0 USB controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller

0a:00.0 USB controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller

0b:00.0 USB controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller 
```

will vanish too.

If your plug-in card is the ASMedia, it appears to have a further two SATA ports.

They may not be easy to use: if they are actually present, they may be headers, or wired to the backplate as eSATA ports.

Personally, I would not use the Marvell SATA interfaces for anything.

----------

## jpp_

Hi Neddy

The ASMedia ones are from the motherboard. There is no IDE interface on this motherboard.

The Marvell is the plug-in card.

The USB 3 ports are from the motherboard, the same ASMedia chip I believe; the same goes for eSATA hard drives, it has two eSATA ports.

This is the plugin card.

https://www.nisuta.com/images/productos/grandes/NSPLPCIES3.jpg

I'm going to replace it with a better one, maybe an M1015 or M1515.

But for the moment, and to make a backup:

I can put the OS (previous backup) on the Marvell "controller" and the 8 RAID members on the motherboard.

What should I do after that to have the array working again and in a safe state?

----------

## NeddySeagoon

juampii,

That image is a single lane PCIe card.  A single SATA3 port is 6Gb/s. A single PCIe lane is 5Gb/sec.

It may not matter, as the head/platter data rate for a HDD is about 120MB/sec or about 1Gbit/sec.  

Ideally, you need a PCIe x 4 card if you have a card slot for it.

It's good to know that all the onboard hardware is OK.

Follow frostschutz's advice with overlays to explore the state of your raid.

As you will only be working with 7 members of your raid set with the overlay, I would be tempted to only use the onboard SATA ports and not connect the raid member you don't want.

Either that, or wait for your new hardware.

----------

## jpp_

I am in the process of creating the overlay with 7 devices attached to the mainboard, but I get

/bin/bash: dmsetup: command not found

I'm trying to find which package provides the command, but I found nothing.

----------

## Hu

```
 * Searching for dmsetup ... 

sys-fs/lvm2-2.02.145-r2 (/sbin/dmsetup)

```

----------

## jpp_

Thanks Hu.

I get another error now.

I mounted the 3 TB HD at /mnt/3TB

cd /mnt/3TB

UUID=$(mdadm -E /dev/sdd1|perl -ne '/Array UUID : (\S+)/ and print $1')

echo $UUID

```
01c86d33:50403b16:40e0da06:a7e3510a
```

DEVICES=$(cat /proc/partitions | parallel --tagstring {5} --colsep ' +' mdadm -E /dev/{5} |grep $UUID | parallel --colsep '\t' echo /dev/{1})

```

   mdadm: cannot open /dev/: Invalid argument

   mdadm: cannot open /dev/: Invalid argument

ram0   mdadm: No md superblock detected on /dev/ram0.

ram1   mdadm: No md superblock detected on /dev/ram1.

ram2   mdadm: No md superblock detected on /dev/ram2.

ram3   mdadm: No md superblock detected on /dev/ram3.

ram4   mdadm: No md superblock detected on /dev/ram4.

ram5   mdadm: No md superblock detected on /dev/ram5.

ram6   mdadm: No md superblock detected on /dev/ram6.

ram7   mdadm: No md superblock detected on /dev/ram7.

ram8   mdadm: No md superblock detected on /dev/ram8.

ram9   mdadm: No md superblock detected on /dev/ram9.

ram10   mdadm: No md superblock detected on /dev/ram10.

ram11   mdadm: No md superblock detected on /dev/ram11.

ram12   mdadm: No md superblock detected on /dev/ram12.

ram13   mdadm: No md superblock detected on /dev/ram13.

ram14   mdadm: No md superblock detected on /dev/ram14.

ram15   mdadm: No md superblock detected on /dev/ram15.

sda2   mdadm: No md superblock detected on /dev/sda2.

sdj1   mdadm: No md superblock detected on /dev/sdj1.
```

echo $DEVICES

```
/dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1
```

parallel truncate -s300G overlay-{/} ::: $DEVICES

and the error

parallel 'size=$(blockdev --getsize {}); loop=$(losetup -f --show -- overlay-{/}); echo 0 $size snapshot {} $loop P 8 | dmsetup create {/}' ::: $DEVICES

```
device-mapper: reload ioctl on sdb1 failed: Invalid argument

Command failed

device-mapper: reload ioctl on sdc1 failed: Invalid argument

Command failed

device-mapper: reload ioctl on sdf1 failed: Invalid argument

Command failed

device-mapper: reload ioctl on sdg1 failed: Invalid argument

Command failed

device-mapper: reload ioctl on sdd1 failed: Invalid argument

Command failed

device-mapper: reload ioctl on sde1 failed: Invalid argument

Command failed

device-mapper: reload ioctl on sdh1 failed: Invalid argument

Command failed
```

Edit:

I changed

parallel 'size=$(blockdev --getsize {}); loop=$(losetup -f --show -- overlay-{/}); echo 0 $size snapshot {} $loop P 8 | dmsetup create {/}' ::: $DEVICES

to

parallel 'dmsetup create {/}' ::: $DEVICES

And now it's "working".

I hope I'm not doing any damage.

Edit 2:

Is it normal for the process to be slow?

It is still working, with low HDD activity, and part of it is from ext4lazyinit.

I see (with iotop) an actual disk write of 7.6 / 7.7 M/s.

Or do I need to do the next step in another terminal while this thing is running?

----------

## jpp_

After no success, I tried this:

mdadm --stop /dev/md0

```
mdadm: stopped /dev/md0
```

mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1

```
mdadm: forcing event count in /dev/sdb1(5) from 81087 upto 81935

mdadm: /dev/md0 assembled from 7 drives - not enough to start the array.

```

cat /proc/mdstat 

```
Personalities : [linear] [raid0] [raid1] [raid10] 

md0 : active raid10 sdf1[3] sdd1[1] sde1[2] sdb1[5] sdc1[0] sdh1[7] sdg1[6]

      15627452416 blocks super 1.2 512K chunks 2 far-copies [8/7] [UUUU_UUU]

      bitmap: 11/117 pages [44KB], 65536KB chunk

unused devices: <none>
```

So is it OK to mount and fsck the filesystem?

Edit: I rebooted and the array was automounted at start; I entered the folder and everything seems OK, but I unmounted it before doing any changes.

What should I do before backing up the data? --scan? fsck?

I see this in dmesg:

dmesg | grep md0

```
[    3.031223] md: md0 stopped.

[    3.039893] md/raid10:md0: active with 7 out of 8 devices

[    3.054787] md0: detected capacity change from 0 to 16002511273984

[    4.058146] EXT4-fs warning (device md0): ext4_clear_journal_err:4692: Filesystem error recorded from previous mount: IO failure

[    4.058147] EXT4-fs warning (device md0): ext4_clear_journal_err:4693: Marking fs in need of filesystem check.

[    4.072567] EXT4-fs (md0): warning: mounting fs with errors, running e2fsck is recommended

[    4.144954] EXT4-fs (md0): recovery complete

[    4.156672] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)

```

Thanks

----------

## NeddySeagoon

juampii,

The --force was a bad thing to do.  You know one of your drives was out of sync, but you don't know the damage it did to your data.

The overlay was designed to let you look at that before you did --force, or to copy out good files if damage was extensive.

You have already written to the drives.  

--assemble --force wrote the raid metadata. 

fsck, if it finds anything at all, may make a bad situation worse.  It says nothing about user data, only that the filesystem metadata is self-consistent.

It can destroy user data in the process, as the assumptions it makes about the filesystem are not always correct.

If you want to run fsck, be sure you have a way to undo it. 

The overlay would have allowed you to test without actually writing to your raid drives.

If you attempt to mount the filesystem, even read only, the journal will be replayed.  With one drive out of date the journal may not be consistent and could add to the damage.
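(A side note: ext4 also has a noload mount option that skips journal replay entirely; combined with an overlay or an image of the drives, it lets you peek at the filesystem without that write. A sketch, where /mnt/raid is a placeholder mountpoint:)

```
# Mount read-only WITHOUT replaying the ext4 journal (requires root).
# Only safe for inspection; files touched by the unreplayed journal
# may appear stale or inconsistent.
mount -o ro,noload /dev/md0 /mnt/raid
```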

The most important thing with a failing HDD or a failed raid set is to do nothing you can't undo, hence you experiment with the overlay or a complete image of the raid because you only get one chance with the real thing. 

It's my opinion that you should restore from your out-of-date backups, but others may have other ideas.

----------

## jpp_

Hi neddy.

Yes, I know it was a bad idea.

My plan is to create a full backup, using first the outdated backup and adding the rest after testing; so far, the new files are OK, and also the most recent ones.

I'm interested in knowing what I was doing wrong with the overlay process, so I can learn to do it properly.

----------

## frostschutz

There ought to be a utility (in coreutils or lvm, or wherever) dedicated to creating overlays. They are extremely useful, yet the method described in the wiki seems hard to follow (it works for me exactly as described; not sure what you did there).

What's done is done, hope you have regained access to some/most of your files.

Luck

```

# cd /dev/shm/

# truncate -s 8G foobar.img

# losetup --find --show foobar.img

/dev/loop0

# mkfs.ext4 /dev/loop0

# devices="/dev/loop0"

# overlay_create

free 7924M

/dev/loop0 8192M /dev/loop1 /dev/mapper/loop0

# file -s /dev/loop0

/dev/loop0: Linux rev 1.0 ext4 filesystem data, UUID=a27cd01f-4a05-4dbe-989a-329eb8ac4f3d (extents) (64bit) (large files) (huge files)

# file -s /dev/mapper/loop0 

/dev/mapper/loop0: Linux rev 1.0 ext4 filesystem data, UUID=a27cd01f-4a05-4dbe-989a-329eb8ac4f3d (extents) (64bit) (large files) (huge files)

# mkfs.xfs -f /dev/mapper/loop0

# file -s /dev/loop0

/dev/loop0: Linux rev 1.0 ext4 filesystem data, UUID=a27cd01f-4a05-4dbe-989a-329eb8ac4f3d (extents) (64bit) (large files) (huge files)

# file -s /dev/mapper/loop0

/dev/mapper/loop0: SGI XFS filesystem data (blksz 4096, inosz 512, v2 dirs)

# overlay_remove 

/dev/mapper/loop0

loop0.ovr

/dev/loop1

# overlay_create

free 7924M

/dev/loop0 8192M /dev/loop1 /dev/mapper/loop0

# file -s /dev/mapper/loop0 

/dev/mapper/loop0: Linux rev 1.0 ext4 filesystem data, UUID=a27cd01f-4a05-4dbe-989a-329eb8ac4f3d (extents) (64bit) (large files) (huge files)

```

----------

## jpp_

Good news: every new file (not in the backup) works. There are photos, documents, spreadsheets, videos, tables, ISOs, backups, tars, everything. Thanks all for the help.

I believe the problem was the adapter. I ordered an IBM M1115 and the cables; it should arrive in a week or so, and I'm going to rebuild the array there.

But maybe the fault was not the adapter. I'm reading the RAID wiki page and I found this: https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

And I'm confused about the output from my disks.

smartctl -l scterc /dev/sdd

```
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.10.5-gentoo] (local build)

Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control command not supported
```

But in smartctl -a I found:

```
SCT capabilities:           (0x0035)   SCT Status supported.

               SCT Feature Control supported.

               SCT Data Table supported.
```

What does that mean? I can't control SCT ERC, but SCT is supported?

Any idea how I can check if the disks have it enabled?

I got the WD Black ones thinking they perform better than the Reds, but maybe they are not the best for RAID.

Regards

----------

## frostschutz

 *juampii wrote:*   

> But maybe, the fault was not the adapter, Im reading at the raid wiki page and i found this: https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

 

I disagree with that wiki article. (In particular "the drive can't read the data" ... "glitches like this are normal").

You don't want drives in your raid array that "can't read the data". At all. Ever. Two of those will kill your RAID.

A disk that gets stuck for >30 seconds and refuses to respond to writes as well deserves to be kicked.

Your problem was not a Timeout Mismatch. It was not noticing the first failure at all, and not acting until the 2nd failure took down your array. If you want to investigate the cause more specifically, you'd need dmesg output / system logs from the time of both failures.

If you want to avoid raid failures, crank up your disk monitoring, smart selftests, and mail notifications for everything. Not noticing RAID failure is not acceptable.

Timeouts, not timeouts, it makes no difference.
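A minimal sketch of such a monitoring setup, assuming mdadm's monitor mode and smartd (the mail address and device name are placeholders; service names vary by distro):

```
# /etc/mdadm.conf -- mdadm --monitor (or the mdadm/mdmonitor service)
# mails on DegradedArray, Fail, SparesMissing, etc.:
MAILADDR admin@example.com

# /etc/smartd.conf -- monitor all attributes, short self-test every
# night at 02:00, long self-test on the 1st of the month at 03:00,
# mail on any failure or new pending sectors:
/dev/sdc -a -o on -S on -s (S/../.././02|L/../01/./03) -m admin@example.com
```

Then actually read the mails; the monitoring only helps if the first failure gets acted on.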

----------

## NeddySeagoon

juampii,

Your errors were both 

```
Error: ICRC, ABRT at LBA = 0x00000000 = 0
```

This indicates an Interface Cyclic Redundancy Check error.

It means the drive read the data OK but a problem was detected transferring the data from the drive(s) to the system.

That does not mean the problem was real ... just that a problem was detected.

According to Google, Marvell controllers have a history of firmware updates to address these errors.

That's OK when it's a system BIOS update.  I don't know how, or if, the firmware on your card can be updated.

There is a small chance it could be the data cables ... but two data cables on the same interface.  That's unlikely.

I don't agree with that Wiki article either ... but timeouts were not your problem.

----------

## jpp_

Thanks very much to you both.

Yes frostschutz, I know it was my fault and lack of experience, but hey, I can learn from this.

Have a nice day!

----------

