# mdadm error log -- does it exist?

## mbar

I had one HDD crash a few minutes ago. It's RAID5, so no worries about the data (yet ;) ). Right now I have only remote SSH access to the server.

I'd like to know what caused the crash and whether the drive will be recoverable after a power cycle of the machine (it may just as well be a broken SATA cable, which wouldn't be the first time for me). The faulty HDD got kicked out of the RAID, and it is also not responding to anything:

```
Sep 29 10:41:57 [kernel] md/raid:md0: Disk failure on sdg1, disabling device.
Sep 29 10:41:57 [kernel] md/raid:md0: Operation continuing on 5 devices.
Sep 29 10:41:57 [sSMTP] Creating SSL connection to host
Sep 29 10:41:57 [sSMTP] SSL connection using DHE_RSA_AES_128_CBC_SHA1
Sep 29 10:41:59 [sSMTP] Sent mail for aaaa(221 2.0.0 Bye) uid=0 username=root outbytes=1015
Sep 29 10:41:59 [mdadm] Fail event detected on md device /dev/md0
Sep 29 10:41:59 [sSMTP] Creating SSL connection to host
Sep 29 10:41:59 [sSMTP] SSL connection using DHE_RSA_AES_128_CBC_SHA1
Sep 29 10:42:00 [sSMTP] Sent mail for aaaa (221 2.0.0 Bye) uid=0 username=root outbytes=1069
Sep 29 10:42:00 [mdadm] Fail event detected on md device /dev/md0, component device /dev/sdg1
```

```
smartctl -a /dev/sdg
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0.4-gentoo] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               /8:0:0:0
Product:
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
```

I wouldn't say no to 600 PB... unfortunately it's in an alternate universe now ;)

```
cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md2 : active raid0 sdc1[0] sdb1[1]
      1953518848 blocks super 1.2 128k chunks

md0 : active raid5 sde1[0] sdd1[6] sdg1[3](F) sdf1[4] sdh1[1] sda1[5]
      7325680640 blocks super 1.2 level 5, 256k chunk, algorithm 2 [6/5] [UU_UUU]
      bitmap: 2/11 pages [8KB], 65536KB chunk

unused devices: <none>
```

But I also noticed:

```
cat /sys/block/md0/md/dev-sdg1/errors
16
```

Is there any way to check what kind of errors those 16 were? This would help me decide whether I have to buy a new drive on my way home :)
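
For reference, a minimal sketch of pulling the failed members out of `/proc/mdstat` when all you have is an SSH session. The function name `failed_members` is made up for illustration, and it assumes the `(F)` marker format shown in the mdstat output above. It only tells you which member was kicked, not why.

```shell
# List md members flagged (F) in mdstat-formatted text read from stdin.
# Prints "array member" pairs, one per line.
failed_members() {
    awk '
        $2 == ":" && $3 == "active" {
            for (i = 4; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    member = $i
                    sub(/\[.*/, "", member)   # sdg1[3](F) -> sdg1
                    print $1, member
                }
            }
        }
    '
}

# Usage on a live system:
#   failed_members < /proc/mdstat
```

Fed the mdstat output from this post, it should print `md0 sdg1`.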

----------

## jbest

My raid5 array and my raid1 array failed this morning, too, in a very similar way to yours:

```
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0-ARCH] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               /3:0:1:0
Product:
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
```

```
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0-ARCH] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               /3:0:0:0
Product:
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
```

What's different, though, is I don't have any errors on any of the partitions that were part of the array:

```
# cat /sys/block/md127/md/dev-sda2/errors
0
# cat /sys/block/md127/md/dev-sdb2/errors
0
# cat /sys/block/md127/md/dev-sdc2/errors
0
```

And the raid1 array:

```
# cat /sys/block/md126/md/dev-sda1/errors
0
# cat /sys/block/md126/md/dev-sdb1/errors
0
# cat /sys/block/md126/md/dev-sdc1/errors
0
```

This is the second time in a week that this has happened, and I can't figure out why. I'd call it plain drive failure, but I know sdb and sdc (the drives that "failed") are on the same SATA controller. I purchased sda and sdc at the same time, and sdb is only two weeks old. That doesn't really add up for me...

dmesg has a lot of information, but I have no idea where to go from here with it:

http://pastebin.com/UuUq2mg1

Raid info:

http://pastebin.com/SM8d3d0w

----------

## drescherjm

I see this when a drive has too many UREs and goes completely offline trying to fix them, or when a drive totally dies. This seems to happen a few times a year with my arrays based on Seagate 7200.10 and 7200.11 drives. I have now moved all arrays to RAID 6 and monitor 5 key SMART parameters to better predict drive failure. BTW, when I have had a failure where more than 2 drives were kicked out of a RAID 6, I was able to recover by using ddrescue to copy the readable parts of the kicked-out drives to new disks.

----------

## jbest

 *drescherjm wrote:*   

> I see this when a drive has too many UREs and the drive goes completely offline trying to fix them. This seems to happen a few times a year with my arrays based on Seagate 7200.10 and 7200.11 drives.

 

Forgive my ignorance, but URE?  I'm guessing "RE" is read error, but I can't figure out what the "U" is for.  It's too early in the morning for me.

FWIW, these are all 2TB Seagate 5900RPM "green" drives.

----------

## drescherjm

I believe it stands for unrecoverable read error. These show up as Current_Pending_Sector and/or Offline_Uncorrectable in SMART.

BTW, here is a link to my script that checks the smart params:

https://raw.github.com/drescherjm/jmdgentoooverlay/master/Other/shell-scripts/examine_mdraid.sh

----------

## drescherjm

As for the OP's question: I do not believe a log exists for this. However, most of the time you will see errors for the drive in your dmesg output, more than what you posted.
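
A minimal sketch of digging those lines out of the kernel log, assuming the log text arrives on stdin. The function name `disk_log_lines` is hypothetical. It keeps lines that name the disk or one of its partitions, plus lines carrying a libata-style `ataN:` or `ataN.MM:` prefix. That prefix pattern is an assumption about how the SATA layer usually tags its messages, and it matches every ATA port, not only the one the failed disk sits on.

```shell
# Print log lines from stdin that mention the given disk (e.g. sdg) or one
# of its partitions, plus lines with an "ataN:" / "ataN.MM:" prefix.
disk_log_lines() {
    dev=$1
    grep -E "(^|[^[:alnum:]])${dev}[0-9]*([^[:alnum:]]|\$)|(^|[^[:alnum:]])ata[0-9]+(\.[0-9]+)?:"
}

# Usage on a live system:
#   dmesg | disk_log_lines sdg
```

Run against the syslog excerpt in the first post, this keeps only the two lines that name sdg1 and drops the sSMTP chatter.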

----------

## jbest

 *drescherjm wrote:*   

> BTW, here is a link to my script that checks the smart params:
> 
> https://raw.github.com/drescherjm/jmdgentoooverlay/master/Other/shell-scripts/examine_mdraid.sh

 

Excellent! Thanks!

I just bought a backup drive; I'll rsync all of the data off with a livecd tonight and go from there.

Cheers!

----------

## drescherjm

A note about that script: for some manufacturers (like Seagate), some of the params may be bogus. You will know when a value is something like 5443455 and you are expecting 10.

BTW, I did not explain exactly what the script does. It enumerates all /dev/sd devices, checks whether each device is in any of your mdadm arrays, and prints 5 key SMART params for the drive. I use this at work for my 75 to 100 drives in mdadm arrays. I also use Nagios to monitor the temps and the reallocated sector count for each drive.
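
The idea can be sketched roughly as follows. This is not the linked script, only an illustration under stated assumptions: the function names `md_member_disks` and `key_smart_attrs` are made up, and the five attributes picked here are a guess at what "key" means, not necessarily the ones the script prints. Both functions read text on stdin so they can be tried on saved output.

```shell
# Whole-disk names (sda, sdb, ...) of all md members listed in
# mdstat-formatted text on stdin, de-duplicated.
md_member_disks() {
    awk '$2 == ":" && ($3 == "active" || $3 == "inactive") {
             for (i = 4; i <= NF; i++)
                 if ($i ~ /^sd[a-z]+[0-9]*\[/) {
                     d = $i
                     sub(/[0-9]*\[.*/, "", d)   # sdg1[3](F) -> sdg
                     print d
                 }
         }' | sort -u
}

# Keep selected attribute rows from "smartctl -A" output on stdin and print
# "name raw_value". The attribute list is an assumption, adjust to taste.
key_smart_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count|Temperature_Celsius)$/ {
             print $2, $10
         }'
}

# Usage on a live system (needs root for smartctl):
#   for d in $(md_member_disks < /proc/mdstat); do
#       echo "== $d"
#       smartctl -A "/dev/$d" | key_smart_attrs
#   done
```

Printing the raw value column rather than the normalized one keeps the bogus vendor-specific numbers mentioned above visible instead of hiding them.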

----------

## mbar

Thanks for all your replies!

Fortunately for me, it was only a matter of reseating the SATA cable of the failed drive. After that it showed up fine and smartctl did not find any errors. Also, the write-intent bitmap saved me from 1.5 days of resync :)

----------

## mbar

 *jbest wrote:*   

> This is the second time in a week that this has happened, can't figure out why this is going on. I'd call it just plain drive failures, but I know sdb and sdc (the drives that "failed") are on the same SATA controller

 

I'm almost sure it's the SATA controller or cable that causes your trouble.

----------

## drescherjm

 *mbar wrote:*   

> Thanks for all your replies!
> 
> Fortunately for me, it was only a matter of reseating the SATA cable of the failed drive. After that it showed up fine and smartctl did not find any errors.

 

Not even UDMA_CRC_Error_Count?

 *Quote:*   

> Also, the write-intent bitmap saved me from 1.5 days of resync

 

Slow machine? At work it takes me less than 9 hours to resync a 9-drive (2TB 7200 RPM Hitachi 7K2000) mdadm RAID 6 on a 3-year-old Core 2 Quad. However, most of the drives are connected to an Intel SASUC8I HBA card.

----------

## mbar

Yeah, you are right:

```
UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       23
```

That's the highest of all my drives (the others are in the 0 to 4 UDMA error range).

Slow machine? It has a simple desktop motherboard (Nvidia + AMD Phenom X3, onboard SATA + 2 PCIe SATA "dumb" Silicon Image controllers), and the hard drives are Samsung HD154UI (5400 RPM), so it's not exactly an I/O speed demon :)
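
A rough back-of-the-envelope check of the two resync times mentioned in this exchange, assuming a full resync has to pass over each member disk once, so the time is roughly member size divided by sustained rate. The function name is hypothetical. The 36 hours is 1.5 days, the member size comes from the mdstat output earlier in the thread (7325680640 blocks of 1 KiB spread over the 5 data members of the 6-disk RAID5), and "2TB" is taken as 2 * 10^12 bytes.

```shell
# Average per-member rate in KiB/s needed to cover a member of
# size_kib KiB once in the given number of whole hours.
resync_rate_kib_s() {
    size_kib=$1
    hours=$2
    echo $(( size_kib / (hours * 3600) ))
}

# mbar's array: member size from mdstat, 36 hours
resync_rate_kib_s $(( 7325680640 / 5 )) 36          # prints 11305

# drescherjm's array: 2 TB member (assumed 2 * 10^12 bytes), 9 hours
resync_rate_kib_s $(( 2000000000000 / 1024 )) 9     # prints 60281
```

That works out to about 11 MiB/s against about 59 MiB/s per member. Since the 9 hours is an upper bound ("less than 9 hours"), the real gap may be somewhat larger.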

----------

## mbar

fcuk, it happened again, in the same funny way:

```
cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md0 : active raid5 sdg1[3](F) sdh1[1] sda1[5] sdf1[4] sdd1[6] sde1[0]
      7325680640 blocks super 1.2 level 5, 256k chunk, algorithm 2 [6/5] [UU_UUU]
      bitmap: 11/11 pages [44KB], 65536KB chunk

md2 : active raid0 sdb1[1] sdc1[0]
      1953518848 blocks super 1.2 128k chunks
```

```
smartctl -a /dev/sdg
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0.4-gentoo-r2] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               /8:0:0:0
Product:
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
```

Time to replace those dumb Silicon Image controllers; it seems that one of them is failing.

Last time the failed drive made it through the SMART test with a "drive good" result.

----------

## mbar

This time, maybe not such "good health":

```
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       12
```

but:

```
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
```

Should I be very concerned? I had mainly Offline_Uncorrectable failures in the past, and those are definite ;)

Anyway, here is the almost complete log:

```
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x03)   Offline data collection activity
               is in progress.
               Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)   The previous self-test routine completed
               without error or no self-test has ever
               been run.
Total time to complete Offline
data collection:       (19188) seconds.
Offline data collection
capabilities:           (0x7b) SMART execute Offline immediate.
               Auto Offline data collection on/off support.
               Suspend Offline collection upon new
               command.
               Offline surface scan supported.
               Self-test supported.
               Conveyance Self-test supported.
               Selective Self-test supported.
SMART capabilities:            (0x0003)   Saves SMART data before entering
               power-saving mode.
               Supports SMART auto save timer.
Error logging capability:        (0x01)   Error logging supported.
               General Purpose Logging supported.
Short self-test routine
recommended polling time:     (   2) minutes.
Extended self-test routine
recommended polling time:     ( 255) minutes.
Conveyance self-test routine
recommended polling time:     (  33) minutes.
SCT capabilities:           (0x003f)   SCT Status supported.
               SCT Error Recovery Control supported.
               SCT Feature Control supported.
               SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   072   072   011    Pre-fail  Always       -       9140
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1386
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       11089
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       7815
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       268
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       12
184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   056   000    Old_age   Always       -       26 (Min/Max 21/26)
194 Temperature_Celsius     0x0022   068   054   000    Old_age   Always       -       32 (Min/Max 21/32)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       40019
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       27
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Offline             Self-test routine in progress 80%      7815         -
# 2  Offline             Aborted by host               90%      7814         -
# 3  Short offline       Completed without error       00%      7732         -
# 4  Offline             Completed without error       00%      7626         -
# 5  Extended offline    Interrupted (host reset)      40%      7522         -
# 6  Extended offline    Completed without error       00%      6225         -
# 7  Extended offline    Completed without error       00%      3942         -
# 8  Extended offline    Completed without error       00%      3649         -
# 9  Short offline       Completed without error       00%      3631         -
#10  Offline             Completed without error       00%      3138         -
#11  Offline             Completed without error       50%      3112         -
#12  Offline             Aborted by host               10%      1866         -
#13  Offline             Completed without error       00%      1181         -
#14  Short offline       Completed without error       00%       425         -
#15  Short offline       Completed without error       00%       425         -
#16  Short offline       Completed without error       00%        13         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
```

----------

## mbar

 *mbar wrote:*   

> Time to replace those dumb Silicon Image controllers; it seems that one of them is failing.

 

Yes, the SATA controller was faulty; a new one solved the problem.

----------

