# [Solved] Debug RAID Array Error Messages

## jasn

I've got a 3-drive RAID5 array in my system. Checking the array with cat /proc/mdstat and mdadm -D shows that everything is synchronized and fine, but I still get errors in my /var/log/messages file that indicate at least one regularly occurring read error with the array. Unfortunately, this is an array of 3 Seagate ST9500420AS drives, a model plenty of people have been complaining about because of a clicking and freezing problem: apparently the heads park, accompanied by a noticeable system pause, which is definitely happening on my system. My issue is that I don't know whether this is happening to all 3 of the drives or just to 1 of them. Is there some way to determine which drive is causing the read errors from the /var/log/messages entries?

Here's the pastebin link to the relevant section of my /var/log/messages.

Thanks

Last edited by jasn on Thu Oct 08, 2009 5:54 pm; edited 1 time in total

----------

## drescherjm

I see errors for ata1 and ata2. I would install smartmontools and run SMART tests on all drives. Also, if no drives are being kicked out of the array (which would leave it degraded), the errors are either being recovered or only happening during reads.
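To tie the ata1/ata2 names in the kernel log to /dev/sdX names, one option is to resolve each /sys/block entry and look for the ataN component in the device path. This is only a rough sketch: it assumes a kernel that includes the ATA port in the sysfs path (older kernels may not), and the function names and the sample path in the comment are made up for illustration.

```shell
# Extract the "ataN" component from a resolved sysfs device path.
# Assumed path shape (hypothetical example):
#   /sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sdb
ata_port_of_path() {
    printf '%s\n' "$1" | grep -o '/ata[0-9][0-9]*/' | head -n 1 | tr -d '/'
}

# Print "ataN sdX" for every sd* disk on this machine.
list_ata_ports() {
    for dev in /sys/block/sd*; do
        [ -e "$dev" ] || continue
        printf '%s %s\n' "$(ata_port_of_path "$(readlink -f "$dev")")" "${dev##*/}"
    done
}
```

Running list_ata_ports prints one line per disk. If the first column comes out empty, the kernel does not expose the port in the path and this approach does not apply.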

----------

## jasn

Thanks John,

I took your advice and ran the SMART tests, both the short and long versions. I didn't expect it to queue the tests and run them in the background. (I think there's an option to change that.) Anyway, I queued the short tests on each drive;

```
smartctl --test=short /dev/sda
smartctl --test=short /dev/sdb
smartctl --test=short /dev/sdc
```

and the long tests;

```
smartctl --test=long /dev/sda
smartctl --test=long /dev/sdb
smartctl --test=long /dev/sdc
```

and then I later came back and displayed the results;

```
smartctl -l selftest /dev/sda
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       965         -
# 2  Short offline       Completed without error       00%       960         -
# 3  Short offline       Completed without error       00%       877         -
# 4  Short offline       Completed without error       00%         0         -
```

```
smartctl -l selftest /dev/sdb
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       966         -
# 2  Short offline       Completed without error       00%       961         -
# 3  Short offline       Completed without error       00%       878         -
# 4  Short offline       Completed without error       00%         2         -
# 5  Short offline       Completed: read failure       90%         1         17405241
```

```
smartctl -l selftest /dev/sdc
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       965         -
# 2  Short offline       Completed without error       00%       960         -
# 3  Short offline       Completed without error       00%       877         -
# 4  Short offline       Completed without error       00%         1         -
```

So it looks like my problems are associated with /dev/sdb. I'm going to replace it and see if this addresses the issues.

Thanks again.
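Picking the failed entry out of three self-test logs by eye works, but the same check can be scripted. A minimal sketch, assuming the log format shown above; the function name failed_selftests is made up, and note that a test still in progress would also be printed because its status is not "Completed without error".

```shell
# Read "smartctl -l selftest" output on stdin and print every self-test
# entry whose status is anything other than "Completed without error".
failed_selftests() {
    grep '^# *[0-9]' | grep -v 'Completed without error'
}

# Usage (needs root and smartmontools installed):
#   for d in /dev/sda /dev/sdb /dev/sdc; do
#       echo "== $d"
#       smartctl -l selftest "$d" | failed_selftests
#   done
```

Against the three logs above, only /dev/sdb would print a line, the "Completed: read failure" entry at LBA 17405241.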

----------

## energyman76b

>raid5:md4: read error corrected (8 sectors at 631742912 on sdb4

this and 

ata2.00: error: { ABRT }

Sep 20 10:05:53 hostname kernel: [  610.438209]          res 41/40:38:87:15:80/dd:00:26:00:00/00 Emask 0x409 (media error

really tell the whole story. sdb has severe platter damage and is on its way to the green pasture. There was no need to touch SMART; errors like that are very obvious.
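Since the md messages name the member partition directly, counting them per device answers the original question of which drive is involved. A small sketch, assuming the message format quoted above; count_md_read_errors is a made-up name.

```shell
# Count "read error corrected" messages per RAID member device.
# Reads kernel log lines on stdin, prints "<count> <device>" per line,
# most frequent device first.
count_md_read_errors() {
    sed -n 's/.*read error corrected ([0-9]* sectors at [0-9]* on \([a-z0-9]*\).*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Usage: count_md_read_errors < /var/log/messages
```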

----------

## drescherjm

However there are also:

```
Sep 20 11:22:52 hostname kernel: [ 5226.948874] raid5:md4: read error corrected (8 sectors at 496328008 on sda4)
Sep 20 11:22:52 hostname kernel: [ 5226.948879] raid5:md4: read error corrected (8 sectors at 496328016 on sda4)
```

----------

## energyman76b

which just means that the array saw the error and reconstructed the missing data from the other disks. If any of the healthy disks fails, all data is lost.

The harddisk is toast. Replacement must happen ASAP.
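How the array "constructs the missing data" comes down to XOR parity. A toy sketch with one-byte blocks, which ignores real details such as chunk sizes and parity rotating across the disks; the byte values are arbitrary.

```shell
# Toy RAID5 stripe on three disks: two data blocks and one parity block.
d0=$(( 0xA5 ))          # data block on disk 0 (one byte, for illustration)
d1=$(( 0x3C ))          # data block on disk 1
parity=$(( d0 ^ d1 ))   # parity block on disk 2

# Disk 1 returns a read error: rebuild its block from the two survivors.
rebuilt=$(( d0 ^ parity ))
printf 'd1=0x%02X rebuilt=0x%02X\n' "$d1" "$rebuilt"

# If disk 0 failed as well, only the parity block would be left, and
# neither data block could be recovered from it alone.
```

The two printed values match, which is the single-disk recovery case. With a second block missing there is nothing left to XOR against.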

----------

## drescherjm

There was also:

```
Sep 20 11:22:38 hostname kernel: [ 5212.997979] ata1.00: error: { ABRT }
Sep 20 11:22:42 hostname kernel: [ 5216.290205]          res 41/40:00:f7:d0:6d/59:01:1e:00:00/00 Emask 0x409 (media error) <F>
Sep 20 11:22:42 hostname kernel: [ 5216.290212] ata1.00: error: { UNC }
Sep 20 11:22:47 hostname kernel: [ 5221.910626]          res 41/40:00:fc:d0:6d/59:01:1e:00:00/00 Emask 0x409 (media error) <F>
Sep 20 11:22:47 hostname kernel: [ 5221.910632] ata1.00: error: { UNC }
Sep 20 11:22:52 hostname kernel: [ 5226.924364]          res 41/40:00:fb:d0:6d/59:01:1e:00:00/00 Emask 0x409 (media error) <F>
```

Anyway, with errors on both drives, I thought it would be best to have the drives check themselves instead of saying that 2 drives out of 3 are toast.
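The "res" field in those lines also carries the failing sector. A sketch of decoding it, under the assumption that libata prints the result taskfile as status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device. That layout is my reading of the format rather than something stated in this thread, and decode_res_lba is a made-up name.

```shell
# Decode the 48-bit LBA from a libata "res" taskfile string.
# Assumed field layout (see the note above):
#   status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
decode_res_lba() {
    res=$1
    mid=${res#*/}          # drop "status/"
    low=${mid%%/*}         # error:nsect:lbal:lbam:lbah
    rest=${mid#*/}
    high=${rest%%/*}       # hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah
    old_ifs=$IFS
    IFS=:
    set -- $low $high      # split both groups into $1..$10
    IFS=$old_ifs
    # LBA48 = hob_lbah hob_lbam hob_lbal lbah lbam lbal, most significant first
    echo $(( 0x${10}$9$8$5$4$3 ))
}

decode_res_lba 41/40:00:f7:d0:6d/59:01:1e:00:00/00
```

Under that assumption the first ata1 line decodes to sector 510513399 on the whole disk. The md messages count sectors within the member partition instead, so the two numbers are not expected to be equal.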

----------

## energyman76b

Well, the part of the SMART tests quoted is not the most important part.

smartctl -a gives a lot more output, which can be a lot more enlightening.

No matter what, the system is busy papering over errors produced by sdb. As soon as any of the other disks produces errors too, the RAID array will be lost. RAID5 can only tolerate one bad disk, not two.

----------

## jasn

Thank you both for pointing these things out. (Would you believe that this is a new system?) Anyway my vendor is sending me the replacement drive as we speak, but I thought I would post the output of smartctl -a /dev/sdb here as well;

```
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST9500420AS
Serial Number:    5VJ0F19P
Firmware Version: 0002SDM1
User Capacity:    500,107,862,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Oct  2 00:52:04 2009 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (   0) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 109) minutes.
Conveyance self-test routine
recommended polling time:        (   3) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   105   092   006    Pre-fail  Always       -       10219475
  3 Spin_Up_Time            0x0003   098   097   085    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       47
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       21263879
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       985
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       47
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       166
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       4
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   048   034   045    Old_age   Always   In_the_past 52 (2 214 54 46)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   045   045   000    Old_age   Always       -       110352
194 Temperature_Celsius     0x0022   052   066   000    Old_age   Always       -       52 (0 23 0 0)
195 Hardware_ECC_Recovered  0x001a   041   035   000    Old_age   Always       -       10219475
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       218364727264016
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3771793122
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3025861912
254 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 124 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 124 occurred at disk power-on lifetime: 962 hours (40 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00      00:39:59.466  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:39:59.466  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:39:59.466  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:39:59.465  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:39:59.465  READ FPDMA QUEUED

Error 123 occurred at disk power-on lifetime: 962 hours (40 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00      00:39:54.874  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:39:54.874  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:39:54.874  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:39:54.873  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:39:54.865  READ FPDMA QUEUED

Error 122 occurred at disk power-on lifetime: 961 hours (40 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00   1d+05:31:13.752  READ FPDMA QUEUED
  60 00 c0 ff ff ff 4f 00   1d+05:31:13.120  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   1d+05:31:13.120  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   1d+05:31:13.120  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   1d+05:31:13.120  READ FPDMA QUEUED

Error 121 occurred at disk power-on lifetime: 961 hours (40 days + 1 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00   1d+05:23:39.746  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00   1d+05:23:39.745  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   1d+05:23:39.736  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   1d+05:23:39.736  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   1d+05:23:39.735  READ NATIVE MAX ADDRESS EXT

Error 120 occurred at disk power-on lifetime: 961 hours (40 days + 1 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00   1d+05:23:33.555  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   1d+05:23:33.555  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   1d+05:23:33.555  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   1d+05:23:33.553  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00   1d+05:23:33.552  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       966         -
# 2  Short offline       Completed without error       00%       961         -
# 3  Short offline       Completed without error       00%       878         -
# 4  Short offline       Completed without error       00%         2         -
# 5  Short offline       Completed: read failure       90%         1         17405241

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
```

----------

## energyman76b

187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       166

Those are errors that the built-in ECC could not correct and that were recognized by the drive's firmware. Bad, bad...

The good news:

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

So it hasn't run out of replacement sectors. Yet. Or the SMART data is bogus (it happens).

This is surprising:

5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0

So it had errors, and the system reports media errors, but the disk did not register them as such and did not reallocate the defective sectors. Oh, wait: reallocation happens on write, not read...

Or again, the SMART data is bogus.

Anyway, the disk is in trouble. Cable problems result in DMA CRC errors and the like, and UDMA_CRC_Error_Count is 0 here, so the cable is unlikely to be the cause.

And this looks either fishy or worrisome:

7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       21263879
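The attributes singled out above can be pulled from the full report with a small filter. A minimal sketch, assuming the ten-column attribute table shown earlier; key_smart_attrs is a made-up name, and the choice of IDs (5, 187, 197, 198, 199) just mirrors the ones discussed here rather than any standard list.

```shell
# Read "smartctl -A" (or "smartctl -a") output on stdin and print the ID,
# name and raw value of the attributes discussed in this post.
# The NF check skips other lines that happen to start with the same number,
# such as the selective self-test span table.
key_smart_attrs() {
    awk '$1 ~ /^(5|187|197|198|199)$/ && NF >= 10 { print $1, $2, $10 }'
}

# Usage (needs root and smartmontools installed):
#   smartctl -A /dev/sdb | key_smart_attrs
```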

----------

