# HDD problem or not?

## lostinspace2011

smartctl reports errors on one of my drives:

```
root@server:~# smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-24-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     Hitachi HDS721010DLE630
Serial Number:    MSE5235V0K8ZKU
LU WWN Device Id: 5 000cca 37cc7dc0a
Firmware Version: MS2OA5R0
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sun Jun 10 19:09:23 2012 WST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)   Offline data collection activity
               was never started.
               Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)   The previous self-test routine completed
               without error or no self-test has ever
               been run.
Total time to complete Offline
data collection:       ( 8283) seconds.
Offline data collection
capabilities:           (0x5b) SMART execute Offline immediate.
               Auto Offline data collection on/off support.
               Suspend Offline collection upon new
               command.
               Offline surface scan supported.
               Self-test supported.
               No Conveyance Self-test supported.
               Selective Self-test supported.
SMART capabilities:            (0x0003)   Saves SMART data before entering
               power-saving mode.
               Supports SMART auto save timer.
Error logging capability:        (0x01)   Error logging supported.
               General Purpose Logging supported.
Short self-test routine
recommended polling time:     (   1) minutes.
Extended self-test routine
recommended polling time:     ( 138) minutes.
SCT capabilities:           (0x003d)   SCT Status supported.
               SCT Error Recovery Control supported.
               SCT Feature Control supported.
               SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   099   099   016    Pre-fail  Always       -       65537
  2 Throughput_Performance  0x0005   140   140   054    Pre-fail  Offline      -       76
  3 Spin_Up_Time            0x0007   113   113   024    Pre-fail  Always       -       200 (Average 204)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       219
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   113   113   020    Pre-fail  Offline      -       35
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       525
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       218
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       219
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       219
194 Temperature_Celsius     0x0002   230   230   000    Old_age   Always       -       26 (Min/Max 18/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 34 (device log contains only the most recent five errors)
   CR = Command Register [HEX]
   FR = Features Register [HEX]
   SC = Sector Count Register [HEX]
   SN = Sector Number Register [HEX]
   CL = Cylinder Low Register [HEX]
   CH = Cylinder High Register [HEX]
   DH = Device/Head Register [HEX]
   DC = Device Command Register [HEX]
   ER = Error register [HEX]
   ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 34 occurred at disk power-on lifetime: 464 hours (19 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 60 19 08 00  Error: UNC at LBA = 0x00081960 = 530784
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 60 19 08 40 00      05:01:24.913  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00      05:01:24.913  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00      05:01:24.912  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      05:01:24.912  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      05:01:24.912  SET FEATURES [Set transfer mode]

Error 33 occurred at disk power-on lifetime: 464 hours (19 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 60 19 08 00  Error: UNC at LBA = 0x00081960 = 530784
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 60 19 08 40 00      05:01:21.709  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00      05:01:21.709  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00      05:01:21.709  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      05:01:21.708  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      05:01:21.708  SET FEATURES [Set transfer mode]

Error 32 occurred at disk power-on lifetime: 464 hours (19 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 60 19 08 00  Error: UNC at LBA = 0x00081960 = 530784
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 60 19 08 40 00      05:01:18.505  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00      05:01:18.505  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00      05:01:18.505  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      05:01:18.504  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      05:01:18.504  SET FEATURES [Set transfer mode]

Error 31 occurred at disk power-on lifetime: 464 hours (19 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 60 19 08 00  Error: UNC at LBA = 0x00081960 = 530784
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 60 19 08 40 00      05:01:15.302  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00      05:01:15.301  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00      05:01:15.301  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      05:01:15.301  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      05:01:15.300  SET FEATURES [Set transfer mode]

Error 30 occurred at disk power-on lifetime: 464 hours (19 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 60 19 08 00  Error: UNC at LBA = 0x00081960 = 530784
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 60 19 08 40 00      05:01:12.098  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00      05:01:12.098  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00      05:01:12.097  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      05:01:12.097  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      05:01:12.097  SET FEATURES [Set transfer mode]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       520         -
# 2  Extended offline    Completed without error       00%       306         -
# 3  Extended offline    Completed without error       00%       135         -
# 4  Short offline       Completed without error       00%       132         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
```
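For reference, the LBA that smartctl prints in each error entry can be reconstructed from the register dump. The sketch below assumes the 28-bit layout implied by the legend in the log (SN, CL and CH as the low, middle and high bytes, plus the low nibble of DH); the function name is made up for illustration.

```python
# Decode the LBA from the "ER ST SC SN CL CH DH" register line of the
# SMART error log. Assumes the 28-bit register layout from the log legend.

def decode_lba28(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine SN/CL/CH and the low nibble of DH into one LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

# Register line copied from the error entries in the report.
registers = "40 51 08 60 19 08 00"
er, st, sc, sn, cl, ch, dh = (int(b, 16) for b in registers.split())

lba = decode_lba28(sn, cl, ch, dh)
print(hex(lba), lba)  # 0x81960 530784

# Sector sizes from the INFORMATION SECTION of the same report.
LOGICAL = 512
PHYSICAL = 4096
per_physical = PHYSICAL // LOGICAL

# Does the failing LBA start a physical sector, and does the sector
# count register cover exactly one physical sector?
print(lba % per_physical == 0, sc == per_physical)  # True True
```

All five logged errors carry the same register values, so they appear to be repeated attempts to read the same 4096-byte region rather than five separate defects.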

Looking online, the general advice seems to be to replace the drive. The drive is only a couple of months old and still within warranty, so replacing it would most likely not be a problem. Digging a little deeper with badblocks, however, I am not seeing any issue. The drive is part of an mdadm RAID1 array. I tried fsck /dev/sda, but for some reason fsck is not able to check RAID members, so I used badblocks instead, and it didn't find any issue.

```
root@server:~# badblocks -v /dev/sda
Checking blocks 0 to 976762583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
```
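The block numbers badblocks prints are not the same units as the LBAs in the smartctl log. A minimal sketch of the conversion follows, assuming badblocks ran with its default 1024-byte block size (no -b was given above) and that smartctl counts 512-byte logical sectors. The helper names and the 1024-block margin are arbitrary choices for illustration.

```python
LBA_SIZE = 512      # logical sector size reported by smartctl
BLOCK_SIZE = 1024   # badblocks default block size (assumed, -b was not given)

def lba_to_block(lba: int) -> int:
    """Convert a 512-byte LBA to a badblocks block number."""
    return lba * LBA_SIZE // BLOCK_SIZE

def badblocks_window(device: str, lba: int, margin: int = 1024) -> list:
    """Build a read-only badblocks command covering blocks around an LBA."""
    centre = lba_to_block(lba)
    first = max(0, centre - margin)
    last = centre + margin
    # badblocks expects last_block before first_block on the command line.
    return ["badblocks", "-v", "-b", str(BLOCK_SIZE), device, str(last), str(first)]

# Sanity check: the block count above matches the capacity smartctl reported.
total_blocks = 976762583 + 1
print(total_blocks * BLOCK_SIZE)   # 1000204886016

print(lba_to_block(530784))        # 265392
print(" ".join(badblocks_window("/dev/sda", 530784)))
# badblocks -v -b 1024 /dev/sda 266416 264368
```

Scanning only that window takes seconds rather than days, which makes it practical to re-check the logged LBA repeatedly.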

The error messages from smartctl are concerning, but since the test report does not highlight any errors I am even more confused. I read the smartctl page on bad blocks, which suggests marking the bad blocks and working around the issue. Since the drives are still relatively new, I am reluctant to simply ignore the issue.

Any ideas what to do about this?

----------

## Logicien

Hello,

Some BIOSes have a test for hard drives. It could be useful to run it. Some also have an interface to display and analyse SMART data.

If you can run badblocks from a live CD with the -n (non-destructive) or -w (destructive, erases data) option (read man badblocks), that would be the best test to find out whether there are any bad blocks on the hard drive.

The 34 errors that were logged by the drive's SMART feature are not necessarily all related to bad blocks, if any can be found at all.

Someone with deeper knowledge than mine might be able to work out what led to those last five errors and whether RAID has something to do with them.

The command smartctl -a /dev/sda does not report any SMART errors on my hard drive, even though plenty of Pre-fail and Old_age TYPE entries are listed.

----------

## eccerr0r

It could have been a prefetch read that failed, after which the sector was rewritten and passed just fine.  It's hard to say what to do at this point, but the HD manufacturer will likely claim the disk is just fine and not a candidate for warranty service...

I know that Hitachi GST drives require the use of their standalone Drive Fitness Test (DFT) disk to test the drive and produce an error code, which is submitted with the RMA.  But DFT has passed disks that have errors on them...

----------

## lostinspace2011

I am busy running badblocks -n on both drives in the array. Will report back in 3 days' time.

----------

## lostinspace2011

Last Sunday I started running badblocks -v -n on /dev/sda and /dev/sdb, and it is still running. Using iotop I see that it is reading/writing at about 3 M/s. Taking this into account, and given a 1TB drive, I figure the whole process should take about 100 hours (4 days). However, I am now on day 5 and it is still running. Looking online, it has been suggested that a 1TB drive should take around 72 hours.

Should I be patient and let this continue?

Are there any other options to check the drives and verify whether they are faulty?
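The roughly 100-hour figure follows from dividing the drive capacity by the observed throughput. A rough sketch of that arithmetic is below, assuming iotop's "3 M/s" means 3 × 10^6 bytes per second and that the rate stays constant over the whole disk, neither of which is guaranteed.

```python
CAPACITY_BYTES = 1_000_204_886_016  # "User Capacity" from the smartctl report

def hours_for_pass(capacity_bytes: int, bytes_per_second: float) -> float:
    """Hours needed to move capacity_bytes once at a constant rate."""
    return capacity_bytes / bytes_per_second / 3600

# 3 M/s from iotop, taken here as 3e6 bytes/s (an assumption).
print(round(hours_for_pass(CAPACITY_BYTES, 3e6), 1))  # 92.6
```

In non-destructive mode badblocks touches each block more than once (it reads the data, writes a test pattern, reads it back and restores the original), so a single-pass estimate like this is likely to be optimistic.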

----------

## lostinspace2011

Being a little impatient, I stopped badblocks and ran some other diagnostics. I verified my RAID array, which still seemed fine. Then I read the man page for badblocks and came across the -s option. Now I am running badblocks again with it enabled. In 13:45h it completed just over 10%, so checking the entire drive will take around 5.7 days. I should have just waited on the first run.

----------

## lostinspace2011

I started running badblocks on both drives at roughly the same time, but sdb seems to be running a lot faster. Any ideas why that could be? Given that sda and sdb are the same model, purchased at the same time, I would have expected closer results. It seems that sdb is speeding up as the test progresses. Both drives are part of the same RAID1 array and should contain the same data.

 *Quote:*   

> root@panda:~# badblocks -n -v -s /dev/sda
> 
> Checking for bad blocks in non-destructive read-write mode
> 
> From block 0 to 976762583
> ...

 

 *Quote:*   

> root@panda:~# badblocks -n -v -s /dev/sdb
> 
> Checking for bad blocks in non-destructive read-write mode
> 
> From block 0 to 976762583
> ...

 

----------

## eccerr0r

I've got disks that have SMART errors logged, yet the disks have still lasted a long time...

It's possible that the disk had some trouble reading sectors and hence fell behind (see if more errors popped up in the SMART logs!), but you also have to consider that the disk scheduler sometimes isn't completely fair, and some disks may get the lion's share of the load...

If you're paranoid about it, go ahead and download HGST's drive fitness test.  It will give you a code that will enable RMA if applicable.

----------

## lostinspace2011

I tried running Hitachi DFT, but the software didn't recognise my drives / controller. In fact it didn't find any drives present on the problem PC. I tested the DFT disk on another computer and there it found the drives. Maybe they don't have support for my controller.

----------

## lostinspace2011

The difference in time taken between the drives is growing. sdb is almost 10% ahead of sda.

```
root@panda:~# badblocks -n -v -s /dev/sda
Checking for bad blocks in non-destructive read-write mode
From block 0 to 976762583
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern: 62.42% done, 87:10:31 elapsed. (0/0/0 errors)
```

```
root@panda:~# badblocks -n -v -s /dev/sdb
Checking for bad blocks in non-destructive read-write mode
From block 0 to 976762583
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern: 71.57% done, 87:10:58 elapsed. (0/0/0 errors)
```
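To put the two progress lines on a common scale, each can be extrapolated to a projected total run time. This is only a sketch: it assumes the `-s` progress format shown above and a constant speed across the disk, which the thread itself suggests is not the case. The function name is made up for illustration.

```python
import re

# Matches the progress text printed by badblocks -s in the output above.
PROGRESS = re.compile(r"(\d+\.\d+)% done, (\d+):(\d+):(\d+) elapsed")

def projected_total_hours(line: str) -> float:
    """Linearly extrapolate total run time from one progress line."""
    match = PROGRESS.search(line)
    if match is None:
        raise ValueError("no badblocks progress found in line")
    percent = float(match.group(1))
    hours, minutes, seconds = (int(match.group(i)) for i in (2, 3, 4))
    elapsed = hours + minutes / 60 + seconds / 3600
    return elapsed / (percent / 100)

sda = "Testing with random pattern: 62.42% done, 87:10:31 elapsed. (0/0/0 errors)"
sdb = "Testing with random pattern: 71.57% done, 87:10:58 elapsed. (0/0/0 errors)"

for name, line in (("sda", sda), ("sdb", sdb)):
    print(name, round(projected_total_hours(line)))
# sda 140
# sdb 122
```

On these numbers sda would need roughly 18 hours more than sdb for a full pass, if both kept their average pace so far.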

----------

## eccerr0r

You might have to take your controllers out of AHCI mode temporarily for DFT to work and place them in legacy mode. This should be a BIOS option.

Make sure you change it back after trying DFT.

----------

## lostinspace2011

Thanks for that suggestion. I will try this next, once badblocks has completed.

sda is still busy and only at 90%. sdb has already finished, after around 130 hours.

```
root@panda:~# badblocks -n -v -s /dev/sdb
Checking for bad blocks in non-destructive read-write mode
From block 0 to 976762583
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern: done
Pass completed, 0 bad blocks found. (0/0/0 errors)
```

----------

## Ant P.

All attribute counters are showing good values and the only errors there seem to be error responses from an unsupported command. Were you playing around with hdparm by any chance?

----------

## lostinspace2011

Neither Hitachi DFT nor badblocks found any errors. I guess I can (will have to) live with the drives a little longer. For DFT to work I had to change the BIOS from AHCI to IDE. Only then did it recognise the drives correctly. Thanks for the pointer.

----------

## lostinspace2011

Today smartctl found this:

Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors

----------

## Herring42

Looking at your original post, I'd say that your drive is beginning to fail.

I don't know if you are aware of how drives work, but I'll assume not ;)

The drive has a store of spare sectors that are transparently mapped in place of bad sectors as the drive finds them. The drive monitors the CRC on the sector, as well as the read current, and so can tell when a sector is about to fail or has failed. The sector is marked bad, and the data is copied to the spare sector. The relevant SMART line is this:

```
5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0 
```

This would indicate that zero sectors had been reallocated at the time of that report!

Note that this process is transparent to programs such as badblocks, though running badblocks will cause the drive to discover any further bad sectors. You can only see the reallocations using smartctl.

You can get the drive to perform a surface scan itself using smartctl:

```
smartctl -t long /dev/sdX
```

This will be far faster than badblocks! My 1TB drive completes in just over four hours.

Once the spare sectors have been used up, the drive will start reporting bad sectors that badblocks will pick up. It should be noted that once this stage has been reached, the drive will die very rapidly.

Personally, I like to be safe rather than sorry and replace the drive once it starts to reallocate sectors, though if you monitor the situation, the drive would probably last a good long time after its first reallocation. When a drive is accessing sectors that have been reallocated, it will of course be slower, as they are read out of sequence.

Other relevant lines are:

```
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
```

Read more here: http://en.wikipedia.org/wiki/S.M.A.R.T.#ATA_S.M.A.R.T._attributes
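Watching those four attributes over time can be scripted. The following is a minimal sketch that parses attribute rows in the layout smartctl 5.41 printed above and reports any watched attribute with a non-zero raw value. The helper names are made up, and the column positions are an assumption that may not hold for other smartctl versions or output modes.

```python
WATCHED = {5, 196, 197, 198}  # reallocation and pending-sector attributes

def parse_attributes(report: str) -> dict:
    """Return {attribute_id: (name, raw_value)} for smartctl attribute rows."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns: a numeric ID, a name,
        # a hex flag, and the raw value in the tenth column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attrs[int(fields[0])] = (fields[1], int(fields[9]))
    return attrs

def nonzero_watched(attrs: dict) -> dict:
    """Keep only watched attributes whose raw value is not zero."""
    return {i: v for i, v in attrs.items() if i in WATCHED and v[1] != 0}

# Rows copied from the report at the top of the thread.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
"""

attrs = parse_attributes(sample)
print(nonzero_watched(attrs))  # {}
```

An empty result matches the original report. A later report with pending sectors would show attribute 197 in the returned dictionary.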

----------

## eccerr0r

Keep in mind that using smartctl -t will not remap sectors.  It will merely flag bad sectors.

The hard drive does not know which sectors are unused (deleted) and which are in use.  If it suddenly could not read a sector, how could it remap the sector without losing data?  The best plan is to leave a bad sector as it is until the user is notified (meaning something tried to read it).  The drive makes a note of this sector in its logs... why? Read on...

Badblocks in *destructive* write mode, however, tells the disk that you don't care about any data on it, because you're writing junk to the disk.  In this case, when you're *writing* to a bad block, NOW the hard disk knows you don't care about the old contents of the sector, and it will initiate the remap without the user noticing that the sector was changed from under him/her...  It uses its record of the earlier failed read to know that this sector has failed in the past, and so to decide whether or not to remap.
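The read-then-write behaviour described above can be pictured with a toy model. This is purely illustrative and not real firmware logic: the class, its fields and the number of spares are all invented, and real drives may behave differently (for instance, reusing the original sector if the rewrite verifies).

```python
class ToyDisk:
    """Toy model of pending and remapped sectors, for illustration only."""

    def __init__(self, weak_sectors, spares=4):
        self.weak = set(weak_sectors)  # sectors whose media has gone bad
        self.pending = set()           # failed on read, not yet remapped
        self.remapped = set()          # redirected to spare sectors
        self.spares = spares

    def read(self, lba):
        if lba in self.weak and lba not in self.remapped:
            # The data cannot be recovered, so the drive only records the
            # failure and reports it to the caller.
            self.pending.add(lba)
            raise IOError(f"UNC at LBA {lba}")
        return b"data"

    def write(self, lba, data):
        # A write means the old contents no longer matter, so a sector that
        # previously failed to read can now be moved to a spare.
        if lba in self.pending and self.spares > 0:
            self.pending.discard(lba)
            self.remapped.add(lba)
            self.spares -= 1

disk = ToyDisk(weak_sectors={530784})
try:
    disk.read(530784)
except IOError as err:
    print(err, "| pending:", len(disk.pending))
# UNC at LBA 530784 | pending: 1

disk.write(530784, b"junk")
print("pending:", len(disk.pending), "reallocated:", len(disk.remapped))
# pending: 0 reallocated: 1

disk.read(530784)  # succeeds now that the sector is remapped
```

In this model a read-only scan can only ever grow the pending count, while a write to a pending sector is what moves it into the reallocated count.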

----------

## lostinspace2011

I would have thought that after running badblocks the drive would have hit every sector and done the remapping.

----------

## eccerr0r

More may have shown up... this is not a good sign.

But make sure you're in *DESTRUCTIVE* mode...

Nondestructive mode can do some strange stuff because it's actually reading.

----------

## Herring42

 *lostinspace2011 wrote:*   

> I would have thought that after running badblocks the drive would have hit every sector and done the remapping.

 

Indeed so, which is probably why you are now seeing some errors. What does 'smartctl -a /dev/sda' say now?

----------

