# Time to put this drive to rest?

## Zucca

My home server had a huge load spike. I went to investigate:

```
SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1817

  3 Spin_Up_Time            0x0027   173   173   021    Pre-fail  Always       -       2308

  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       79

  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0

  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0

  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18885

 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0

 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       78

192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       34

193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       177

194 Temperature_Celsius     0x0022   115   098   000    Old_age   Always       -       28

196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1

198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0

200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2

```

```
2018-03-16T09:35:03+0200 [kernel] [5929925.151666] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 842, flush 0, corrupt 0, gen 0

2018-03-16T09:35:03+0200 [kernel] [5929925.151674] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 843, flush 0, corrupt 0, gen 0

2018-03-16T09:35:03+0200 [kernel] [5929925.151677] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 844, flush 0, corrupt 0, gen 0

2018-03-16T09:35:03+0200 [kernel] [5929925.151681] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 845, flush 0, corrupt 0, gen 0

2018-03-16T09:35:03+0200 [kernel] [5929925.151684] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 846, flush 0, corrupt 0, gen 0

2018-03-16T09:35:03+0200 [kernel] [5929925.151688] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 847, flush 0, corrupt 0, gen 0

2018-03-16T09:35:03+0200 [kernel] [5929925.151691] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 848, flush 0, corrupt 0, gen 0

2018-03-16T09:35:07+0200 [kernel] [5929929.561469] BTRFS info (device sdc4): read error corrected: ino 353768 off 99848192 (dev /dev/sdc4 sector 1342830504)

2018-03-16T09:35:07+0200 [kernel] [5929929.561522] BTRFS info (device sdc4): read error corrected: ino 353768 off 99856384 (dev /dev/sdc4 sector 1342830520)

2018-03-16T09:35:07+0200 [kernel] [5929929.561528] BTRFS info (device sdc4): read error corrected: ino 353768 off 99852288 (dev /dev/sdc4 sector 1342830512)

2018-03-16T09:35:07+0200 [kernel] [5929929.561574] BTRFS info (device sdc4): read error corrected: ino 353768 off 99860480 (dev /dev/sdc4 sector 1342830528)

2018-03-16T09:35:07+0200 [kernel] [5929929.561642] BTRFS info (device sdc4): read error corrected: ino 353768 off 99864576 (dev /dev/sdc4 sector 1342830536)

2018-03-16T09:35:07+0200 [kernel] [5929929.561698] BTRFS info (device sdc4): read error corrected: ino 353768 off 99868672 (dev /dev/sdc4 sector 1342830544)

2018-03-16T09:35:07+0200 [kernel] [5929929.561770] BTRFS info (device sdc4): read error corrected: ino 353768 off 99872768 (dev /dev/sdc4 sector 1342830552)
```

Also

```
[kernel] [5929925.150153] ata3.00: exception Emask 0x0 SAct 0x610 SErr 0x0 action 0x0

[kernel] [5929925.150156] ata3.00: irq_stat 0x40000008

[kernel] [5929925.150160] ata3.00: failed command: READ FPDMA QUEUED

[kernel] [5929925.150168] ata3.00: cmd 60/38:20:a8:ff:ed/00:00:58:00:00/40 tag 4 ncq dma 28672 in

[kernel] [5929925.150168]          res 41/40:00:aa:ff:ed/00:00:58:00:00/00 Emask 0x409 (media error) <F>

[kernel] [5929925.150170] ata3.00: status: { DRDY ERR }

[kernel] [5929925.150172] ata3.00: error: { UNC }

[kernel] [5929925.151625] ata3.00: configured for UDMA/133

[kernel] [5929925.151647] sd 2:0:0:0: [sdc] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08

[kernel] [5929925.151651] sd 2:0:0:0: [sdc] tag#4 Sense Key : 0x3 [current]

[kernel] [5929925.151654] sd 2:0:0:0: [sdc] tag#4 ASC=0x11 ASCQ=0x4

[kernel] [5929925.151658] sd 2:0:0:0: [sdc] tag#4 CDB: opcode=0x28 28 00 58 ed ff a8 00 00 38 00

[kernel] [5929925.151661] blk_update_request: I/O error, dev sdc, sector 1491992490

[kernel] [5929925.151666] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 842, flush 0, corrupt 0, gen 0
```

Then follows the usual btrfs errors.

It's a pretty old 1TB (I guess) WD Blue spinning platter. I can drop it out of the btrfs pool and the raid-1 array too. No problem.

I'm more interested in the SMART data above.

Current_Pending_Sector with a value of 1 and Multi_Zone_Error_Rate with a value of 2 seem to indicate an impending total failure of the drive. Right?
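For reference, output like the table above comes from smartctl (smartmontools); a minimal sketch, assuming the drive is still at /dev/sdc as in the kernel logs:

```shell
# Dump the vendor-specific attribute table (the output quoted above)
smartctl -A /dev/sdc

# Ask the drive for its overall-health self-assessment (PASSED/FAILED)
smartctl -H /dev/sdc
```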

----------

## bunder

one pending sector isn't really a whole lot to worry about.

i'd be more concerned that smart found one problem but btrfs found many consecutive errors.

theoretically you could try wiping the drive and keep using it, but when in doubt throw it out.   :Laughing: 

----------

## Zucca

 *bunder wrote:*   

> when in doubt throw it out.  

 I already made an order for 2TB Toshiba and 2TB WD RED.

I might as well grow my hard disk space at the same time... Or keep the other as a spare.

Raw_Read_Error_Rate value of that drive is just too high for me to accept.  :Razz: 

----------

## mike155

 *Quote:*   

> Raw_Read_Error_Rate value of that drive is just too high for me to accept. 

 

You're kidding, aren't you?  :Wink:  Look at the value on my Seagate ST32000644NS hard disk:

```
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000f   083   063   044    Pre-fail  Always       -       204787750

  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0

  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       76

  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       1

  7 Seek_Error_Rate         0x000f   076   060   030    Pre-fail  Always       -       48133305

  9 Power_On_Hours          0x0032   050   050   000    Old_age   Always       -       44329

```

The drive works perfectly fine. A high value for Raw_Read_Error_Rate means nothing - at least not on Seagate drives.

You could do a 'dd if=/dev/sdX of=/dev/null bs=10M' to test your drive. It will take a couple of hours, but if you don't get any errors, you'll know that the drive is ok.
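Spelled out with a progress readout (`status=progress` needs GNU coreutils 8.24 or newer; `/dev/sdX` is a placeholder for the drive under test):

```shell
# Sequentially read the entire drive and discard the data; any sector
# the drive can't read surfaces as an I/O error in the kernel log.
# Non-destructive, but expect it to run for hours on a 1-2TB disk.
dd if=/dev/sdX of=/dev/null bs=10M status=progress
```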

----------

## Zucca

Strange.

All the other drives I have (five more) have Raw_Read_Error_Rate between 0 and 2.

With one exception of 6, which is also a WD Blue 1TB. But it seems to be about half the age of the others...

Also, one of my drives, a WD Blue 2TB, has a Load_Cycle_Count of 230395, while on the others it's under 500.

----------

## mike155

Unfortunately, many of the SMART parameters and values are mostly meaningless, because they are not standardized.

The only SMART parameters that seem to be useful to (pre-) detect a drive failure are: Reallocated_Sector_Ct and Current_Pending_Sector.

A high value for Load_Cycle_Count may indicate trouble. Look at the data sheet of your drive; the number of allowed load cycles should be specified there. High values typically mean that the drive supports APM (Advanced Power Management). I try to avoid such drives, at least for servers. Use '/sbin/hdparm -B /dev/sdX' to check if your drive supports APM. If you want, you can disable APM using '/sbin/hdparm -B 255 /dev/sdX'. After you disable APM, Load_Cycle_Count should stop rising.
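The hdparm calls above, side by side (per the hdparm man page, levels 1-127 permit spin-down, 128-254 do not, and 255 disables APM):

```shell
hdparm -B /dev/sdX      # query the current APM level (or "not supported")
hdparm -B 255 /dev/sdX  # disable APM entirely
hdparm -B 254 /dev/sdX  # alternative: highest APM level, no spin-down
```

Note that the setting is not persistent across power cycles; it has to be reapplied at boot, e.g. via /etc/hdparm.conf or an init script.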

EDIT: I just looked at the specification sheet of WD Blue 2TB drives. It specifies 300,000 load cycles. If your current value is 230395, you definitely should do something!

----------

## Zucca

 *mike155 wrote:*   

> Unfortunately, many of the SMART parameters and values are mostly meaningless, because they are not standardized.

 I've always wondered why. Every drive manufacturer supports SMART, but the values are some sort of guessing game. Bah! Luckily I have something to compare against: all my drives are WD.

 *mike155 wrote:*   

> I just looked at the specification sheet of WD Blue 2TB drives. It specifies '300.000' load cycles. If your current value is 230395, you definitely should do something!

  *Zucca wrote:*   

> I already made an order for 2TB Toshiba and 2TB WD RED.
> 
> I might as well grow hard disk space at the same time... Or left the other as a spare.

 ... It will be interesting to see how the SMART values on the Toshiba evolve...

Also I'll recheck my hdparm configurations. Thanks.

----------

## P.Kosunen

 *mike155 wrote:*   

> A high value for Load_Cycle_Count may indicate trouble.

 

On Greens I have seen millions; I wouldn't worry about a couple hundred thousand.

http://idle3-tools.sourceforge.net/

You could try increasing the parking time a bit.
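With idle3-tools that might look like the following (device name is a placeholder; the value encoding is WD-specific, per the idle3-tools documentation: 1-128 count in 0.1 s steps, 129-255 in 30 s steps, and the drive needs a full power cycle before the new timer takes effect):

```shell
idle3ctl -g /dev/sdX      # show the current idle3 (head-parking) timer
idle3ctl -s 138 /dev/sdX  # set it to (138-128)*30 = 300 seconds
idle3ctl -d /dev/sdX      # or disable the timer entirely
```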

----------

## frostschutz

 *bunder wrote:*   

> one pending sector isn't really a whole lot to worry about.

 

That is what the hard drive vendors want you to believe.

A hard drive is supposed to store data - not lose it. With one pending sector, it has already lost data. That's not acceptable.

I'd replace the drive. If there is no backup, ddrescue. Once the drive is removed / ddrescue'd, you can do a destructive badblocks and decide whether it's worth giving it another shot or not. Either way, I would no longer trust it with important data.
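The rescue-then-retire workflow described above might be sketched like this (both device names are placeholders, and `badblocks -w` overwrites every sector on the drive, so it must only run after the data is safe):

```shell
# 1. Copy everything readable onto a healthy disk, keeping a map file
#    of bad areas so an interrupted run can resume where it left off.
#    -f is required because the output is a block device.
ddrescue -f /dev/sd_old /dev/sd_new rescue.map

# 2. Only AFTER the data is rescued: destructive write-mode test of
#    every sector (-w write test, -s progress, -v verbose).
badblocks -wsv /dev/sd_old
```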

idle3 is built into hdparm as well (-J) - I have used it on my WD Green drives and they lived for a long time... (still running) ...but I don't know if that's just luck or in any way related to idle3. There is a lot of panic about this but no reports of massive failures (like the Deathstar et al.)

----------

## Jaglover

You may want to do something like 'dd if=/dev/sdX of=/dev/null conv=noerror bs=1M'; it will try to read every sector, revealing all bad ones.

----------

## NeddySeagoon

Zucca,

```
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
```

The drive has lost data already and knows it.

Run the long self test. That Pending Sector count might get worse.
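The long self-test can be started and checked with smartctl (assuming the drive is still /dev/sdc):

```shell
# Kick off the extended self-test; it runs inside the drive in the
# background and smartctl prints an estimated completion time.
smartctl -t long /dev/sdc

# Once it has finished, read the self-test result log.
smartctl -l selftest /dev/sdc
```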

Raw values are often packed bit field, so big numbers are not always a cause for concern.

The VALUE, WORST and THRESH columns are normalised.

If VALUE or WORST is <= THRESH, that SMART parameter has failed.

You have a drive that can't read its own writing.

----------

## Ant P.

Here's the WD Green in my desktop for comparison -

```
SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0

  3 Spin_Up_Time            0x0027   173   155   021    Pre-fail  Always       -       6308

  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3307

  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0

  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0

  9 Power_On_Hours          0x0032   042   042   000    Old_age   Always       -       42934

 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0

 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0

 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       3287

192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       35

193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3307

194 Temperature_Celsius     0x0022   117   105   000    Old_age   Always       -       33

196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0

200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
```

The attributes that are non-zero on yours (IDs 1, 197 and 200) definitely point to a failing drive. The multi-zone errors could indicate it suffered a head crash, however unlikely those may be nowadays. It's even more improbable that the situation will get better from here, though.

----------

## Zucca

Current_Pending_Sector is now at 0. Other critical numbers haven't changed.

I have done nothing yet. I'll wait till Monday/Tuesday for the new disks.

Meanwhile I'll start pulling that one disk out of the system... on the software side of things, I mean. I have redundancy on all the data, so pulling one from the system isn't much of a task. It just takes some time to rebalance itself.
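The software-side removal might look like this sketch (mount point, array and partition names are placeholders; the btrfs delete migrates the device's chunks onto the remaining drives before dropping it, which is why it takes a while):

```shell
# btrfs: rebalance the device's data onto the remaining drives, then drop it
btrfs device delete /dev/sdc4 /mnt/pool

# mdraid: mark the member faulty, then remove it from the array
mdadm /dev/md0 --fail /dev/sdc2 --remove /dev/sdc2
```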

----------

## NeddySeagoon

Zucca,

If the reallocated sector count did not change, the drive read the sector and was happy with the result.

If the reallocated sector count has increased, the drive got a good read and moved the data.

The reallocated sector count is supposed to increase as the drive ages and data from difficult to read sectors is moved.

The pending sector count should always be zero. That's a count of the sectors the drive knows it can't read.

A long test may be informative. The drive will read its entire data area internally, without generating any host I/O.

----------

## Jaglover

```
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000f   100   100   046    Pre-fail  Always       -       119150

  2 Throughput_Performance  0x0005   100   100   030    Pre-fail  Offline      -       12910592

  3 Spin_Up_Time            0x0003   100   100   025    Pre-fail  Always       -       1

  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       120

  5 Reallocated_Sector_Ct   0x0033   100   100   024    Pre-fail  Always       -       0 (2000 0)

  7 Seek_Error_Rate         0x000f   100   100   047    Pre-fail  Always       -       903

  8 Seek_Time_Performance   0x0005   100   100   019    Pre-fail  Offline      -       0

  9 Power_On_Hours          0x0032   007   007   000    Old_age   Always       -       46663

 10 Spin_Retry_Count        0x0013   100   100   020    Pre-fail  Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       120

192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       65

193 Load_Cycle_Count        0x0032   071   071   000    Old_age   Always       -       580829

194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       39 (Min/Max 22/57)

195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       27

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0 (0 6924)

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x003e   200   253   000    Old_age   Always       -       0

200 Multi_Zone_Error_Rate   0x000f   100   100   060    Pre-fail  Always       -       7741

203 Run_Out_Cancel          0x0002   100   100   000    Old_age   Always       -       429512721134

240 Head_Flying_Hours       0x003e   200   200   000    Old_age   Always       -       0

```

This drive is on 24x7 and has been running like this for at least 2 years. I keep waiting for it to fail, but it keeps running. Shall I take a hammer to it?

----------

## Zucca

 *frostschutz wrote:*   

> idle3 is built into hdparm as well (-J) - I have used it on my WD Green drives and they lived for a long time... (still running) ...but I don't know if that's just luck or in anyway related to idle3. There is a lot of panic about this but no reports of massive failures (like deathstar et al.)

 I have WD Greens also (head parking adjusted). They've been working flawlessly; the SMART data shows no signs of aging. I only see two WD Blues going down. The other one does not error out, but it has a head parking count of 230k.

I've now removed the faulty drive from the raid1 arrays, and the btrfs pool removal is going on at the moment. I wonder if btrfs balances the data among the rest of the drives now, as the removal is taking a long time...

After that I can run the long test on the drive reporting errors.

----------

## Zucca

Finally.

I did a full balancing of the btrfs pool. It started at 2018-03-17T22:40:04 and ended at 2018-03-19T00:40:57. I knew it would take some time, but I disregarded the warning. Silly me. :P

Next time I'll adjust the balancing filters. Anyway, this means I don't need to restore from my backups at the moment. Everything's fine. Next I'll run the long SMART tests.
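A filtered balance, for next time, might look like this (mount point is a placeholder; the usage filters come from the btrfs-balance man page):

```shell
# Only rewrite data (-d) and metadata (-m) chunks that are at most
# 50% full, instead of rewriting every chunk in the pool.
btrfs balance start -dusage=50 -musage=50 /mnt/pool
```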

----------

