# Hard Drives may be going out (solved)

## Bigun

I had a disk fail (sdb - wouldn't spin up) and I sent it off for replacement.  In the meantime I had a spare drive available and put it in for the rebuild.  Things were going normally and the rebuild was running at full speed (19000K/s).  Then at about 30% it dropped to about 1900K/s.  I began to research and found this in the two remaining drives' SMART logs:

/dev/sda:

```
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0

195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1154010999

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
```

/dev/sdc:

```
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0

195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1040456479

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1

198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
```

And the Hardware_ECC_Recovered count on both seems to climb steadily while the rebuild is in progress.

Is this what is causing the slowdown?

If so, I want to replace them, but I have no idea if the manufacturer would honor a warranty claim based on a low-level diagnostic like this - especially if the drive is still responding.  How should I word this to the manufacturer when I request an RMA?

----------

## NeddySeagoon

Bigun,

The raw data doesn't always mean a great deal. It varies from vendor to vendor and often has several data items bit-mapped into the same 32-bit field.

Check the drive vendor's website to see if they tell you how to decode the raw data. It may not be Hardware_ECC_Recovered at all.

The data in the three columns ending with THRESH are normalised - how this is done is vendor specific, but the interpretation rules are the same everywhere.

When a VALUE is less than or equal to THRESH, that value has failed the SMART test.
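That rule can be sketched in a few lines of Python - a hypothetical helper, with the attribute tuples below taken from the /dev/sda log quoted earlier:

```python
# Sketch of the SMART pass/fail rule described above: an attribute has
# failed when its normalised VALUE is less than or equal to its THRESH.

def smart_failed(value, thresh):
    """SMART rule: VALUE <= THRESH means the attribute has failed."""
    return value <= thresh

# (name, VALUE, THRESH) as reported for /dev/sda above
attributes = [
    ("Reallocated_Sector_Ct", 100, 10),
    ("Hardware_ECC_Recovered", 100, 0),
    ("Current_Pending_Sector", 100, 0),
]

for name, value, thresh in attributes:
    status = "FAILED" if smart_failed(value, thresh) else "ok"
    print(f"{name}: {status}")
```

With VALUE at 100 and THRESH at 10 or 0, every attribute above still passes - which is why the raw counts alone don't prove a fault.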

Modern hard drives are 'zoned': they have more sectors per track near the outside of the drive than near the spindle. As the drives rotate at a constant rate, the data rate at the outside is higher than at the inside.  Expect to see a variation of 2 to 3 times in actual data rates.  It's a feature, not a fault.

Keep an eye on

```
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
```

It means that the drive has one sector it would like to remap.

That's something drives do all through their useful life, but just occasionally the drive leaves it too late and the data is lost because the drive cannot read the sector.

It's not a reason for an RMA unless it gets worse or you have failed reads.

----------

## Bigun

Ok, so why the slowdown then?

----------

## eccerr0r

I'd also check whether your PSU is failing, since you've had two disks have issues recently.

Often, HDD manufacturers want an error code from their diagnostic tool before starting an RMA.  These diagnostic tools often use SMART to grab data as well; you should try them and see if they complain about anything.

----------

## NeddySeagoon

Bigun,

The data rate that the drive can sustain drops as the head moves towards the spindle.  There are fewer sectors per revolution of the platter.

The sustained data rate is defined by the head/platter interface, and it's not constant because of the zoning.

If you are using the PC for other things, those other things take their toll on bandwidth - especially seeks to find/write other data.

----------

## Bigun

 *NeddySeagoon wrote:*   

> Bigun,
> 
> The data rate that the drive can sustain drops as the head moves towards the spindle.  There are less sectors per revolution of the platter.
> 
> The sustained data rate is defined by the head/platter interface and its not constant because of the zoning.
> ...

 

Right now the drive set isn't even mounted.

----------

## eccerr0r

Zone bit recording doesn't really do 10:1; as said, it's more like 2-3:1, i.e. about 2 to 3x as fast at the outer rim as at the inner.  At a 10:1 ratio, something else is affecting speeds...

What interface are these drives?  On my PATA disks (using libata) I get much faster than even 19MB/sec during RAID5 resync... well into the 20MB/s range (my SATA RAID5 is in the 50MB/s range).

Are these all the same drives?

----------

## NeddySeagoon

My raid5 resync, if I do nothing else, starts out at about 130MB/s and falls to just over 40MB/s.

When I use the PC at the same time, the sequential reads and writes that the raid5 resync does are interrupted by seeks for other purposes.

Just reading mail and posting on the forums makes between a 10x and 100x speed difference.  These drops are not sustained, but /proc/mdstat shows them. The sync speed recovers once it's the only operation again.
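If you want to watch those fluctuations over time, the current speed can be pulled straight out of /proc/mdstat - a small sketch, with the sample line mirroring the recovery format the kernel prints:

```python
import re

# Sketch: extract the resync/recovery speed from /proc/mdstat text.
# The kernel prints a line like "... finish=861.6min speed=10983K/sec"
# while a rebuild is running.
SPEED_RE = re.compile(r"speed=(\d+)K/sec")

def resync_speed_kb(mdstat_text):
    """Return the current resync speed in K/sec, or None if no resync is shown."""
    m = SPEED_RE.search(mdstat_text)
    return int(m.group(1)) if m else None

sample = ("[============>........]  recovery = 61.2% "
          "(897263620/1465134592) finish=861.6min speed=10983K/sec")
print(resync_speed_kb(sample))  # 10983
```

Polling this (e.g. once a second) and logging the values makes the dips and recoveries easy to correlate with other activity.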

----------

## Bigun

Started out at 140 MB/s (SATA 300 drives), and now it's 14 MB/s.

----------

## NeddySeagoon

Bigun,

Pastebin dmesg.  Provided that resyncing the raid is all it's doing, there is something wrong.

Post /proc/mdstat too.

----------

## Bigun

Thank God I took an offsite backup of the entire drive before I started the rebuild:

```
Personalities : [raid1] [raid6] [raid5] [raid4]

md127 : active raid5 sdb1[3](S) sdc1[4](F) sda1[2]

      2930271872 blocks level 5, 64k chunk, algorithm 2 [3/1] [__U]

md124 : active raid1 sde1[1] sdd1[0]

      96256 blocks [2/2] [UU]

md125 : active raid1 sde2[1] sdd2[0]

      979840 blocks [2/2] [UU]

md126 : active raid1 sde3[1] sdd3[0]

      77074304 blocks [2/2] [UU]

unused devices: <none>
```

dmesg output

Is it done?

----------

## Anon-E-moose

```
[299323.783377] ata5.00: failed command: READ FPDMA QUEUED

[299323.783384] ata5.00: cmd 60/00:00:0f:67:15/04:00:81:00:00/40 tag 0 ncq 524288 in

[299323.783384]          res 51/40:db:34:68:15/00:02:81:00:00/40 Emask 0x409 (media error) <F>

[299323.783387] ata5.00: status: { DRDY ERR }

[299323.783389] ata5.00: error: { UNC }

[299323.796537] ata5.00: configured for UDMA/133

[299323.796592] sd 4:0:0:0: [sdc] Unhandled sense code

[299323.796594] sd 4:0:0:0: [sdc]  

[299323.796596] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

[299323.796597] sd 4:0:0:0: [sdc]  

[299323.796599] Sense Key : Medium Error [current] [descriptor]

[299323.796602] Descriptor sense data with sense descriptors (in hex):

[299323.796603]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 

[299323.796609]         81 15 68 34 

[299323.796612] sd 4:0:0:0: [sdc]  

[299323.796615] Add. Sense: Unrecovered read error - auto reallocate failed

[299323.796616] sd 4:0:0:0: [sdc] CDB: 

[299323.796617] Read(10): 28 00 81 15 67 0f 00 04 00 00

[299323.796623] end_request: I/O error, dev sdc, sector 2165663796
```

Looks like your drive is giving up the ghost, so to speak.

Get the Samsung disk tool, run it, and see what it says.

As said above, most disk manufacturers will want to see that before an RMA is issued.

They will also be able to tell you whether the disk is in warranty (according to them).

Good luck

Note: I had a Hitachi that had pending sectors. I reformatted the whole drive with their disk utility and it reallocated things properly - it went from over 100 "reallocated" sectors to 24 and has held steady there for the last several months.

YMMV

----------

## NeddySeagoon

Bigun,

```
Personalities : [raid1] [raid6] [raid5] [raid4]

md127 : active raid5 sdb1[3](S) sdc1[4](F) sda1[2]

      2930271872 blocks level 5, 64k chunk, algorithm 2 [3/1] [__U] 
```

shows that you have a three-disk raid5 set running on only one drive. That's really bad news. 

dmesg shows

```
[299323.796623] end_request: I/O error, dev sdc, sector 2165663796

[299323.796626] md/raid:md127: read error not correctable (sector 2165663728 on sdc1).

[299323.796629] md/raid:md127: Disk failure on sdc1, disabling device.

[299323.796629] md/raid:md127: Operation continuing on 1 devices.
```

The recovery failed and /dev/sdc1 shows

```
[299323.796632] md/raid:md127: read error not correctable (sector 2165663736 on sdc1).

[299323.796634] md/raid:md127: read error not correctable (sector 2165663744 on sdc1).

[299323.796635] md/raid:md127: read error not correctable (sector 2165663752 on sdc1).

[299323.796637] md/raid:md127: read error not correctable (sector 2165663760 on sdc1).

[299323.796639] md/raid:md127: read error not correctable (sector 2165663768 on sdc1).

[299323.796641] md/raid:md127: read error not correctable (sector 2165663776 on sdc1).

[299323.796642] md/raid:md127: read error not correctable (sector 2165663784 on sdc1).

[299323.796644] md/raid:md127: read error not correctable (sector 2165663792 on sdc1).

[299323.796646] md/raid:md127: read error not correctable (sector 2165663800 on sdc1).
```

a bad patch on the drive.

----------

## Bigun

 *NeddySeagoon wrote:*   

> Bigun,
> 
> ```
> Personalities : [raid1] [raid6] [raid5] [raid4]
> 
> ...

 

Like I said, good thing I had a backup.  I actually took a fresh backup before I unmounted and started the rebuild.

I had already ordered another drive on a premonition that something like this would happen.  That explains the slowdown, then.

Neddy,

You may want to start adding links to your signature of people who benefited from keeping up-to-date backups.

----------

## eccerr0r

Yay, saved by the backup.

I found when having RAID:

1. RAID is not a replacement for backups.  This is a very important point people neglect.

2. A hot spare is nice.  RAID6 counts toward a hot spare.

3. Having a COLD spare on hand helps a LOT - it shortens your exposure window if you don't have a hot spare, and since it has fewer hours on it, it will have different failure characteristics than the ones that have been powered up and spinning.  Test your cold spare before putting it on the shelf.

4. Having good PSUs is also mandatory...  You don't want a common, bad PSU killing all your drives at the same time...

Unfortunately PSU testers are hard to come by...  Really need an o-scope and a load board to test them correctly.

----------

## NeddySeagoon

eccerr0r,

... and that only gets you static testing.  Dynamic testing is just as important in a PC.

A good rule of thumb is to buy mid-priced PSUs.  They are a commodity and you get what you pay for.

----------

## Bigun

Real quick: I'm building the new array and it was going really fast, averaging about 40-60MB/s.  Then it slowed down to about 4MB/s.  I checked the lights on a temporary spare drive I grabbed from work - its HD light seems to be staying on.  I checked smartctl on that drive:

```
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0

```

Nothing seems weird, but I'm wondering if I need to RMA this thing for work.  Here is the whole smartctl entry for that drive.

----------

## NeddySeagoon

Bigun

dmesg and /proc/mdstat would be useful

Your SMART does have one warning sign .... Western Digital Green.

----------

## Bigun

 *NeddySeagoon wrote:*   

> Bigun
> 
> dmesg and /proc/mdstat would be useful
> 
> Your SMART does have one warning sign .... Western Digital Green.

 

It's sped back up to about 11MB/s:

```
Personalities : [raid1] [raid6] [raid5] [raid4]

md127 : active raid5 sde1[3] sdb1[1] sda1[0]

      2930269184 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]

      [============>........]  recovery = 61.2% (897263620/1465134592) finish=861.6min speed=10983K/sec

```

----------

## NeddySeagoon

Bigun,

Look in dmesg for failures and recovery that indicate write retries.

This involves recalibrating the head, which usually produces an audible 'click'.

The 'click of death' is produced by constant recalibrations.

----------

## Bigun

 *NeddySeagoon wrote:*   

> Bigun,
> 
> Look in dmesg for failures and recovery that indicate write retries.
> 
> This involves recalibrating the head, which usually produces an audible 'click'  
> ...

 

I don't see anything, and the RAID has finished.  The WD Green drive is temporary until my other drive gets RMA'd.  I also don't plan on writing anything to the drive until the RMA'd drive gets here and the RAID rebuilds again.  I just need it to stream media to the WDTV Live.   :Very Happy: 

----------

## Bigun

Oh FFS, it's the WD Green drive, but come on!

During recovery of the data on the RAID set:

```
[219380.354814] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

[219380.354819] ata5.00: failed command: SMART

[219380.354823] ata5.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in

[219380.354823]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)

[219380.354825] ata5.00: status: { DRDY }

[219380.354830] ata5: hard resetting link

[219385.700776] ata5: link is slow to respond, please be patient (ready=0)

[219387.534026] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

[219387.560506] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120320/psargs-359)

[219387.560512] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT4._GTF] (Node f54571c8), AE_NOT_FOUND (20120320/psparse-536)

[219387.565768] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120320/psargs-359)

[219387.565774] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT4._GTF] (Node f54571c8), AE_NOT_FOUND (20120320/psparse-536)

[219387.566030] ata5.00: configured for UDMA/133

[219387.566045] ata5: EH complete

[228367.972299] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

[228367.972305] ata5.00: failed command: SMART

[228367.972312] ata5.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in

[228367.972312]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)

[228367.972315] ata5.00: status: { DRDY }

[228367.972330] ata5: hard resetting link

[228373.318358] ata5: link is slow to respond, please be patient (ready=0)

[228378.004336] ata5: COMRESET failed (errno=-16)

[228378.004343] ata5: hard resetting link

[228383.350411] ata5: link is slow to respond, please be patient (ready=0)

[228388.035392] ata5: COMRESET failed (errno=-16)

[228388.035399] ata5: hard resetting link

[228393.381463] ata5: link is slow to respond, please be patient (ready=0)

[228411.053153] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

[228411.062538] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120320/psargs-359)

[228411.062544] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT4._GTF] (Node f54571c8), AE_NOT_FOUND (20120320/psparse-536)

[228411.080905] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120320/psargs-359)

[228411.080911] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT4._GTF] (Node f54571c8), AE_NOT_FOUND (20120320/psparse-536)

[228411.081781] ata5.00: configured for UDMA/133

[228411.081795] ata5: EH complete

[229681.016622] ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen

[229681.016628] ata5.00: failed command: READ FPDMA QUEUED

[229681.016634] ata5.00: cmd 60/68:00:d7:01:fc/02:00:aa:00:00/40 tag 0 ncq 315392 in

[229681.016634]          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)

[229681.016637] ata5.00: status: { DRDY }

[229681.016651] ata5: hard resetting link

[229683.969236] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

[229683.982469] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120320/psargs-359)

[229683.982475] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT4._GTF] (Node f54571c8), AE_NOT_FOUND (20120320/psparse-536)

[229684.001288] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120320/psargs-359)

[229684.001294] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT4._GTF] (Node f54571c8), AE_NOT_FOUND (20120320/psparse-536)

[229684.001455] ata5.00: configured for UDMA/133

[229684.001475] sd 4:0:0:0: [sdb]

[229684.001477] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

[229684.001480] sd 4:0:0:0: [sdb]

[229684.001482] Sense Key : Aborted Command [current] [descriptor]

[229684.001485] Descriptor sense data with sense descriptors (in hex):

[229684.001487]         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00

[229684.001502]         00 00 00 00

[229684.001505] sd 4:0:0:0: [sdb]

[229684.001507] Add. Sense: No additional sense information

[229684.001508] sd 4:0:0:0: [sdb] CDB:

[229684.001509] Read(10): 28 00 aa fc 01 d7 00 02 68 00

[229684.001515] end_request: I/O error, dev sdb, sector 2868642263

[229684.001560] ata5: EH complete
```

I'm starting to wonder about the PSU, any recommended PSU testers?

If it does wind up being the PSU, any recommended PSU?  I only need 300W or so.

----------

## Akkara

A few things off the top of my head that could be exacerbating your problems:

(1) Loose, poorly-contacting power connector (that plugs into the drive)

(2) Too many drives on any given power connector string (the ones toward the end are often affected).  Try to keep it to 2 drives per connector chain.

(3) Loose or jiggle-able sata connector (either on the drive, or on the motherboard.  I strap mine down with velcro ties)

(4) Drives that spin down, and later induce a momentary power sag when spinning back up (often goes along with points (1) or (2)).

Many of these are "power supply problems" without the power supply itself necessarily being bad.  A good quality power supply might fix the issues, but it might simply be that the thicker-gauge wires it comes with handle having lots of drives on a connector chain better.

And regardless of gauge, wire inductance is still there (dependent mostly on length, especially if it's been coiled up into a "neat bundle").  I've seen many a "hard resetting link" message come in just as I spin up a drive that's been inserted into the removable caddy.  I wasn't running raid at the time, so it only caused a momentary hiccup.  But the only way I got rid of it was to power the removable caddy from its own dedicated cable from the power supply.  You might be having similar issues, in addition to bad drives.

----------

## NeddySeagoon

Bigun,

You may only need 300w steady state, but that's the total steady state.

Most PSUs cannot deliver their full rated output because one voltage, or a combination of two, hits its power limit first.

e.g. a random PSU, pulled out of my gander box says

+5v 32A   160w

+3.3v 20A  66w

+12v 16A  192w

-12v 0.8A

-5v 0.3A

5v Stby 3A

That looks like a 300W PSU ... but it goes on in smaller print  ...

5v and 3.3v shall not exceed 165w

5v+3.3v+12v shall not exceed 280w

It's these combinations that are the limiting factor, not the advertised load that a PSU can support.

PSU derating is a good thing too: the PSU runs cooler, lasts longer and produces better regulation.

In short, if you need 300w, don't look at any PSUs under 500w and read the fine print.  

Factor in the drive motor stall load too ... that's about 2A on the 12v rail.  Drive labels usually only give the steady-state current.
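The budget arithmetic above can be checked with a short sketch.  The rail and combination limits are the ones quoted from the example label; the drive loads are illustrative guesses, not measurements:

```python
# Sketch of the PSU budget check described above. Limits come from the
# example label quoted earlier; the load figures are hypothetical.

RAIL_LIMITS = {"5v": 160, "3.3v": 66, "12v": 192}   # per-rail watts
COMBO_LIMITS = {
    ("5v", "3.3v"): 165,            # 5v + 3.3v shall not exceed 165w
    ("5v", "3.3v", "12v"): 280,     # 5v + 3.3v + 12v shall not exceed 280w
}

def psu_ok(loads):
    """True if the given loads (watts per rail) satisfy every limit."""
    if any(loads.get(rail, 0) > watts for rail, watts in RAIL_LIMITS.items()):
        return False
    return all(sum(loads.get(rail, 0) for rail in rails) <= watts
               for rails, watts in COMBO_LIMITS.items())

# Five drives stalling at once: ~2A on 12v each = ~24w per drive on the 12v rail
loads = {"12v": 5 * 24, "5v": 50, "3.3v": 30}
print(psu_ok(loads))  # True
```

Note that a load of 160w on 5v plus 66w on 3.3v passes each per-rail limit but fails the 165w combined limit - exactly the trap the fine print describes.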

----------

## eccerr0r

I had one drive that kept getting kicked from my 500x4 array but, checked by itself, always turned out good.  Since I keep swapping SATA cables around I never noticed the pattern: it was always one particular hotswap bay that kept "failing".  Eventually I root-caused it to a poorly fitting SATA power connector at the hotswap bay (not internal to the hotswap bay).  Cleaning the connector with connector cleaner fixed the issue.

I feel that I have the same problem with one of the drives in my molex-powered 120x4 array...  After many years of use these connectors start to feel loose, and that's a bad sign...  I hope the kicked 120G disk survives another 30K hours and reaches 100K power-on hours and beyond :D

----------

## Bigun

I've got the drives stored in one of these. All 5 bays are being used.

I shut the machine down and I noticed the bay required two power plugs and I had two from one power string plugged in.  I unplugged one and plugged in another from a different string.

I also pulled each drive out and blew out the dust.

I've booted the machine back up and I've re-added the WD Green drive to see what happens.

----------

## Bigun

This is much more promising.

The rebuild is going much, much faster, I haven't seen it go below 10 MB/s, and it's averaging about 20+ MB/s.

I hope this fixes the constant HD failures.

----------

## Bigun

Just following up: this was not enough.  After looking up some details on the drives I was using and the peak power usage across all 5, a 350-watt PSU was not enough.  I bought an Antec 550-watt and it now runs much more stably than it ever has before.

----------

