# hard drive access freezing

## ExecutorElassus

So, about three days ago, I started getting a weird problem: the HDD access LED would light up, and the computer would basically freeze. I could do things that didn't use the hard drive (like, say, swapping desktops), but anything else waited until the hard drive stopped reading. This happened intermittently, about once a minute (though not at regular intervals), and more so when using memory-intensive programs (most notably EVE-Online, which usually uses up all of my RAM). Anyway, the point is, the system freezes up intermittently, and I don't know why.

On the program 'top' under CPU, I see this:

```
Cpu(s):  1.3%us,  1.0%sy,  0.0%ni, 49.2%id, 47.7%wa,  0.7%hi,  0.2%si,  0.0%st
```

The "wa" (I/O wait) figure is what's interesting here. It always runs up to about 50% (I have a dual-core, so I assume that's one CPU core's worth of time spent waiting) when the system is nonresponsive.
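
A quick way to pull that number out of a `top` summary line, as a minimal Python sketch. The function name `parse_top_cpu_line` is made up for illustration, and it assumes the older `47.7%wa` layout shown above (newer versions of `top` print the fields differently):

```python
import re

def parse_top_cpu_line(line):
    """Return {field: percent} from a top 'Cpu(s):' summary line.

    Assumes the '<number>%<name>' layout shown in this thread.
    """
    return {name: float(value)
            for value, name in re.findall(r"([\d.]+)%(\w+)", line)}

sample = ("Cpu(s):  1.3%us,  1.0%sy,  0.0%ni, 49.2%id, "
          "47.7%wa,  0.7%hi,  0.2%si,  0.0%st")
fields = parse_top_cpu_line(sample)
print(f"iowait: {fields['wa']}%")
```

Logging this value every few seconds while the freezes happen would show whether the stalls line up with I/O wait spikes.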

I've run 'fsck -c' on most partitions, and I'm continuing to do so. It hasn't found any problems so far. 

Any suggestions what this is? 

Thanks,

EE

----------

## Bigun

Do you have SMART enabled on your HD?

----------

## ExecutorElassus

I don't know if I have SMART enabled or not.

So, now it's much worse.  I ran fsck, which hung at 50% for a few hours. Then it started repairing. But then the system started breaking (I got a "Bus Failure" when I tried to run a command on another term, and couldn't log in on a third term). When I rebooted, none of the RAID arrays would load: on boot, I got the following:

```
ata1: softreset failed (device not ready)
```

and then it wouldn't mount any of the striped RAID arrays. Since /usr is on a striped RAID array, I can't run any command. However, /init is on a mirrored array, so I can run init scripts.

Is there any way to start the RAID array, so that I can get the drives mounted? What are the options for debugging? None of the LVM devices that should be at /dev/vg/[whatever] exist, so could it be that the RAID controller is simply failing?

This is horrible. Please help me get my box back!

Thanks,

EE

----------

## eccerr0r

Looks like you had an unhealthy disk.  Looks like lvm failed, did your mdadm arrays start up at all?  What do you normally do to start your array?  Any boot logs?

What level RAID are you using?  If you're using RAID0 then likely you may be looking to have some quality time with your backups.

Does /proc/mdstat reveal any bad disks?
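
For reference, a minimal Python sketch of what to look for in that file. The sample text below is hypothetical (it is not from this system), and the function name is made up. It assumes the usual md layout where redundant arrays print a member map such as `[UU]`, with `_` marking a missing disk:

```python
import re

# Hypothetical /proc/mdstat contents, for illustration only.
SAMPLE_MDSTAT = """\
Personalities : [raid0] [raid1]
md1 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/1] [U_]
md3 : active raid0 sdb3[1] sda3[0]
      1953519872 blocks 64k chunks
unused devices: <none>
"""

def degraded_arrays(mdstat_text):
    """Return names of arrays whose member map (e.g. [U_]) shows a missing disk."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded

print(degraded_arrays(SAMPLE_MDSTAT))
```

On a live system you would pass in `open("/proc/mdstat").read()` instead of the sample. Note that a RAID0 array prints no member map, so this check says nothing about striped arrays; those simply fail to start when a member is gone.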

----------

## ExecutorElassus

well, now I'm just confused. 

The RAID arrays are started at boot time. I couldn't see anything on /proc/mdstat ('less' was on the /usr/bin partition, which wasn't mounted), so I turned the box off in despair.

Then!

I wake up, turn the box back on, and Lo! the RAID arrays all mount fine. No errors, and /proc/mdstat reports all arrays working.

So, what's going on here? Could the drive have simply overheated? The RAID controller? The weird error I was having - the system freezing on IOWAIT - isn't happening anymore. Regardless, I'm going to get myself a backup drive (what's that old adage about the two types of people with regards to backups and hard drive failures?), and prepare for the things to fail. I'd nevertheless like some help in figuring out exactly what went (might still go?) wrong.

Sometimes I wish my life were less interesting.

EE

----------

## Bigun

Sounds like one of your drives is giving you trouble.  This is one of the reasons I'm leery about software arrays.

----------

## ExecutorElassus

Yeah, I suspect as much as well. at the least, drive is puking when it gets warm, which is not good.

So, I suppose I should get myself a (hardware?) RAID array, and start moving stuff over. I've never used one before, though; I assume they can be configured by their own internal BIOS at boot time?

Gee, good thing Christmas is coming! I need new stuff!

Thanks,

EE

----------

## dtjohnst

 *ExecutorElassus wrote:*   

> Yeah, I suspect as much as well. at the least, drive is puking when it gets warm, which is not good.
> 
> So, I suppose I should get myself a (hardware?) RAID array, and start moving stuff over. I've never used one before, though; I assume they can be configured by their own internal BIOS at boot time?
> 
> Gee, good thing Christmas is coming! I need new stuff!
> ...

 

Since you're using RAID0, a hardware controller wouldn't help you much. The only reason, in my opinion, to use hardware over software RAID is if disk performance is critical or if you have some absurdly complex setup (i.e. a multinational corporation with clustered servers and failovers). I've been using software RAID at home for years. If you run hardware RAID and the controller fails, you have to get a compatible controller or all your data is gone. If you're a big business buying a $1400 RAID card, that's not a problem. If you're buying a $200 RAID card at your local supplier, you could find yourself in a rough spot. Not to mention that eventually that hardware WILL be phased out, and you'll need to double your storage space to copy your data over to the new hardware. Software RAID, at least FULL software and not the fake-RAID that comes built onto a desktop motherboard, will likely be supported for a long, long time before you're forced to do that. Of course, software RAID CAN be a bit buggy, but so can a $200 RAID card. The ones worth having would certainly break my bank account.

In my opinion, stick with software. If you're using FAKE RAID (built onto a motherboard), I'd migrate off that and onto mdadm.

I'm one of those guys who always has a spare drive. Didn't help me when both drives in my RAID1 gave out at the same time though. Luckily I store data separately with more backup than just RAIDed disks. All I lost was my operating systems. If you've got some network storage or something, you might want to make a tarball of your system and copy it over as a backup for now just in case, then check in your BIOS to see if you have SMART. If so, take a peek and see if SMART is reporting any problems.
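
A rough sketch of the tarball step in Python, using only the standard `tarfile` module. The function name and the demo paths are assumptions for illustration; on a real box you would pass the directories you care about and a destination on the network share:

```python
import tarfile
import tempfile
from pathlib import Path

def make_backup(source_dirs, archive_path):
    """Write a gzip-compressed tarball of source_dirs to archive_path."""
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for src in source_dirs:
            # arcname stores paths relative to the directory name,
            # so the archive does not embed the full absolute path.
            tar.add(str(src), arcname=Path(src).name)
    return archive_path

# Demo on a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "etc"
    src.mkdir()
    (src / "fstab").write_text("# demo file\n")
    archive = make_backup([src], Path(tmp) / "backup.tar.gz")
    with tarfile.open(str(archive)) as tar:
        names = tar.getnames()
print(names)
```

With a drive that may be dying, it is probably worth copying the most important directories first, since reading the whole disk can itself trigger more stalls.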

----------

## ExecutorElassus

one last question: how can I know that this is the drive itself failing, and not some component on the motherboard? I've been running it a few days, and I observe the following:

it does fine for hours if I don't do anything intensive

But!

the same IOWAIT thing becomes more and more prevalent over time if I'm doing something intensive (like, playing EVE. Damn you internet spaceships game!)

if I shut down, and come back the next morning, the problem is gone

and, booting the BIOS takes longer (like, getting through the BIOS splash screen to the bootloader, etc) when the drives are acting balky

Is it possible that some part of the motherboard itself is overheating? I use water cooling, so there are fewer fans putting air on the board. 

Just curious. I'm buying myself a big fat backup drive today anyway.

Thanks,

EE

----------

## dtjohnst

If you have SMART and that's reporting errors, then you can be fairly certain that's where the problem is.

----------

## ExecutorElassus

So, I enabled SMART, and installed smartmon, and ran the query. Both drives passed the test, thus:

```
 # smartctl -H /dev/sdb

smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen

Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

```

And the same for /dev/sda.

So, if it isn't the drives, what is it? Could the RAID controller, or some other part of the mobo, be overheating? The problem gets much worse when I'm using memory/cache/drive intensive stuff.

What's the next step?

Thanks,

EE

----------

## dtjohnst

Passing SMART doesn't guarantee that the drives are good. In fact, when I suspect a bad drive, I always pray they fail SMART, because then the problem is obvious.

In your case, the fact that you only have problems when things heat up may mean the problem is somewhere else. Memory, CPU, etc. I would push your computer until the problems come back and then try SMART again. See if you get a different result.

Personally, if that still comes up fine, I'd throw a different drive in and push that just as hard. If the problem persists, I'd keep replacing parts, cheapest first, until it starts working again. I don't know how old your hard drives are, but consider that most hard drives have a life of 3-5 years, so if they're older than 3 or 4 years, it wouldn't hurt to replace them anyway.

----------

## Sysa

 *ExecutorElassus wrote:*   

> So, I enabled SMART, and installed smartmon, and ran the query. Both drives passed the test, thus:
> 
> ```
>  # smartctl -H /dev/sdb
> 
> ...

 

Next step is to publish the output of

```
smartctl -a /dev/sdX
```

 :) and run the SMART tests at least once.

But be careful! If your disk(s) is/are bad, the system will hang!

So I suggest publishing the reports (above) first.

BTW: how powerful is your power supply?

----------

## ExecutorElassus

/dev/sda:

```
# smartctl -a /dev/sda

smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen

Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===

Model Family:     Seagate Barracuda 7200.10 family

Device Model:     ST3750640AS

Serial Number:    5QD04ND1

Firmware Version: 3.AAK

User Capacity:    750,156,374,016 bytes

Device is:        In smartctl database [for details use: -P show]

ATA Version is:   7

ATA Standard is:  Exact ATA specification draft version not indicated

Local Time is:    Sat Dec  5 23:25:24 2009 CET

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

                                        was completed without error.

                                        Auto Offline Data Collection: Enabled.

Self-test execution status:      (   0) The previous self-test routine completed

                                        without error or no self-test has ever 

                                        been run.

Total time to complete Offline 

data collection:                 ( 430) seconds.

Offline data collection

capabilities:                    (0x5b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        No Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                        General Purpose Logging supported.

Short self-test routine 

recommended polling time:        (   1) minutes.

Extended self-test routine

recommended polling time:        ( 202) minutes.

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000f   091   087   006    Pre-fail  Always       -       0

  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0

  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       109

  5 Reallocated_Sector_Ct   0x0033   097   097   036    Pre-fail  Always       -       133

  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       274298429

  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       9050

 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       109

187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1307

189 High_Fly_Writes         0x003a   098   098   000    Old_age   Always       -       2

190 Airflow_Temperature_Cel 0x0022   053   050   045    Old_age   Always       -       47 (Lifetime Min/Max 24/48)

194 Temperature_Celsius     0x0022   047   050   000    Old_age   Always       -       47 (0 17 0 0)

195 Hardware_ECC_Recovered  0x001a   065   055   000    Old_age   Always       -       205401533

197 Current_Pending_Sector  0x0012   001   001   000    Old_age   Always       -       33601

198 Offline_Uncorrectable   0x0010   001   001   000    Old_age   Offline      -       33601

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0

202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1

ATA Error Count: 1309 (device log contains only the most recent five errors)

        CR = Command Register [HEX]

        FR = Features Register [HEX]

        SC = Sector Count Register [HEX]

        SN = Sector Number Register [HEX]

        CL = Cylinder Low Register [HEX]

        CH = Cylinder High Register [HEX]

        DH = Device/Head Register [HEX]

        DC = Device Command Register [HEX]

        ER = Error register [HEX]

        ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1309 occurred at disk power-on lifetime: 5758 hours (239 days + 22 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 24 09 b2 e0  Error: UNC at LBA = 0x00b20924 = 11667748

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 21 09 b2 e0 00      07:04:21.280  READ DMA EXT

  27 00 00 00 00 00 e0 00      07:04:19.320  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 00      07:04:19.263  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 00      07:04:19.262  SET FEATURES [Set transfer mode]

  27 00 00 00 00 00 e0 00      07:04:23.467  READ NATIVE MAX ADDRESS EXT

Error 1308 occurred at disk power-on lifetime: 5758 hours (239 days + 22 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 24 09 b2 e0  Error: UNC at LBA = 0x00b20924 = 11667748

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 21 09 b2 e0 00      07:04:21.280  READ DMA EXT

  27 00 00 00 00 00 e0 00      07:04:19.320  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 00      07:04:19.263  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 00      07:04:19.262  SET FEATURES [Set transfer mode]

  27 00 00 00 00 00 e0 00      07:04:19.262  READ NATIVE MAX ADDRESS EXT

Error 1307 occurred at disk power-on lifetime: 5758 hours (239 days + 22 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 24 09 b2 e0  Error: UNC at LBA = 0x00b20924 = 11667748

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 21 09 b2 e0 00      07:04:15.107  READ DMA EXT

  27 00 00 00 00 00 e0 00      07:04:19.320  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 00      07:04:19.263  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 00      07:04:19.262  SET FEATURES [Set transfer mode]

  27 00 00 00 00 00 e0 00      07:04:19.262  READ NATIVE MAX ADDRESS EXT

Error 1306 occurred at disk power-on lifetime: 5758 hours (239 days + 22 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 24 09 b2 e0  Error: UNC at LBA = 0x00b20924 = 11667748

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 21 09 b2 e0 00      07:04:15.107  READ DMA EXT

  27 00 00 00 00 00 e0 00      07:04:15.107  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 00      07:04:15.049  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 00      07:04:15.048  SET FEATURES [Set transfer mode]

  27 00 00 00 00 00 e0 00      07:04:13.079  READ NATIVE MAX ADDRESS EXT

Error 1305 occurred at disk power-on lifetime: 5758 hours (239 days + 22 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 24 09 b2 e0  Error: UNC at LBA = 0x00b20924 = 11667748

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 21 09 b2 e0 00      07:04:15.107  READ DMA EXT

  27 00 00 00 00 00 e0 00      07:04:15.107  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 00      07:04:15.049  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 00      07:04:15.048  SET FEATURES [Set transfer mode]

  27 00 00 00 00 00 e0 00      07:04:13.079  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1

SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

```
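
A side note on reading the error log above: the LBA that smartctl prints can be rebuilt by hand from the register dump. This is a minimal sketch assuming the usual 28-bit ATA layout (SN is bits 0-7, CL is bits 8-15, CH is bits 16-23, and the low nibble of DH is bits 24-27); the function name is made up:

```python
def lba_from_registers(sn, cl, ch, dh):
    """Rebuild a 28-bit LBA from ATA task-file register values."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

# Register values from the line "40 51 00 24 09 b2 e0"
# (columns: ER ST SC SN CL CH DH) in the log above.
lba = lba_from_registers(sn=0x24, cl=0x09, ch=0xB2, dh=0xE0)
print(hex(lba), lba)
```

All five logged errors carry the same register values, so they all point at the same sector.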

And sdb:

```
# smartctl -a /dev/sdb

smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen

Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===

Model Family:     Seagate Barracuda ES

Device Model:     ST3750640NS

Serial Number:    5QD05T8S

Firmware Version: 3.AEG

User Capacity:    750,156,374,016 bytes

Device is:        In smartctl database [for details use: -P show]

ATA Version is:   7

ATA Standard is:  Exact ATA specification draft version not indicated

Local Time is:    Sun Dec  6 18:47:09 2009 CET

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

                                        was completed without error.

                                        Auto Offline Data Collection: Enabled.

Self-test execution status:      (   0) The previous self-test routine completed

                                        without error or no self-test has ever 

                                        been run.

Total time to complete Offline 

data collection:                 ( 430) seconds.

Offline data collection

capabilities:                    (0x5b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        No Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                        General Purpose Logging supported.

Short self-test routine 

recommended polling time:        (   1) minutes.

Extended self-test routine

recommended polling time:        ( 202) minutes.

SCT capabilities:              (0x003d) SCT Status supported.

                                        SCT Feature Control supported.

                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000f   114   095   006    Pre-fail  Always       -       0

  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0

  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59

  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0

  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       250942611

  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1466

 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       96

187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0

189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0

190 Airflow_Temperature_Cel 0x0022   056   047   045    Old_age   Always       -       44 (Lifetime Min/Max 23/46)

194 Temperature_Celsius     0x0022   044   053   000    Old_age   Always       -       44 (0 17 0 0)

195 Hardware_ECC_Recovered  0x001a   063   051   000    Old_age   Always       -       6

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0

202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

```

So, I'm relatively certain that a controller on the mobo itself is overheating: I opened the case, and the problems I'm having are now much less frequent. I had no problems last night, when the room was relatively cold. But perhaps there's something in the data above that tells you more about what's going wrong?

Thanks,

EE

----------

## Sysa

Unfortunately your sda is dead:

 *ExecutorElassus wrote:*   

> /dev/sda:
> 
> ```
> # smartctl -a /dev/sda
> 
> ...

 

Of course, you can try to low-level format it, but I think it is hopeless. Replace it ASAP, but keep in mind that some data may be wrong now (you have too many Current_Pending_Sectors).
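
To make the comparison concrete, here is a minimal Python sketch that flags the worrying rows of the attribute table. The rows are copied from the sda report; the watch list is a rule of thumb chosen for this sketch (nonzero raw counts on these attributes are commonly read as a warning sign), not an official threshold, and the function name is made up:

```python
# A few rows copied from the sda report earlier in the thread.
SDA_ATTRIBUTES = """\
  5 Reallocated_Sector_Ct   0x0033   097   097   036    Pre-fail  Always       -       133
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1307
197 Current_Pending_Sector  0x0012   001   001   000    Old_age   Always       -       33601
198 Offline_Uncorrectable   0x0010   001   001   000    Old_age   Offline      -       33601
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

# Assumed watch list: attributes where any nonzero raw count deserves a look.
WATCHED = {"Reallocated_Sector_Ct", "Reported_Uncorrect",
           "Current_Pending_Sector", "Offline_Uncorrectable"}

def suspicious_attributes(table_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in table_text.splitlines():
        parts = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(parts) >= 10 and parts[1] in WATCHED:
            raw = int(parts[9])
            if raw > 0:
                found[parts[1]] = raw
    return found

for name, raw in suspicious_attributes(SDA_ATTRIBUTES).items():
    print(f"{name}: {raw}")
```

Running the same check against the sdb rows would flag nothing, since all four counts are 0 there. It also hints at why `smartctl -H` still said PASSED for sda: none of the normalized values in the table has dropped to its threshold, so the raw counts are the more telling figures.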

BTW: You did not answer about your PSU... maybe it is too weak for your system? Try to evaluate and replace it first - maybe a proper PSU solves your problem.

Also, I have to mention that sda has worse cooling conditions than sdb - try to increase the gap or add an extra fan.

Do not forget to test the HDDs! (See man smartctl for the -t option.)
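
If you would rather script it, a small Python wrapper might look like the sketch below. It only builds the command unless you explicitly call `start_selftest`, which needs root and smartmontools installed. The function names are made up, and the list of allowed test types is an assumption based on the report above, which lists conveyance tests as unsupported on these drives:

```python
import subprocess

# Test types allowed by this sketch; conveyance is left out on purpose.
ALLOWED_TESTS = ("short", "long", "offline")

def selftest_command(device, test="short"):
    """Build the smartctl command line for starting a self-test."""
    if test not in ALLOWED_TESTS:
        raise ValueError(f"unsupported test type: {test}")
    return ["smartctl", "-t", test, device]

def start_selftest(device, test="short"):
    """Actually start the test (requires root and smartmontools)."""
    return subprocess.run(selftest_command(device, test), check=True)

# Dry run: only show what would be executed.
print(" ".join(selftest_command("/dev/sda", "long")))
```

The report gives a recommended polling time of 1 minute for the short test and 202 minutes for the extended one. Afterwards, `smartctl -l selftest /dev/sdX` shows the results.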

----------

## ExecutorElassus

PSU is 1kW. ie, 1,000 watts.

when this problem first hit, I ran fsck on it, and it found a LOT of wrong inodes. I moved them all and then rebooted right in the middle of the process. That happened a couple times. Could that be the reason for all the pending sectors?

Anyway, I'll start backing up. Thanks.

EE

----------

## ExecutorElassus

Also, on a side note: the drives are both Seagate Barracudas, bought in 2007. They should still be operational. Anyway, time to start backing up :(

Thanks,

EE

----------

## Sysa

 *ExecutorElassus wrote:*   

> Also, on a side: the drives are both Seagate Barracudas, bought in 2007. They should still be operational. Anyway, time to start backing up 
> 
> Thanks,
> 
> EE

 

FYI: Purchase time means nothing :) - it looks like your sda surface is bad (maybe the disk was dropped) and the relocation table is full (Reallocated_Sector_Ct ... 133), so the controller can't relocate bad sectors anymore. If the HDD warranty is still valid, try to replace it ASAP (the SMART info should be enough to prove the disk damage).

----------

