# Hard drive causing system freeze? [solved]

## photomaskman

I run amd64 Gentoo on an Athlon 64. I have two Seagate Barracuda hard drives: a 120G on /dev/sda and a 640G on /dev/sdb. The kernel and OS are both on /dev/sda; /dev/sdb is mounted under /mnt and holds various large-volume, non-critical data. I also share /dev/sdb over NFS with the other computers on my home network as a file server.

Recently my computer has started freezing fairly often, a couple or more times a week. It appears to be totally frozen, as even attempts to connect over ssh fail. In the past I have found that even if X is completely borked and I can't get to a command prompt in any way on the local machine, I can still ssh in and kill X remotely.

Last night I was doing an "emerge -auDNv world" on the computer and I started hearing short, somewhat faint buzzing sounds coming from it. I have heard that sound before with failing Seagate hard drives. After a few buzzes, the computer froze again. It was totally unresponsive, so I had to power down, and when it came back up I was greeted with unrepairable filesystem errors that I have to fix by running fsck manually (I'm still working through that).

My question is: is it possible to deduce from this which drive is going bad? Can failure of a drive that contains no part of the OS (/dev/sdb in my case) cause a system freeze like this? If so, then the file system inconsistencies were caused by the system freezing in the middle of an emerge. Or is it my /dev/sda that's going bad?

Last edited by photomaskman on Wed Aug 26, 2009 5:22 pm; edited 1 time in total

----------

## eccerr0r

Usually disk issues will show up in 'dmesg' if you're monitoring it. Based on the total system failure, it's more likely the primary disk with the OS that is having issues.
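A minimal sketch of that dmesg check, for anyone following along. The helper name `disk_errors` and the match pattern are my own assumptions: the pattern is a rough heuristic for messages that often accompany disk trouble, not an exhaustive list of what the kernel can print.

```shell
#!/bin/sh
# disk_errors: filter kernel log text read from stdin for lines that often
# accompany disk trouble. The pattern is a heuristic assumption; adjust it
# to whatever your kernel actually logs.
disk_errors() {
    grep -i -E 'I/O error|ata[0-9]+(\.[0-9]+)?: (exception|failed command)'
}

# Typical use (as root):
#   dmesg | disk_errors
```

The device name in any matching line (sda vs. sdb) is usually the quickest hint about which drive is complaining.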

Another way is to see if your disk's SMART information contains useful data. Emerge smartmontools and see if there are any entries in the disk's error log (smartctl -a /dev/sda).
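Since the full `smartctl -a` dump is long, here is a rough sketch that pulls out a few attributes that tend to go non-zero when sectors are failing. The helper name `smart_suspects` and the choice of attributes are assumptions for illustration; it expects the attribute table printed by `smartctl -A` on stdin, with the raw value in the tenth column.

```shell
#!/bin/sh
# smart_suspects: read `smartctl -A` output on stdin and print the name and
# raw value of selected attributes whose raw value is greater than zero.
# The attribute list is an illustrative choice, not a complete health check.
smart_suspects() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 { print $2, $10 }'
}

# Typical use (as root):
#   smartctl -A /dev/sda | smart_suspects
```

No output only means those particular counters are zero; a drive can still fail a self-test.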

Are you using journalling filesystems?  Odd that they report unrecoverable errors... would likely point to disks.

Anyway whenever a disk fails I don't usually see sudden total system failure, usually one app dies, then another, then another as they all try to request data from the dead disk, until *all* apps are waiting for the dead disk, then the machine is gone.  Is this the behavior you're seeing?   Also you should pretty much always be able to ping the machine over the net if a disk fails...  if not, you likely have other issues...

----------

## cyrillic

Yeah, if ping and ssh can no longer reach the machine, then I would be looking for RAM / motherboard / power supply problems before I would suspect a failing hard drive.

----------

## photomaskman

The freeze is sudden and everything freezes all together. I'll try to run some diagnostics on my memory tonight. As for the file system errors: when I got the "enter root password for maintenance" prompt during startup, I first tried running fsck in manual mode, then finally gave up because I was getting hundreds of prompts. I tried booting again and instead ran

```
#fsck -y /dev/sdb 
```

and was able to get up and running again. There are no errors reported by smartmontools when I use "smartctl -a /dev/sdb" on my hard drives, but I have initiated long offline tests:

```
#smartctl -t long /dev/sdb
```

I'll see if that brings up anything when I check the results tomorrow.

----------

## photomaskman

Running the smartctl test mentioned in my last post gave me an error. Retrieving the results gave me the following output.

```
#smartctl -l selftest /dev/sda

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     34514         6322758
```
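For repeated test runs it can help to pull just the failing LBA out of that log. A small sketch, assuming the log layout shown above (entries start with `#` and a number, and the last column is either an LBA or `-`); the helper name `first_error_lbas` is made up for this example.

```shell
#!/bin/sh
# first_error_lbas: read `smartctl -l selftest` output on stdin and print
# the LBA_of_first_error column for every log entry that recorded one.
# Entries without an error show "-" in that column and are skipped.
first_error_lbas() {
    awk '/^# *[0-9]+/ && $NF ~ /^[0-9]+$/ { print $NF }'
}

# Typical use (as root):
#   smartctl -l selftest /dev/sda | first_error_lbas
```

Fed the log entry above, this prints 6322758.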

I followed the smartmontools howto for dealing with bad blocks, found here:

http://smartmontools.sourceforge.net/badblockhowto.html

Then I ran the long test again. This time the test got past the block I reassigned and I got an error on a block with a larger address. I have now reassigned that block too and am running the smartctl long test again.
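For anyone repeating this, the step in that howto that is easiest to get wrong is converting the reported LBA into a filesystem block number. As I understand the howto, the formula is b = (int)((L - S) * 512 / B), where L is the bad LBA, S is the partition's starting sector (from `fdisk -lu`) and B is the filesystem block size in bytes. A sketch, assuming 512-byte sectors; the helper name `fs_block` and the start sector 63 and block size 4096 used in the example call are hypothetical values, not taken from my disk.

```shell
#!/bin/sh
# fs_block LBA PART_START [BLOCK_SIZE]
# Convert a disk LBA into a filesystem block number within a partition:
#   b = (LBA - PART_START) * 512 / BLOCK_SIZE   (integer division)
# Assumes 512-byte sectors; BLOCK_SIZE defaults to 4096 if omitted.
fs_block() {
    lba=$1
    start=$2
    bsize=${3:-4096}
    echo $(( (lba - start) * 512 / bsize ))
}

# Example with hypothetical partition start 63 and 4096-byte blocks:
#   fs_block 6322758 63 4096    # prints 790336
```

Read the real values for S and B off your own system before touching anything, and make sure the LBA actually falls inside the partition you think it does.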

BTW memtest86+ finished and reported no errors.

----------

## photomaskman

I found the problem to be the secondary hard drive, the one that contains no part of the operating system. I disconnected this hard drive and had no problems for a couple of months. When I reconnected it, the trouble came back. Luckily it's still under warranty, so I sent it back for replacement. But I found out that on my system, a non-OS hard drive can definitely cause sudden total system freezes.

----------

