# Hard drive crash...need help interpreting error logs

## drvik

Hi everyone,

I've been having random lockups of one of my hard drives... I have a pair of Western Digital 500 GB drives in a RAID 0 config.... I used to have my root on a striped partition but found that when one drive locked up the whole system went with it, so I reinstalled my Gentoo on a separate drive, hoping that would let me track down the HD problems when they occurred... the lockups USED to happen once a day, but since moving to a separate drive it has only happened twice... for a while I thought it was a software problem in some way....

The system is a server that runs 24/7... occasionally I return to the computer to do some work and find that all partitions on the drives are inaccessible and cannot be remounted... rebooting the computer does not fix this because one of the drives is missing from the BIOS... I need to switch off the power for a short while and reboot that way.... when I do this there are no problems until the next crash... running the Western Digital diagnostic software over both drives comes up alright...  I installed smartmontools to try and "capture" the problem....

here's the relevant section of dmesg when the drive is inaccessible...

 *Quote:*   

> ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> 
> ata2.00: (BMDMA stat 0x20)
> 
> ata2.00: tag 0 cmd 0x25 Emask 0x1 stat 0x41 err 0x4 (device error)
> ...

 

and a second crash

 *Quote:*   

> ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> 
> ata2.00: (BMDMA stat 0x21)
> 
> ata2.00: tag 0 cmd 0x35 Emask 0x4 stat 0x40 err 0x0 (timeout)
> ...
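
For anyone trying to read the `tag 0 cmd ...` lines in the two dmesg excerpts above, here is a minimal Python sketch that splits such a line into its fields and names the bits set in the ATA status and error registers. The bit names and the two command opcodes follow the standard ATA register definitions; the names `decode`, `flags`, `STATUS_BITS`, `ERROR_BITS` and `COMMANDS` are made up for illustration, and the regex assumes only the line format quoted in this thread.

```python
import re

# ATA status register bits (CORR and IDX are obsolete in newer specs).
STATUS_BITS = {
    0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
    0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR",
}

# ATA error register bits (0x80 was called BBK in older specs).
ERROR_BITS = {
    0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
    0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "AMNF",
}

# Only the two opcodes that appear in this thread.
COMMANDS = {0x25: "READ DMA EXT", 0x35: "WRITE DMA EXT"}

# Assumed line format, taken from the excerpts quoted above.
LINE_RE = re.compile(
    r"(ata\d+\.\d+): tag (\d+) cmd (0x[0-9a-fA-F]+) Emask (0x[0-9a-fA-F]+) "
    r"stat (0x[0-9a-fA-F]+) err (0x[0-9a-fA-F]+) \((.+)\)"
)


def flags(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def decode(line):
    """Decode one libata 'tag ... cmd ...' line, or return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    cmd = int(m.group(3), 16)
    return {
        "device": m.group(1),
        "command": COMMANDS.get(cmd, hex(cmd)),
        "status": flags(int(m.group(5), 16), STATUS_BITS),
        "error": flags(int(m.group(6), 16), ERROR_BITS),
        "reason": m.group(7),
    }


if __name__ == "__main__":
    for text in (
        "ata2.00: tag 0 cmd 0x25 Emask 0x1 stat 0x41 err 0x4 (device error)",
        "ata2.00: tag 0 cmd 0x35 Emask 0x4 stat 0x40 err 0x0 (timeout)",
    ):
        print(decode(text))
```

Run against the two lines quoted above, the first comes out as READ DMA EXT with status DRDY and ERR and error ABRT (the drive aborted the command), and the second as WRITE DMA EXT with status DRDY and no error bits, which fits the "timeout" label. This only restates what the kernel printed; it does not say why the drive stopped answering.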

 

here is the output of smartctl -a /dev/sda

 *Quote:*   

> bigfoot ~ # smartctl -d ata -a /dev/sda
> 
> smartctl version 5.36 [x86_64-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
> 
> Home page is http://smartmontools.sourceforge.net/
> ...

 

I've been running nightly smartctl self-tests to try and find the problem... the above messages tell me that the drive has basically gone offline and nothing more... can anyone find any other meaning in the error logs?? The other drive comes up squeaky clean... same model, same firmware, same build date, same retailer...  swapping SATA ports on the motherboard and changing cables has not made any difference....

Any opinions?  It looks like I will be asking for a replacement drive... (I should also mention that the first 2 of this model I bought resulted in 1 DOA and the other died within a few hours of use)

drvik

----------

## neonman

What motherboard and sata controller do you have?

I've got a very similar problem with my WD drives (not in AHCI mode) running on an Abit AB9 pro (ICH8 controller).

I've made a post about it, but no real help there yet. You could search for my post if you want (search for: "Abit AB9").

----------

## zbrozlo

 *Quote:*   

> 
> 
> The system is a server that runs 24/7... occasionally I return to the computer to do some work and find that all partitions on the drives are inaccessible and cannot be remounted... rebooting the computer does not fix this because one of the drives is missing from the BIOS... I need to switch off the power for a short while and reboot that way.... when I do this there are no problems until the next crash... running the Western Digital diagnostic software over both drives comes up alright... I installed smartmontools to try and "capture" the problem....
> 
> here's the relevant section of dmesg when the drive is inaccessible... 
> ...

 

What do you mean by "drive is missing from the BIOS (..) when I do this there are no problems until the next crash"?

Do you mean that shortly after the crash your drive is not visible in the BIOS, and after a 'short while' it becomes available again?

To me, it looks like a power-related issue. Could you give some info about your machine config? Maybe the power supply is near-dead...

I'm not a specialist, but the second thing I would check if I were you is whether the memory access latencies and all the DMA-related settings in your BIOS are set properly (I had a similar lockup problem a few years ago with a slow Seagate disk, and the cause was memory access latencies that were set too low).

 *Quote:*   

> 
> 
> Any opinions? It looks like I will be asking for a replacement drive... (I should also mention that the first 2 of this model I bought resulted in 1 DOA and the other died within a few hours of use) 

 

Yeah, it definitely looks like a power problem.

----------

## VoVaN

I have _exactly_ the same problem, but with a Maxtor 6V250F0 drive. The problem occurs randomly and it's very difficult to debug...

I even have a post about that: https://forums.gentoo.org/viewtopic-t-512849-highlight-.html

----------

## drvik

neonman,

I have a DFI Expert motherboard with nvidia 4 SATA and Silicon3114 controllers on board... the problem was occurring on the nvidia controller....

zbrozlo,

after the drive becomes unusable and I perform a Linux reboot, the BIOS POST takes a little longer to go through the drive checking part and fails to find the affected drive... if I press the power button to switch the machine off and back on again, a nasty clicking sound comes from the drive and there is no BIOS POST.... I have to resort to switching off the power supply, waiting a moment and switching it back on... then the computer powers itself on correctly....

my power supply is a new Antec NeoHE 500W.... it's been in the system since February... all my DMA settings appear correct, and my memory checks out ok in memtest86+ v1.65...

my only suspicion is that it's the nvidia SATA controller (I changed to new cables too)... the problem has not really occurred since switching to the Silicon 3114 controller....

drvik

----------

## VoVaN

I think it's chipset related. New kernels include enhanced SATA support for _some_ chipsets, and in my opinion this is the cause of the problem. Something goes wrong there...

----------

## blue_american

Hi everyone,

I have the exact same problem on an Intel ICH7.

When it freezes, the only thing I can do is move the mouse pointer around, and sometimes it happens at boot after loading libata, I think.

I've seen some threads about this problem, and unplugging the other devices is not an option for me.

There's got to be a way out of this, because I have another similar machine (same motherboard, disks, etc.) running Kubuntu, and it is not an issue there.

Is there anyone who solved this problem? Please tell me.

Thanks.

----------

## pathfinder

hi guys,

sorry, I don't have any answer... I'm pissed off too.

same problem.

after a global system update (world uavDN)...

everything freezes, can't reboot, bla bla bla

my opinion: I am posting from a live distro and everything just works fine... so, is it problem related?!

the reiserfs progs utility checked the internal tree and the semantic tree and both are ok. nothing is corrupted...

but I can't get why the hell I cannot access some directories in my tree (backup parts)

hope we'll solve this.

I was a little bit worried after updating world when I saw gcc was being updated too.

I have no idea, but udev was dodgy too, and it seems it affects all the filesystems.

HEEEELP

----------

## blue_american

I'm positive this is a kernel related problem. Try gentoo-sources-2.6.19-r3, I think the problem will go away.

----------

## pathfinder

well, I deleted the /etc/udev/rules.d/00-ipod...   and it's working now... weird, I'm still worried actually

I've backed up all the data... just in case....

----------

## www.datadiscount.fr

Hi,

I am looking for a WD5000KS with the same firmware as yours (06.02E06).

Do you still have the drive?

Would you be interested in selling it to me?

best regards

serge

----------

## drvik

No, I have had the drives replaced with the WD5000YS model... no problems since... I have also changed motherboard to an Asus Commando (P965 with ICH8R) and it's been great ever since...

drvik

----------

## ReDirEct__

I have the same problem... how did you solve it?

This is my problem ----> https://forums.gentoo.org/viewtopic-t-696063-highlight-.html

----------

