# can a dying hd cause kernel panic and disksleep?

## DaggyStyle

I have an old 80gb hd, I think it is 4-6 years old; until yesterday it was the boot disk of xp.

I've installed gentoo and placed /boot, /usr/portage and /var on it, among other things.

when I try to run emerge it freezes with all the child processes in disk sleep, and I cannot kill them.

also, once it caused a kernel panic.

can that mean the hd is failing?

also, after that it won't shut down cleanly

----------

## eccerr0r

When I've had disks die, they didn't tend to panic the kernel, though I can see it happening if the kernel was trying to read swap and failed.  Most of the time it just causes disk i/o errors, and for SATA/other hotpluggable disks the disk will "disappear" and all i/o operations to the disk will fail instantly.

Disk sleep can happen if the disk is having trouble reading/writing.
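For reference, "disk sleep" is how tools such as ps and top label processes in state D (uninterruptible sleep), which is why kill has no effect on them until the pending i/o returns. A minimal Python sketch for listing such processes on Linux, assuming /proc is mounted; the function names are made up for illustration:

```python
import os

def parse_stat_line(line):
    """Return (pid, command, state) from the text of /proc/<pid>/stat."""
    # The command sits in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting on spaces.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state

def processes_in_disk_sleep(proc_root="/proc"):
    """List (pid, command) for every process currently in state D."""
    if not os.path.isdir(proc_root):
        return []
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError):
            continue  # the process exited while we were looking
        if state == "D":
            stuck.append((pid, comm))
    return stuck

if __name__ == "__main__":
    for pid, comm in processes_in_disk_sleep():
        print(pid, comm)
```

If the same PIDs stay in this list for minutes at a time, they are blocked waiting on i/o rather than just busy.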

So, plausible yes, but usually the issues are persistent through bootups, etc.  Run badblocks and check SMART information.

----------

## NeddySeagoon

DaggyStyle,

Badblocks is almost useless as hard drives remap bad blocks to spare sectors provided for the purpose.

When all the spares are used, the drive is as good as dead.

smartmontools will allow you to read the drive internal error log.

It could be a data cable fault.

----------

## eccerr0r

 *NeddySeagoon wrote:*   

> Badblocks is almost useless as hard drives remap bad blocks to spare sectors provided for the purpose.
> 
> When all the spares are used, the drive is as good as dead.
> 
> 

 

While it is true that all recent disks remap bad sectors, remapping will NOT be done until the particular sector is written.

If you run badblocks in its default, read-only mode, the kernel will be told that any unreadable sectors are ... unreadable.  The drive will NOT remap them at this time: there's no sense in remapping when it can't read the bad block, and any attempt at guessing would result in the dreaded "silent data corruption".

There's only a fixed number of spares per region of the disk, so the disk may expose bad sectors to the user before it is out of spares.
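To make the read-only idea concrete, here is a rough Python sketch of such a pass: read every block once, write nothing, and note which reads come back as errors. It is not how badblocks itself is implemented; `read_only_scan` and its block size are assumptions for illustration, and unlike a real surface scan these reads may be served from the page cache.

```python
import os

def read_only_scan(path, block_size=4096):
    """Read every block of `path` once; return indices of blocks that fail."""
    bad_blocks = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        total_blocks = (size + block_size - 1) // block_size
        for index in range(total_blocks):
            try:
                os.pread(fd, block_size, index * block_size)
            except OSError:
                # The read failed and the error was passed up to us.
                # Record the block and move on; nothing is ever written.
                bad_blocks.append(index)
    finally:
        os.close(fd)
    return bad_blocks
```

On a real disk this would be pointed at the device node (as root), and any block it lists should also show up as a read error in the kernel log.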

----------

## NeddySeagoon

eccerr0r,

Remapping is done on read too, to ensure data is not lost as it gets harder to reconstruct the data. It's fairly essential for Partial Response Maximum Likelihood read systems.

----------

## DaggyStyle

I'm running the long test from debian livecd as I cannot install the program.

where will I find the log?

----------

## eccerr0r

However, if/when correctable errors do happen, it's transparent to the user and no bad sectors are revealed; no one will know unless the firmware logs it.  One sector will be a short hiccup and things will keep on going.  But if there is a serious, user-noticeable degradation in speed, the errors are likely happening back to back (or frequently).  When there are that many errors, usually a serious failure has occurred (bearing failure, platter warp, servo failure, degaussed surface), and there will likely be sectors that PRML and then ECC cannot recover, so an error will inevitably be exposed to the user.  Only when it gets this far will the kernel end up panicking, as the disk can no longer return the correct data.

When the error rate becomes user-noticeable, most likely something really bad is going on and will eventually be sensed by badblocks. At the least, badblocks will trigger the drive to do the corrections it can, and quickly deplete the spare block reservoir.  (There would be no reason to put a block in "pending" when the drive knows it had a weak/correctable read: it should just rewrite it right away to minimize possible data loss. Doing this for one sector is fast compared with the large number of sectors being requested.)

To get the in-disk-electronics SMART log, you can use smartmontools:  smartctl -a /dev/harddiskdevice

Taking a look at dmesg information during usage can be useful to see if the kernel reported anything while trying to read the disk.
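A small Python sketch of that dmesg check, in case it helps: it keeps only the lines that look like disk read failures. The pattern list is an assumption based on common kernel wording ("I/O error", "Medium Error", "Unrecovered read error"); exact messages differ between kernel versions and drivers, so treat it as a starting point.

```python
import re
import subprocess

# Assumed substrings that the kernel commonly prints for failed disk reads.
ERROR_PATTERN = re.compile(
    r"I/O error|medium error|unrecovered read error", re.IGNORECASE
)

def disk_error_lines(log_text):
    """Return the lines of a kernel log that match the assumed error patterns."""
    return [line for line in log_text.splitlines() if ERROR_PATTERN.search(line)]

if __name__ == "__main__":
    try:
        # dmesg may need root on systems that restrict kernel log access.
        result = subprocess.run(["dmesg"], capture_output=True, text=True)
        for line in disk_error_lines(result.stdout):
            print(line)
    except OSError:
        print("could not run dmesg")
```

No matches does not prove the disk is healthy, but it makes unreadable sectors a less likely explanation.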

----------

## DaggyStyle

here is the output: http://pastebin.com/FLWPzRdm

on the one side, the drive passed, on the other side most of the values are pre-fail or old age. haven't thought of checking dmesg, what's wrong with me?

running badblocks now.

----------

## DaggyStyle

badblocks has ended without any output.

Ahhhh told you I've got a kernel crash!

thanks for the dmesg tip, here it is: http://pastebin.com/umdjaFAg

any hints?

----------

## eccerr0r

The SMART data looks OK, I think.

Your oops looks like a reiserfs-related oops.  Might want to make sure you have a backup handy and reiserfsck the drive to see if you can recover from any corruption -- corruption that can cause disk waits and kernel oopses...

----------

## DaggyStyle

 *eccerr0r wrote:*   

> The SMART data looks OK, I think.
> 
> Your oops looks like a reiserfs-related oops.  Might want to make sure you have a backup handy and reiserfsck the drive to see if you can recover from any corruption -- corruption that can cause disk waits and kernel oopses...

 

the strange part is that the install is a clean one, what could have caused it?

----------

## eccerr0r

While bad sectors can cause corruption, bad memory, overclocked cpu, bad motherboard, bad cables, etc. can cause it too.

Improper shutdowns are probably the biggest reason for it, though.

So are there "can't read LBA address" errors in dmesg?  Probably can rule out physical media if there aren't any... though a poorly implemented hard drive can't be ruled out (I'd be surprised if WD would do silent data corruption when it can avoid it.)

----------

## DaggyStyle

 *eccerr0r wrote:*   

> While bad sectors can cause corruption, bad memory, overclocked cpu, bad motherboard, bad cables, etc. can cause it too.
> 
> Improper shutdowns are probably the biggest reason for it, though.
> 
> So are there "can't read LBA address" errors in dmesg?  Probably can rule out physical media if there aren't any... though a poorly implemented hard drive can't be ruled out (I'd be surprised if WD would do silent data corruption when it can avoid it.)

 

sorry, I've lost you. can you explain more clearly?

----------

## eccerr0r

A bad sector means the hard drive could not return the data that was originally written there; this is, by definition, data corruption.  Just wanted to say this is BAD.  WD is a reputable company, and though returning bad data is unavoidable when a sector goes bad, not reporting that there was an error is "silent data corruption" -- which does not add to reliability.  When the hard disk *knows* that the data is bad (it failed ECC), it should tell the kernel.  When it tells the kernel, the kernel should report an unreadable block, "error on LBA address".  This should show up in dmesg as well.

But other reasons for corruption can be: bad memory (if you don't have ECC memory it makes it harder to detect), overclocked or bad CPU (1+1=3 -- corrupted), bad motherboard, bad cables, etc.  Lots of reasons for corruption.

Improper shutdowns can also cause the issue.  Though journalling helps, it's possible to corrupt the journal as it's being written and this causes havoc.

----------

## DaggyStyle

 *eccerr0r wrote:*   

> A bad sector means the hard drive could not return the data that was originally written there; this is, by definition, data corruption.  Just wanted to say this is BAD.  WD is a reputable company, and though returning bad data is unavoidable when a sector goes bad, not reporting that there was an error is "silent data corruption" -- which does not add to reliability.  When the hard disk *knows* that the data is bad (it failed ECC), it should tell the kernel.  When it tells the kernel, the kernel should report an unreadable block, "error on LBA address".  This should show up in dmesg as well.
> 
> But other reasons for corruption can be: bad memory (if you don't have ECC memory it makes it harder to detect), overclocked or bad CPU (1+1=3 -- corrupted), bad motherboard, bad cables, etc.  Lots of reasons for corruption.
> 
> Improper shutdowns can also cause the issue.  Though journalling helps, it's possible to corrupt the journal as it's being written and this causes havoc.

 

all the reasons above seem strange. the cpu isn't overclocked, and I don't think it is the mem, cpu or mb because the debian livecd works without any problems

what can damage a cable?

that comp was running xp for some time now.

maybe loose cable?

I'm running check disk now, will report

----------

## NeddySeagoon

DaggyStyle,

```
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0

197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
```

Are the two important values. Current_Pending_Sector is a count of the number of sectors needing to be remapped. In this case, 0.

The Reallocated_Event_Count is also good.
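As a rough illustration of reading those rows programmatically, the Python sketch below splits each attribute line into its columns and flags the two attributes above when their raw value is not zero. The column layout is taken from the lines quoted in this post; the helper names and the `WATCHED` set are assumptions for illustration, and raw values that are not plain integers are not handled.

```python
WATCHED = {"Reallocated_Event_Count", "Current_Pending_Sector"}

SAMPLE = """\
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
"""

def parse_attribute_line(line):
    """Split one attribute row into named fields."""
    fields = line.split()
    # Columns: ID, name, flag, value, worst, thresh, type, updated,
    # when_failed, raw value.
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": int(fields[9]),
    }

def nonzero_watched(table_text):
    """Return {name: raw} for watched attributes whose raw count is not 0."""
    flagged = {}
    for line in table_text.splitlines():
        parts = line.split()
        if len(parts) < 10 or not parts[0].isdigit():
            continue  # skip headers and blank lines
        attr = parse_attribute_line(line)
        if attr["name"] in WATCHED and attr["raw"] != 0:
            flagged[attr["name"]] = attr["raw"]
    return flagged

if __name__ == "__main__":
    print(nonzero_watched(SAMPLE))  # prints {} for the rows quoted above
```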

At 

```
Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10247
```

That's 10,247 operating hours; the drive should have a lot of hours left yet.
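To put that figure in perspective, the arithmetic can be sketched in Python as follows (the function name is made up for illustration):

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365

def power_on_summary(hours):
    """Convert a Power_On_Hours raw value into days and years of running."""
    days = hours / HOURS_PER_DAY
    years = days / DAYS_PER_YEAR
    return days, years

days, years = power_on_summary(10247)
print(f"{days:.0f} days, {years:.2f} years of continuous running")
# prints: 427 days, 1.17 years of continuous running
```

So the drive has been spinning for a little over a year in total, spread over its 4-6 years of age.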

I think your drive is good so you need to look elsewhere.

Try the easy things first - swap the SATA data cable. Run memtest. Memtest uses the RAM, CPU and parts of the chipset. If it finds errors, it does not necessarily mean it's a RAM problem.

Only try one thing at a time, or if the problem goes away, you won't know what it was.

----------

## DaggyStyle

ok, partition check is clean, running memtest

----------

## DaggyStyle

till now (one pass), no problems, the mb is gb so I don't think it is the fault, my money is on the cable, maybe it got loose

----------

## eccerr0r

 *DaggyStyle wrote:*   

> ok, partition check is clean

 

Reiserfsck reports clean!??!?!

If that's true then something obscure is screwed up...

----------

## DaggyStyle

 *eccerr0r wrote:*   

>  *DaggyStyle wrote:*   ok, partition check is clean 
> 
> Reiserfsck reports clean!??!?!
> 
> If that's true then something obscure is screwed up...

 

memtest reports 27 tests all passed ok.

time to check under the hood.

will report back.

----------

## DaggyStyle

ok, tried switching the drive to a known working port, no luck, it crashes again on sync.

will go and get a new sata cable tomorrow and try it, hope it will work cause frankly, I'm out of ideas.

----------

## NeddySeagoon

DaggyStyle,

It can be the PSU ... either the metal box with all the wires coming out or the Vcore PSU close to the CPU on the motherboard.

If it were the Vcore PSU, I would expect it to fail during memtest too.

Testing the PSU is very difficult - substitution is really the only way. If you can reduce the load on the PSU by pulling cards or drives, that may help too.

----------

## DaggyStyle

 *NeddySeagoon wrote:*   

> DaggyStyle,
> 
> It can be the PSU ... either the metal box with all the wires coming out or the Vcore PSU close to the CPU on the motherboard.
> 
> If it were the Vcore PSU, I would expect it to fail during memtest too.
> ...

 

this is my psu: http://www.thermaltake.com/product_info.aspx?PARENT_CID=C_00000903&id=C_00000904&name=TR2+550W&ov=n&ovid=

it isn't 80% but I doubt it is the reason. metal box?

----------

## DaggyStyle

ok, I went and got another cable, plugged it in and tried again.

this time, the computer hardlocked.

so I've reformatted the drive again and synced the tree again.

this time it seems to work well.... crossing my fingers and hoping, will mark the thread as solved if all goes well.

----------

## NeddySeagoon

DaggyStyle,

Thermaltake have a good reputation. PSUs are commodity items, you get what you pay for.

That reminds me of a Victorian proverb.  *Quote:*   

> "If you want good quality oats, you must pay a fair price, if you want oats that have been passed through the horse ... "

 

Even with a good reputation, everyone produces the odd dud.

There is another thing. Look at the capacitors located close to the CPU on the motherboard.  The tops must be flat, not domed, and they should all be standing straight. Lastly, look at the bases, where they contact the motherboard. There must be no signs of the rubber bungs being pushed out, nor of the contents leaking.

Post a few images if you want me to look.

Reformatting the drive is unlikely to do anything.

----------

## DaggyStyle

 *NeddySeagoon wrote:*   

> DaggyStyle,
> 
> Thermaltake have a good reputation. PSUs are commodity items, you get what you pay for.
> 
> That reminds me of a Victorian proverb.  *Quote:*   "If you want good quality oats, you must pay a fair price, if you want oats that have been passed through the horse ... " 
> ...

 

I doubt that it is a mb issue...

where I live, it is all about generic parts. getting a part with a high reputation costs a lot...

I'll take a look at the mb and see if I can notice anything. but as a matter of fact, I've been running package installs for a couple of hours and no crash yet.

----------

## DaggyStyle

ok, update time.

a day after the reformat, the issue returned, so I've wiped the partition clean and formatted it as ext2 with 1K blocks

since then all is good.

I think it is a reiserfs bug, or I did something that I wasn't supposed to do.

----------

