# Undetected (hard-disk?) error - [Solved - was RAM]

## Akkara

Today I discovered an undetected (by the system) error in a file.

I had taken backups, and, being my paranoid self, checked it all with md5sum.

One file didn't match.  One bit was off.  Unmounted and remounted, checked that file again, and indeed it was bad.  Re-copying that one file fixed it.  So somewhere between reading the source and writing the destination, one bit got flipped.
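The verification step can be sketched like this; the directories and file names below are illustrative stand-ins for the real source and backup mounts, not the actual layout:

```shell
# Stand-in directories for the real source and backup mounts (illustrative).
src=$(mktemp -d); dst=$(mktemp -d)
echo "important data" > "$src/file.bin"
cp "$src/file.bin" "$dst/file.bin"

# Record checksums of the source, then verify the backup against them;
# a single flipped bit in any file shows up as "FAILED" instead of "OK".
(cd "$src" && md5sum file.bin) > "$src/manifest.md5"
(cd "$dst" && md5sum -c "$src/manifest.md5")
```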

There is nothing in /var/log/messages indicating any sort of error.

The machine otherwise runs fine.  It has seen a total of approximately 20TB of backup traffic in its current configuration, which was recently upgraded to 8GB RAM on an Asus P5KC mobo, running ~8% slow (DDR800 RAM running at DDR732).

I'm very surprised there was nothing in /var/log/messages.  Any idea what might have gone wrong?  Could the data have gotten corrupted while sitting in memory?  Or perhaps while being DMA'ed from the SATA controller to RAM?  Would RAID have been any help, given that there didn't seem to be any warnings from the kernel?

I'm at a loss here.  I caught it - this time.  But whatever happened, what's to say it couldn't happen while processing the data rather than just backing it up, where I don't have a checksum to match against?  Maybe I need to look into ECC RAM (are there mobos that support that for not too much $$)?  And what about the I/O devices themselves - is there any checking that the transfer actually completed correctly?  Is there something in the kernel that I can enable to better detect these sorts of things?

Edit/addition: kernel is 2.6.22-gentoo-r2

Last edited by Akkara on Sun Oct 07, 2007 7:54 am; edited 1 time in total

----------

## eccerr0r

If you don't have:

- ECC / parity RAM

- a chipset/motherboard that supports the above

- I/O with CRC/parity/ECC protection

- a CPU that protects its internal registers

then you can bet that sometime in their usable lifetime you'll see an unexpected bit flip.

Silent Data Corruption is a reality - looks like you're a victim of it.  You should memtest86 your RAM to make sure it was not at fault - whenever I had issues with corruption it usually came down to that.  I've had bad I/O controllers (a bad IDE controller) that corrupted data during transfer to disk, as another source of data destruction.  And cosmic ray strikes on CPUs and RAM are yet another source of corruption that cannot really be debugged.  The only way to _reduce_ SDC is redundancy features.

Unfortunately a lot of these "features" cost.  And sometimes they cost quite a bit because it is additional bits of RAM and wires (but most of it is due to it being "special purpose" and carrying that premium).  For the most part, I'm paranoid against errors because I've had so many of them in the past.  I end up md5summing/checksumming everything that could potentially be corrupted (compressed files have checksums/CRCs built in so I usually don't re-check those).  Yes it costs time, but there's no way around it for the machines I have.
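The point about compressed files is handy in practice: gzip stores a CRC-32 of the uncompressed data in the archive trailer, so `gzip -t` gives an integrity check for free (the temp file below is illustrative):

```shell
# gzip embeds a CRC-32 of the original data in the archive trailer,
# so testing the archive re-verifies the payload without a separate manifest.
f=$(mktemp)
echo "some payload worth keeping" | gzip > "$f"
gzip -t "$f" && echo "archive intact"
```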

If you did not find any I/O or memory bad spots, and a CPU check turns out good, then you'll have to resort to "better" hardware with reliability features.  Unfortunately x86 CPUs tend not to have these features and you'll have to resort to "real" server-class CPUs/computers that have some protection against random corruption due to radiation.

----------

## Akkara

I'm about to lose my mind.

The random errors continue.  Every few terabytes, there's a random bit-flip.  I decided to test this in more detail by repeatedly copying a full 750GB disk to another and comparing checksums.  It is repeatable, but random.
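A stress loop along these lines reproduces that test.  Here two scratch files stand in for the 750GB drives; on the real machine the source and destination would be whole block devices (e.g. /dev/sdX), which are placeholders I'm not naming from the thread:

```shell
# Two scratch files stand in for the source and destination disks
# (on the real machine these would be whole block devices).
src=$(mktemp); dst=$(mktemp)
dd if=/dev/urandom of="$src" bs=1M count=4 2>/dev/null

# Copy, then compare checksums of both sides; repeat to catch rare bit-flips.
for pass in 1 2 3; do
    dd if="$src" of="$dst" bs=1M conv=fsync 2>/dev/null
    if [ "$(md5sum < "$src")" = "$(md5sum < "$dst")" ]; then
        echo "pass $pass: ok"
    else
        echo "pass $pass: MISMATCH"
    fi
done
```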

Nothing in /var/log/messages.  Nothing at all indicating any issues.  No random crashes.  Memtest passes overnight.  Try different disks, same thing.

I'm suspecting it might be the SATA chip.  The SATA link itself has CRC protection - it should tell me if I'm getting data errors on the cable... no?

Asus P5KC mobo, Intel E6850 Core 2 Duo, Seagate Barracuda 7200.10 drives, not overclocking anything.

Anyone else seen random errors like this?  They are rare enough that they'd escape careful notice - which is the worst kind there can be - you think it's working but it isn't.  Completely unacceptable.  Is there an answer, short of 4x the $$ for server-class hardware?  Do I just have a bum controller or something?

----------

## HeissFuss

 *Quote:*   

> recently upgraded to 8GB RAM

 

As eccerr0r mentioned, you should check your RAM if you haven't already.  That's the most likely cause of the corruption, especially considering that you've recently added some.  The error would be somewhere in the free memory (where the filesystem buffers data before writes).  You can probably check that online with memtester http://pyropus.ca/software/memtester/ (also in portage), or boot the server to a LiveCD and use memtest86.

----------

## eccerr0r

If you're paranoid of errors the only thing you can do is get hardware with redundancy.  I'm not surprised at all that there are no errors detected by Linux - how could it detect them in the first place?  Most of the stuff going through our computers is _DATA_, not _CODE_.  Code corruption the computer can detect with a crash.  Data corruption, the computer has no idea what it's dealing with, so it will faithfully copy bad data as if nothing ever happened.

You should at least try other controllers and memories, downclocking the CPU below its rated speed, downclocking the FSB and RAM further (to DDR667 or lower), better cooling (CPU and NB), a better PSU, etc., etc., and see if it's any better.  Does it happen with less RAM in the system - did you only notice failures after the upgrade?  Which RAM tests have you run?  Do you suspect the motherboard too?

As time goes on and people complain more and more about them I suppose manufacturers will start adding server-class features to _help_ protect against these weird occurrences.  Even server class features are not infallible.

I've had these things happen before - if it really is critical I always md5sum everything just to make sure even if it costs extra time, after being bitten by bad RAM and bad disk controllers.  Recently my hardware has been behaving well but I'm not going through terabytes of data every day...

----------

## Akkara

Update:  After a hair-pulling night and day, I think I narrowed it down:

It happens whenever I'm reading from one disk and writing to another.

If I restrict activity to one disk at a time there's no problem.

Kernel is 2.6.22-gentoo-r8, 64-bit.

I'm starting to suspect a broken mobo SATA controller or possibly DMA, although the problem still occurs when one disk is on any of the four Intel ICH9 ports and the other disk is on the fifth JMicron eSATA port.  Maybe I'll try to scrounge up a PCI SATA card and test it with that next.

I probably should have returned this mobo a long time ago - when I first got it, the machine crashed hard whenever I'd access the disk and on-board network simultaneously.  (I was told it's a known problem with the driver for that NIC.)  Foolishly I opted for a PCI network card since that seemed to work.  Figures there's now a problem with simultaneous disk I/O.

[Edit: removed emotional ill-founded speculation]

Last edited by Akkara on Sat Oct 06, 2007 11:15 am; edited 2 times in total

----------

## eccerr0r

I don't know what power supply you're using, it can also cause strange problems to manifest when disks are used.  Just another thing to watch out for...

----------

## Akkara

It's a new power supply - Thermaltake Toughpower 650 (the machine itself pulls ~120W normally).

Can anyone else with an Asus P5KC mobo confirm whether copying one large disk to another (I used cp -a) results in a correct copy?  I'm trying to isolate whether something's bad with the hardware here, or whether it might be a driver issue (apparently, the ICH9 drivers are rather new as well).
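For anyone repeating the experiment, one way to confirm a `cp -a` tree copy bit-for-bit is a recursive diff afterwards.  The temp dirs below stand in for the two mounted disks; the names are illustrative:

```shell
# Temp dirs stand in for the two mounted disks (paths illustrative).
src=$(mktemp -d); dst=$(mktemp -d)
mkdir "$src/photos"
echo "data" > "$src/photos/img.raw"

cp -a "$src/." "$dst/"            # archive-mode copy, like the disk-to-disk run
diff -r "$src" "$dst" && echo "copies identical"
```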

 :Sad: 

----------

## Akkara

Update:

 *Quote:*   

> [try memtester (http://pyropus.ca/software/memtester/)  [...] it's in portage]

 

Thank you VERY much for this suggestion.

It found something.

Even at DDR667, there's a marginal location in one stick - found relatively quickly - that several long runs of memtest86+ 1.7 never caught.

Fortunately it's still within the return/exchange window, so I'll know soon whether that fixes it conclusively.

Thanks again!

----------

## eccerr0r

What test did it fail on in memtester?

I found that the memtest86+ tests don't do much more than the BIOS test, until test 5 - the block move - which has found pretty much all the errors I've ever seen.  I'm tempted to hack a version of memtest86+ to not bother with the first four tests or at least move them to the end of the cycle...

----------

## Akkara

 *Quote:*   

> What test did it fail on in memtester?

 

It varied.  Sometimes it would fail just a few bits into the 'Stuck Address' test, sometimes on the 'Random Value', and occasionally not until one of the 'Compare' op tests.  Also the failing address was different with each run (although that could be an artifact of getting different virt->phys mappings).

Strange problem though!  I wonder why copies within the same disk worked when in fact the memory was bad.

I got the RAM exchanged and so far the new memory has passed ~4 hours of memtester.  I'll let that run overnight and then try a large disk copy at next opportunity and report back how that turns out.  Meanwhile I'll mark this solved for now.

Thanks again!!  <3 these forums  :Smile: 

===

Summary for people scrolling to the bottom post for the answer:

- Intermittent problem, observed only when copying large amounts of data from one drive to another

- The usual 1st line of 'does it work' tests, including memtest86+-1.7, didn't show anything bad

- Turned out it was the RAM anyway

- 'memtester' (in portage) seems to find subtle memory issues faster and better.

----------

