# [solved] corrupt files

## dtjohnst

I have a strange problem. I currently have Gentoo installed on /dev/sdc. I want to install a new copy on a RAID1. So I emerged mdadm, partitioned sda and sdd (I used type fd), then created my arrays and filesystems. I then mounted them and downloaded the stage3. When I went to untar it, I received a bzip2 error about an incomplete file. So I redownloaded it and checked the md5sum against the digest and it failed. So I redownloaded from a different mirror, same problem.

Step 2: I downloaded the files onto a USB key on another machine to try and isolate the problem. The files passed the md5sum check. So I popped the USB key into my server (the one I'm trying to install on) and cp'd the files over. During the untar, I got the bzip2 error again. So I checked the md5sum on the USB key, and suddenly it failed as well. So I thought I must have screwed up somewhere. So I put the USB key back in the desktop, redownloaded the files, and verified the md5. Came back good. Moved the USB key to the server, checked the md5, passed. Copied them to my new md root, checked the md5, failed. Checked the md5 on the USB key, failed.
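For reference, the checksum comparison described above can be sketched in a few lines of Python. This is only an illustration of what `md5sum` plus a manual comparison against the digest does; the function names `file_md5` and `matches_digest` are made up for this example and are not part of any tool mentioned in the thread.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files (e.g. a stage3 tarball)
        # never have to fit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare the computed digest with a published one, ignoring case and whitespace."""
    return file_md5(path) == expected_hex.strip().lower()
```

Running `matches_digest` on the same file before and after the copy, on both machines, would show at which step the contents appear to change.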

So somehow, copying the file corrupts it on the original source, which doesn't make sense to me. I've tried this with my root filesystem as ext3, ext4 and xfs with the same result. Is it possible there's something wrong with mdadm? Should I re-emerge it? Or is there something else I'm missing?

Last edited by dtjohnst on Thu Dec 03, 2009 1:23 am; edited 1 time in total

----------

## NeddySeagoon

dtjohnst,

I suspect a hardware error.  Try memtest from the liveCD for a few hours.

If memtest reports errors, it does not always mean it's a RAM issue.  The tests use your CPU, RAM and motherboard.

A hardware error could cause unexpected writes.

----------

## dtjohnst

I had a previous problem and ran memtest for about 4 hours with no errors reported. It turned out to be a power problem, which I think is why memtest didn't report any errors. I was only about 10W shy.

----------

## NeddySeagoon

dtjohnst,

memtest does not thrash your PSU, and I presume this error is not related.

Run memtest again.

----------

## LesCoke

I second the RAM.  I had a system that would run memtest86(+) for many days without failure, but as soon as I started manipulating files larger than about 20 MB, errors would occur (I detected the file differences using md5sum).  To this day, I keep md5 / sha1 hashes of all my large archived media files because of that experience.  The problem would only occur when the RAM was being hit using DMA.  I finally swapped out the memory sticks one at a time to identify which one it was (fortunately I had identical spares).
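Keeping hashes of archived files can be as simple as recording one digest per file and comparing the record later. The sketch below is an assumption about how that might look, not the actual setup described above; `sha1_of`, `build_manifest` and `changed_files` are hypothetical names.

```python
import hashlib
import os


def sha1_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of one file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to `root`) to its SHA-1 digest."""
    manifest = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            manifest[os.path.relpath(full, root)] = sha1_of(full)
    return manifest


def changed_files(old, new):
    """Return sorted paths from `old` whose digest differs or is missing in `new`."""
    return sorted(path for path, digest in old.items() if new.get(path) != digest)
```

Building a manifest when the files are archived and comparing it with a fresh one later points at exactly which files no longer match.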

I wouldn't rule out the hard drive either.  Use smartctl (smartmontools) to verify the drive(s) are not creating pending / bad sectors.  I always burn in a new drive by writing and verifying several patterns before putting it into service.  (shred will simply write patterns but does not verify them; I use a custom script that uses badblocks to write and verify.)
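The custom script is not shown here, so the following is only a rough sketch of the write-then-verify idea, and it is not a substitute for badblocks. It assumes a scratch file on the filesystem under test rather than a raw device, and `write_and_verify` is a hypothetical name. Note that unless the scratch file is larger than RAM, the read-back may be served from the page cache instead of the disk.

```python
import os


def write_and_verify(path, pattern, total_bytes, block_size=1024 * 1024):
    """Fill `path` with a repeating byte pattern, read it back, and
    return the offsets of any blocks that do not match."""
    # Build one block of the repeating pattern (pattern must be non-empty bytes).
    block = (pattern * (block_size // len(pattern) + 1))[:block_size]

    with open(path, "wb") as handle:
        written = 0
        while written < total_bytes:
            piece = block[: min(block_size, total_bytes - written)]
            handle.write(piece)
            written += len(piece)
        # Ask the OS to push the data out to the device before reading back.
        handle.flush()
        os.fsync(handle.fileno())

    bad_offsets = []
    with open(path, "rb") as handle:
        offset = 0
        while offset < total_bytes:
            expected = block[: min(block_size, total_bytes - offset)]
            actual = handle.read(len(expected))
            if actual != expected:
                bad_offsets.append(offset)
            offset += len(expected)
    return bad_offsets
```

Calling it several times with different patterns (for example `b"\xaa"`, `b"\x55"`, `b"\xff"`, `b"\x00"`) and checking that each call returns an empty list mirrors the "several patterns" burn-in described above.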

Your problem, if it hasn't been isolated to the disk subsystem, could be network card related.  I had an old SoHo Macronix card that would suddenly drop connections every time I used scp to transfer a large file, but it would stay up with a plain ssh terminal session.  Swapped the card with another and the problem went away.

Easiest way to troubleshoot such a problem is to find a repeatable test case that generates the errors, and swap hardware one thing at a time until you find the problem component.

It is strange that you only noticed the problem once the RAID array was established.  It is also strange that your USB key / thumb drive would get corrupted during the copy.  You didn't say whether you rechecked the md5sum on the other machine after seeing the failure on the server and before downloading a new copy.  With buffered filesystems and plenty of RAM, a file will not necessarily be re-read from the actual disk; the copy still buffered in RAM will be used to satisfy repeated reads.  This brings me back to a RAM problem.  Memtest keeps the RAM busy during its tests, but a running but idle system will have most of the code and data needed by the idle process in cache, leaving the RAM largely idle.  This is when weak RAM bits will rear their heads.

Les

Last edited by LesCoke on Fri Nov 27, 2009 10:05 am; edited 1 time in total

----------

## qubix

Try putting a file at least twice as large as the amount of RAM in your box on the drive you suspect of having problems, or even on each one of them. Then run md5sum on the same file several times, one after another. If you get different results on each run with no complaints in dmesg, that means your mainboard might be toast.

It's important for your file to be bigger than RAM. If it's smaller, the kernel will cache it and every run of md5sum will read from memory rather than from the drive.
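A minimal sketch of that repeated-hash test, assuming Python is available on the box; `repeated_digests` and `is_stable` are hypothetical names, and the caveat about file size still applies: the file has to be larger than RAM for each run to actually hit the drive.

```python
import hashlib


def repeated_digests(path, runs=5, chunk_size=1024 * 1024):
    """Hash the same file `runs` times and return the list of hex MD5 digests."""
    digests = []
    for _ in range(runs):
        digest = hashlib.md5()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        digests.append(digest.hexdigest())
    return digests


def is_stable(digests):
    """True when every run produced the same digest."""
    return len(set(digests)) == 1
```

If `is_stable` comes back false for an unchanged file, the data is being altered somewhere between the drive and the CPU, which is the symptom described above.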

For confirmation, do the same on the same machine but use USB storage.

Do you have Fujitsu servers? I've had that problem often there... each box was replaced under warranty.

Btw, did you check dmesg for strange messages?

----------

## dtjohnst

Sorry it took me so long to reply. Things have been a little hectic.

As you said, NeddySeagoon, the PSU shouldn't be affected by a memtest, which is why I assumed that if it passed memtest then, it would pass memtest now. After all, I only had power problems if my PC tried to access 2 drives at the same time with a USB stick, keyboard and mouse plugged in (I imagine I was only a few W shy). However, when I reran the memtest, it did fail this time. Last time I ran it for 4 hours without an issue. This time my PC reset after 2. I relaxed the timings a bit and then it failed after about 20 mins. I relaxed the timings a bit more and it failed after about 3 mins. At that point the system failed to POST. I tried several timings including BIOS defaults to no effect. I tried both sticks of RAM with the same result. So I ordered RAM from a different manufacturer and it works fine now. POSTs fine, memtest ran for 8 hours without an error (I left it overnight) and without rebooting or crashing, and my files are no longer reported as corrupt. Thanks for your help.

----------

