# software raid, corruption, not problem w/ media

## exodist

first off, some specs:

- Pentium 4 3.0 GHz, HT disabled
- 1 GB RAM
- VIA onboard IDE w/ a 320 GB Maxtor HD
- VIA onboard SATA w/ one 250 GB Western Digital HD
- first Silicon Image SATA controller: 4 Western Digital 250 GB drives
- second Silicon Image SATA controller: 2 Seagate 300 GB HDs
- Gentoo, up to date

What I have tried and the problem: I will list what I have done, the alternatives I tried to fix it, and the problem. Extra debug info follows:

I create RAID arrays; I have tried levels 1 and 5. Creating the array and making the filesystem both work fine up to this point (tried both XFS and ReiserFS; the same problem occurs on both). I have tried several combinations of drives and controllers; all controllers/drives show the same problem. I have also tried both the 2.6.17 and 2.6.18 kernels, no love.
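The create-and-format sequence described above would look roughly like this. This is a sketch only; the device names, chunk parameters, and mount point are hypothetical, not taken from the post:

```shell
# Hypothetical member devices -- substitute your actual drives/partitions.
# Create a 4-disk RAID-5 array (use --level=1 with 2 devices for RAID-1):
mdadm --create /dev/md8 --level=5 --raid-devices=4 \
    /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

# Wait for the initial resync to finish before making the filesystem:
watch cat /proc/mdstat

# Then create the filesystem (XFS here; mkreiserfs /dev/md8 for ReiserFS):
mkfs.xfs /dev/md8
mount /dev/md8 /mnt/raid
```

These commands require real block devices and root, so they are shown for orientation rather than as something to paste verbatim.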

Basically I create the RAID, wait for it to resync, then create the filesystem (XFS or ReiserFS), then copy a lot of files to it. I then try to delete some stuff, read some stuff, modify the drive in some way. Operations fail or even segfault, requiring a reboot, and dmesg shows repeated attempts to access way beyond the end of the device, at seemingly arbitrary block numbers that are often 10+ digits long. I run the filesystem's scanning/repair utility and it finds tons of files with incorrect sizes/lengths. Usually that is it, but a few times on ReiserFS it has found corruption requiring the tree to be rebuilt.

If I repair the fs and then scan it, it is clean. I mount it, then unmount it and scan again: still clean. I try to make changes on the drive and once again get errors; a scan says repair is needed, same drill as before, everything gets fixed, and the scan says clean again.
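The repair-then-rescan cycle described here corresponds roughly to the following (a sketch; `/dev/md8` is taken from the dmesg output quoted later, but the exact flags used are not stated in the post):

```shell
# ReiserFS: check first, then repair on the unmounted device
reiserfsck --check /dev/md8
reiserfsck --fix-fixable /dev/md8   # or --rebuild-tree when the check demands it

# XFS equivalent (xfs_repair both checks and repairs):
xfs_repair /dev/md8
```

Both tools must be run with the filesystem unmounted, which matters for the mount/unmount/rescan sequence described above.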

Once again: I have tried different filesystems, different kernels, and different RAID builders (both mdadm and raidtools).

Filesystems on non-RAID partitions (including all the partitions used in the RAIDs) show no corruption after much use.

As far as I can narrow it down, the problem is in the RAID layer, not the filesystem or the devices.

Added:

Forgot to mention: simply copying data from the drive also seems to ?cause? corruption. Like I said, I can repair with the fs repair tool; after that a scan says it is good, I mount and unmount it and a scan still says good, but then I start copying files from it, and after the first few I start to get read errors and dmesg gets the "access beyond end of device" errors. Then a scan finds the corruption... I have not tried mounting the fs read-only; when this rebuild-tree is done I will try that (a 700 GB RAID takes a long time to repair).
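Mounting read-only, and setting the md device itself read-only as described later in the thread, would look something like this sketch (device and mount point hypothetical):

```shell
# Stop writes at both layers:
mdadm --readonly /dev/md8         # mark the md array itself read-only
mount -o ro /devny/md8 /mnt/raid  # hypothetical mount point
```

Correction to the sketch above: the mount line should of course read `mount -o ro /dev/md8 /mnt/raid`. Setting the array read-only at the md layer rules out any writeback from the filesystem during the test.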

some extra stuff:

Kernel messages when the errors occur (this is after repairing, scanning and finding it clean, then setting the RAID read-only and mounting it read-only, then trying to copy data off of it):

```
md8: rw=0, want=4261343920, limit=1465175424
attempt to access beyond end of device
md8: rw=0, want=4261343920, limit=1465175424
Buffer I/O error on device md8, logical block 532667989
attempt to access beyond end of device
md8: rw=0, want=4261343920, limit=1465175424
Buffer I/O error on device md8, logical block 532667989
attempt to access beyond end of device
md8: rw=0, want=17166515696, limit=1465175424
attempt to access beyond end of device
md8: rw=0, want=17166515696, limit=1465175424
Buffer I/O error on device md8, logical block 2145814461
attempt to access beyond end of device
md8: rw=0, want=17166515696, limit=1465175424
Buffer I/O error on device md8, logical block 2145814461
attempt to access beyond end of device
md8: rw=0, want=18446744068902090496, limit=1465175424
attempt to access beyond end of device
md8: rw=0, want=18446744068902090496, limit=1465175424
Buffer I/O error on device md8, logical block 18446744073108618975
attempt to access beyond end of device
md8: rw=0, want=18446744068902090496, limit=1465175424
Buffer I/O error on device md8, logical block 18446744073108618975
attempt to access beyond end of device
md8: rw=0, want=4597359648, limit=1465175424
attempt to access beyond end of device
md8: rw=0, want=4597359648, limit=1465175424
Buffer I/O error on device md8, logical block 574669955
attempt to access beyond end of device
md8: rw=0, want=4597359648, limit=1465175424
Buffer I/O error on device md8, logical block 574669955
attempt to access beyond end of device
md8: rw=0, want=18446744070477053544, limit=1465175424
attempt to access beyond end of device
md8: rw=0, want=18446744070477053544, limit=1465175424
Buffer I/O error on device md8, logical block 18446744073305489356
attempt to access beyond end of device
md8: rw=0, want=18446744070477053544, limit=1465175424
Buffer I/O error on device md8, logical block 18446744073305489356
```
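Worth noting (my aside, not from the original post): the huge `want` values in that log are small *negative* offsets printed as unsigned 64-bit integers, which points at garbage block pointers (random bit patterns) rather than a filesystem that merely grew past the device. A quick check, using `python3` only for exact 64-bit arithmetic:

```shell
# 18446744068902090496 reinterpreted as a signed 64-bit value:
python3 -c 'print(18446744068902090496 - 2**64)'
# prints -4807461120
```

A request for sector -4807461120 can only come from a corrupted on-disk or in-memory block number, which fits the flaky-DMA/RAM theory discussed below in the thread.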

Here is the rsync output (paths blotted out) from when the errors occur:

```
Luxor hd1 # rsync -aqP /Blotted/Out/Path/U* ./
rsync: read errors mapping "/Blotted/Out/file1.xxx": Input/output error (5)
rsync: read errors mapping "/Blotted/Out/file2.xxx": Input/output error (5)
rsync: read errors mapping "/Blotted/Out/file3.xxx": Input/output error (5)
rsync: read errors mapping "/Blotted/Out/file4.xxx": Input/output error (5)
```

----------

## BradN

Sounds like a bug in one of your controller cards or its driver.  I have a 5-disk RAID array across 3 controllers (onboard IDE, onboard SATA, add-in Promise IDE card), and what sometimes happens, mostly during high-bandwidth transfers, is that a DMA transfer locks up on one of the Promise channels; Linux then sits and waits about 10 seconds before resetting it, and then everything's fine again.  Interestingly enough, this only happens when it's simultaneously using the serial ATA ports.  I never saw any data corruption, but it wouldn't be a big stretch to imagine it happening with buggy/incompatible hardware or drivers.  Software RAID itself is pretty stable.

Also, I should mention this is with a VIA k8t800 chipset for the onboard stuff, so maybe there's bugs in some VIA chipsets.

----------

## exodist

this happens on 2 different RAID arrays. The first one is 4 drives of the same model, all on a single Silicon Image SATA controller.

the other is 4 drives of different models: 1 IDE, 2 SATA on a second Silicon Image controller, and the last one on the VIA onboard. Both arrays have the same problem.

also, the drives have performed for months in the same setup, just not RAIDed, with no corruption: same install, same system, same controllers, etc.

----------

## BradN

Weird... have you tried updating the kernel version?

----------

## exodist

 *exodist wrote:*   

> (tried both XFS and ReiserFS; the same problem occurs on both). I have tried several combinations of drives and controllers; all controllers/drives show the same problem. I have also tried both the 2.6.17 and 2.6.18 kernels, no love.
> 
> 

 

----------

## BradN

Weird.  I've used gentoo sources up to 2.6.17-r4, and haven't had any problems with raid-5 over three controllers (just DMA transfers lock up occasionally, but it's just a nuisance really).  You should do some tests with simultaneous I/O (to the raw devices, not the raid array) - like, run a badblocks on each drive and see if it screws up just reading data.  It only takes like a minute for the DMA issue to pop up here.
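The test suggested above, as a sketch (a simultaneous read-only `badblocks` pass on each raw member drive; device names hypothetical):

```shell
# Read-only scan of every member disk at once, to stress the controllers'
# DMA paths the way parallel RAID I/O would. Substitute your actual devices.
for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    badblocks -sv "$dev" > "badblocks-${dev##*/}.log" 2>&1 &
done
wait
```

The default `badblocks` mode is non-destructive (read-only), so this is safe to run on disks holding data, though it takes hours on large drives.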

----------

## richfish

 *exodist wrote:*   

> this happens on 2 different RAID arrays. The first one is 4 drives of the same model, all on a single Silicon Image SATA controller.
> 
> the other is 4 drives of different models: 1 IDE, 2 SATA on a second Silicon Image controller, and the last one on the VIA onboard. Both arrays have the same problem.
> 
> also, the drives have performed for months in the same setup, just not RAIDed, with no corruption: same install, same system, same controllers, etc.

 

So the silicon image controller is common to both raid arrays?  What happens if you build an array without using that controller?

Another thing to try is drop your memory timings down a notch or two.  It's easy to do, just go into the BIOS and add 1 to each of the CAS, RAS, etc settings.  Memory timings that are too tight can easily show up as semi-random DMA IO corruption.
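Besides loosening the BIOS timings, marginal RAM can also be exercised from userspace with `memtester` (my suggestion, not mentioned in the thread; booting memtest86+ is the more thorough option since it tests memory the OS is using):

```shell
# Lock and pattern-test 256 MB of RAM for one pass (needs root to mlock):
memtester 256M 1
```

This won't catch errors confined to DMA transfers, but a failure here would point squarely at RAM or timings rather than the controllers.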

-Richard

----------

## exodist

there are 2 Silicon Image controllers; one is a few months old, the other is brand new.  I have to mess w/ some stuff before I can try it without any Silicon Image controllers.

----------

## BradN

One of the nice advantages of software raid... you can switch controllers if there's problems  :Smile: 
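That portability works because the md superblocks live on the member disks themselves, so after moving drives to different controllers (and thus different device nodes) reassembly is roughly this sketch:

```shell
# Scan all block devices for md superblocks and assemble any arrays found:
mdadm --assemble --scan

# or name the members explicitly at their new device nodes (hypothetical):
mdadm --assemble /dev/md8 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1
```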

----------

## exodist

yeah, unfortunately at this point my data is probably not recoverable; my backups only cover 75% of my data... my mp3s seem to be playing fine, but I have too many to check (all legal). But my videos (also all legal, no MPAA movies) show corruption when they play (blocky, colors blur on occasion).

This is with the filesystem mounted read-only and the md device set read-only.

----------

## BradN

Damn, that sucks.

----------

## exodist

hmm, now I think there is definitely something to your DMA/speed theory.  When I tested playing the videos and they showed corruption, I had 4 rsyncs going, copying data from the RAID to some non-RAIDed drives. About an hour into it (after I had left it alone) there was a kernel panic from the RAID drivers and I needed to restart.  I once again verified the filesystem with a check, then made the RAID device read-only, then mounted it.  I tried playing the videos with nothing else using the drive, and there is no longer any apparent corruption...

I am still copying everything to non-RAIDed drives, but with only one rsync at a time. It does give me I/O errors as mentioned above, but the files seem to copy fine regardless.

I suspect it has to do with the mainboard or CPU; I have experienced some other minor flaky behavior on it before. Primarily, the clock seems to run slow: after a day it is 15 seconds behind, and after a few days it is off by a minute or 2.
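For reference (my own aside, not from the thread), 15 seconds lost per day works out to a sizeable frequency error:

```shell
# 15 seconds per day, expressed in parts per million:
python3 -c 'print(round(15 / 86400 * 1e6))'
# prints 174
```

About 174 ppm is still within the roughly 500 ppm that ntpd can discipline, so NTP could mask the symptom, but drift that large is often itself a sign of a marginal board.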

I seem to remember having some issues with it a long while back, but until recently it had been a MythTV box; that changed largely because the clock problems screwed up recordings and scheduling.

----------

## BradN

My board has a wacky clock generator chip... it's supposed to let you set the FSB to like 250, but the chip physically won't generate a signal above 214MHz, despite what the datasheet says.  I actually got a guy working on a linux FSB clock setting utility to add support for that chip, and it confirmed why my BIOS would lock up and revert to old settings for clock speeds above 214 - the chip maxed out at 214 and the BIOS waited for it to reach the desired speed and never got it.

----------

## exodist

well, I put a different board in it; I now have 8x 250 GB partitions in a RAID-5 w/ no problems.  1.6 TB formatted w/ XFS  :Very Happy: 

----------

