# LVM/RAID leads to crashes.

## shepmaster

Hey all:

After being burned by a drive going south, I decided to get myself situated with a better drive solution. I currently have 3 drives (250/250/120) that are round-robined in RAID1: 60 GB from each 250 is mirrored onto the 120, and the remainders of the 250s mirror each other. Each of these RAID partitions is then a physical volume in an LVM volume group, so I can resize and shift as I like. I am using ext3 as the FS. Each physical drive is on its own IDE bus (hda/hdc/hde).

My problem comes when I try to access some big files, such as my MythTV recordings. Accessing them on the box itself seems to work OK; at least skipping through one with VLC didn't crash. However, when I use NFS to stream to my Mac, or FTP to copy a file over, something bad happens.

Sometimes it will lock up (keyboard lights function, no input allowed, no ssh connectivity), sometimes I get the following:

```
attempt to access beyond end of device
dm-0: rw=0, want=27230863656, limit=209715200
```
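If I'm reading those numbers right (both `want` and `limit` are counts of 512-byte sectors), the limit works out to exactly 100 GiB, but the requested sector is wildly past it:

```shell
# want/limit in the dm error are 512-byte sectors
echo $(( 209715200   * 512 / 1024 / 1024 / 1024 ))   # limit -> 100 (GiB)
echo $(( 27230863656 * 512 / 1024 / 1024 / 1024 ))   # want  -> 12984 (GiB!)
```

So something is asking for a block nearly 13 TB into a 100 GB device.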

I've also seen "INIT PANI: Segmentation Violation" (that last one is from memory, so it may differ slightly).

Please, someone help me!!!

I'll be happy to post whatever is needed.

-shep

----------

## shepmaster

I just tried to SCP one of the files to the Mac; it gets 156 MB in and then stops. /var/log/kernel reports the same message as above. I have unmounted the partition and run fsck -fy, but no errors are found.

I just don't get it...

----------

## shepmaster

So, playing around with it some more, I can actually copy a file from the LVM/RAID to another mounted partition, apparently without trouble. However, when I try to SCP that file out to the Mac, I get a hard lock   :Shocked:  .

I do not see anything in the Mac's logs, and obviously I can't see anything on the Gentoo machine, either.

Any thoughts  :Question:  ?

----------

## shepmaster

I can also play the files just fine using mplayer. I played a few different ones, including ones that cause the computer to lock up when being transferred.

So it looks like a network-type issue. Any help would be great!

----------

## Moriah

I have been running LVM2 under 2.6.* kernels for nearly 2 years.  Lately (since around April 2006?) I occasionally see crashes -- actually, the load goes thru the roof and nothing new can be launched, though running tasks keep running -- during nightly backup runs.  The primary reason for using LVM is to support snapshots that freeze the filesystem during these backups.  Some machines are running LVM2 on top of RAID-1 or RAID-5; others are just running on a single drive.

This never used to happen, but LVM2 or *SOMETHING* must have been updated in the past few months, and I suspect that the high volume of disk or network activity during backups while a snapshot is active might be the culprit.  BTW I use rsync to move the backup sets over the network, tunnelled thru ssh.  The network is 100baseT thru a hub, and some nodes are behind an iptables firewall.  All RAID arrays are UDMA IDE based, with only 1 drive per IDE bus.  CPUs vary from a P-75 to an AMD-64 3000, with memory from 256 MB to 1.5 GB.  I have seen this problem about once every 1 or 2 weeks on at least 1 machine, with and without RAID, on any of these machines.  Drive capacities range from 10 GB to 250 GB, with RAID capacity from 250 GB RAID-1 to 500 GB RAID-5.
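In case it clarifies what the backups are doing, the nightly cycle is essentially this (names like vg0/home, mount points, and sizes are placeholders, not my exact setup):

```shell
# Take a snapshot so the filesystem state is frozen for the backup
lvcreate --snapshot --size 5G --name nightly-snap /dev/vg0/home

# Mount it read-only and push it over the network, tunnelled thru ssh
mount -o ro /dev/vg0/nightly-snap /mnt/snap
rsync -a -e ssh /mnt/snap/ backuphost:/backups/home/
umount /mnt/snap

# Drop the snapshot; its copy-on-write space is freed
lvremove -f /dev/vg0/nightly-snap
```

It's exactly while that snapshot exists, with rsync and ssh hammering disk and network, that the crashes tend to hit.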

Has anybody else seen anything like this -- besides the previous posts in this thread, which are similar but not quite the same as my symptoms?    :Question: 

----------

## bladekernelpanic

Hi all,

I have a similar problem on my system. I have been using LVM2 for a week, with 4 disks and one 500 GB logical partition formatted with reiserfs. When using aMule the system freezes. I moved a big amount of files (250 GB) onto my LVM partition without problems, but during normal usage of my machine (only downloads and ftp) the system freezes.

Can anyone help me?

----------

## Moriah

Wow!  It's been a year and a day since I posted my message above.  I am still having the same problem.  Apparently not enough people are seeing this to make it worth fixing.    :Mad: 

----------

## eccerr0r

 *shepmaster wrote:*   

> Hey all:
> 
> After being burned by a drive going south, I decided to get myself situated with a better drive solution. I currently have 3 drives (250/250/120) that are round-robined in RAID1: 60 GB from each 250 is mirrored onto the 120, and the remainders of the 250s mirror each other. Each of these RAID partitions is then a physical volume in an LVM volume group, so I can resize and shift as I like. I am using ext3 as the FS. Each physical drive is on its own IDE bus (hda/hdc/hde).

You have a weird configuration... but I suppose it should work.  Is this what you are doing (with partitions in cylinder order)?

```
hda (250G):  hda1 = 190G   hda2 = 60G
hdc (250G):  hdc1 = 190G   hdc2 = 60G
hde (120G):  hde1 = 60G    hde2 = 60G

md0 = hda1 + hdc1
md1 = hda2 + hde1
md2 = hdc2 + hde2
```

Are there any specific partitions that fail?  I'd be most suspicious of the md1 and md2 arrays, but if all other things are fine, md0 should have no problems.  Did you try setting up the logical volumes so that they spanned only one physical volume?
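If it helps, restricting an LV to a single PV is just a matter of naming the PV after the volume group in lvcreate (the VG name and sizes here are made up for illustration):

```shell
# Assuming md0/md1/md2 are all PVs in a volume group called vg0,
# listing a PV at the end restricts allocation to that device only.
lvcreate -L 100G -n video vg0 /dev/md0   # extents come only from md0
lvcreate -L 40G  -n misc  vg0 /dev/md1   # extents come only from md1
```

That way, if only one of the odd mirrors (md1/md2) is at fault, the failures should follow the LVs that live on it.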

 *shepmaster wrote:*   

> Sometimes it will lock up (keyboard lights function, no input allowed, no ssh connectivity), sometimes I get the following:
> 
> attempt to access beyond end of device
> 
> dm-0: rw=0, want=27230863656, limit=209715200
> ...

 

The error seems to indicate that you have some corruption on the disks, or some bug in lvm/raid.  Have you upgraded to the newest kernels?  Also, did you try fscking your disks to see what else is screwed up?  Is your computer's RAM/CPU/MB in good shape?

I have not run into these issues with RAID, probably because my 4x120G disks are nearly identical to each other -- I'm also running LVM2 over RAID-5.  It's been pretty stable for me.

Sometimes I wonder about what I did with my RAID-5 just to make life easier with booting (no initrd needed)...  now I wish I had simply made a huge lvm with huge partitions, just to make things simpler...  Then again I don't snapshot my LVs, I just tar them...

----------

## pdr

I've been running ext3->lvm2->raid-5 (and some ext3->cryptsetup/LUKS->lvm2->raid-5), with / as ext3->raid-1 and /boot as ext2->raid-1, 24/7 for a year (with some reboots after kernel updates) with no problems. I'd check for a network problem -- I remember seeing some posts about people having trouble with long transfers.

Edit: btw, that's on the amd64 profile; I have raid-1 on the x86 profile on the workstation, but that has only been running for a few months. I did occasionally have a disk not show up and had to reboot and re-assemble, but that was a loose sata cable.

----------

## pharoh

This is your problem: *Quote:*   

> attempt to access beyond end of device
> 
> dm-0: rw=0, want=27230863656, limit=209715200 

You have a bad partition setup or a bad drive/drive firmware.  We deploy 4x250G raid-5s with lvm on top and never see this.
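A quick sanity check is to compare the sizes each layer reports -- the sector counts should be consistent from the partitions up through md and dm (device names here are examples):

```shell
cat /proc/partitions          # kernel's size for every block device (1K blocks)
blockdev --getsz /dev/md0     # size of an md array in 512-byte sectors
pvs && lvs                    # LVM's view of physical and logical volumes
dmsetup table                 # start/length in sectors of each dm mapping
```

The `limit` in the error above should match the `dmsetup table` length for dm-0; if any layer disagrees, that's where the bad setup is.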

----------

## Moriah

I am seeing it *OCCASIONALLY*, but mainly during networked rsync backups of lvm-snapshotted volumes.  I am using reiserfs on top of lvm, and that on top of either a simple disk partition, a raid-1 of 2 or 3 disks, or a raid-5 of 3 disks.  In all the raid cases, the drives are 250 GB with a partition that nearly fills the drive, raid on that, then lvm on that, then reiserfs on that.  I am seeing it on x86, amd duron, amd athlon, and amd64.  I have lived with this problem for several years (at least 2 -- I don't exactly remember; it's been like living with chronic bronchitis: you sort of get used to it after a while).  It is only a *BIG* problem if it takes down a server; otherwise, a simple reboot of a workstation gets it going again.  I see "kernel panic -- not syncing" messages on the console of the affected machine after it crashes.

----------

## rich0

I just ran into similar problems.  I have two raid-5 arrays using software raid, in a single volume group.  

I ran an fsck the other day after having some system crashes (which it turns out were caused by a bad sata drive, which I replaced/rebuilt).  I got tons of errors, but they were on a less-than-super-essential filesystem, so I let e2fsck fix them.  Suddenly I started getting all kinds of problems on numerous partitions.  Apparently the fsck on one lvm partition was somehow modifying data on other partitions.

At this point the system won't boot except in emergency mode.  I can mount some of the more essential partitions, although the system will completely freeze if I run certain programs (probably half the stuff on the disk).  I have backups of the most essential data, so I'll try to rescue whatever I can over the network and start over.

I suspect that there is some kind of lvm bug at work.  I saw other posts about the attempts to access beyond the end of a device in other threads.  I'm running the latest stable amd64 lvm and kernel (don't have the details handy - no network access at home with the system down...).  The thing that concerns me is getting everything running only to have it fail again...

----------

## rich0

I'm pretty sure this is some kind of lvm bug.  It might be triggered when an underlying raid is in recovery mode, based on some other mentions of this issue online (I did have a hard drive die recently -- and while it was dying the computer froze up frequently, which probably didn't help).  It seems to have been reported by others with both ext3 and xfs.

My non-lvm raids for boot and root are unscathed.  At this point I'm seriously considering just ditching lvm and putting an ext3 partition directly on each of my two large raid-5s.  My case won't really hold much more in the way of drives anyway, and I can still resize the ext3 and the raids if the need arises.  Sure, pvmove is nice to have, but not if I have to try to rescue my data every year or two.

What I'm really looking forward to is reliable zfs on linux (non-fuse - or at least fuse with mainstream support for booting/livecds/etc) with reshaping capability.  That would be the best of all worlds.  The copy-on-write and checksumming should really help protect data when the system goes down with dirty filesystems.

Fortunately it looks like most of the non-active files on the lvm partitions were untouched, so I can restore quite a bit of the low-value data that wasn't backed up (mythtv/etc).

----------

## Moriah

It is not confined to RAIDs.  I see the problem occasionally on simple IDE drives with LVM too.  It almost always occurs when a snapshot is active during rsync network backups.  That causes heavy disk and network activity, plus ssh crypto for the tunnel, so everything is working pretty hard.

----------

