# [still unresolved] Mysterious massive file

## chrismortimore

I have a 100GB /home partition, and noticed that there was a 16TB file on it (it is actually meant to be a 2MB photo of Arthur's Seat, I digress).  When I deleted it, my /home partition remounted read-only in that kind of "I just had a massive error" way.  Has anyone ever had this problem?  And any ideas what could have caused it?  Here are the relevant parts of /var/log/messages:

```
Aug  6 11:20:51 marbles EXT3-fs error (device dm-3): ext3_free_branches: Read failure, inode=8881856, block=97460224
Aug  6 11:20:51 marbles EXT3-fs error (device dm-3) in ext3_reserve_inode_write: Journal has aborted
Aug  6 11:20:51 marbles EXT3-fs error (device dm-3) in ext3_truncate: Journal has aborted
Aug  6 11:20:51 marbles EXT3-fs error (device dm-3) in ext3_reserve_inode_write: Journal has aborted
Aug  6 11:20:51 marbles EXT3-fs error (device dm-3) in ext3_orphan_del: Journal has aborted
Aug  6 11:20:51 marbles EXT3-fs error (device dm-3) in ext3_reserve_inode_write: Journal has aborted
Aug  6 11:20:51 marbles EXT3-fs error (device dm-3) in ext3_delete_inode: Journal has aborted
Aug  6 11:20:53 marbles ext3_abort called.
Aug  6 11:20:53 marbles EXT3-fs error (device dm-3): ext3_journal_start_sb: Detected aborted journal
Aug  6 11:26:11 marbles EXT3-fs warning (device dm-3): ext3_clear_journal_err: Filesystem error recorded from previous mount: IO failure
Aug  6 11:26:11 marbles EXT3-fs warning (device dm-3): ext3_clear_journal_err: Marking fs in need of filesystem check.
Aug  6 11:26:11 marbles EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
Aug  6 11:26:11 marbles EXT3 FS on dm-3, internal journal
Aug  6 11:26:11 marbles EXT3-fs: recovery complete.
Aug  6 11:26:11 marbles EXT3-fs: mounted filesystem with ordered data mode.
```

11:20:51 is when the file was deleted, 11:26:11 was when I unmounted and remounted /home by the way.

----------

## NeddySeagoon

chrismortimore,

Having a 16TB file on a 100GB partition may be a problem. If it were a 'sparse' file, it would be allowed, because only the space actually used is allocated.

However, images are unlikely to be sparse files, so the filesystem was already corrupt at that point. Deleting the file will have made it worse.
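One way to tell the two cases apart is to compare the size a file claims with the space actually allocated to it. Here is a minimal Python sketch, assuming a Linux system where `os.stat` reports `st_blocks` in 512-byte units. The helper names `sparse_report` and `looks_sparse` are made up for illustration:

```python
import os


def sparse_report(path):
    """Return (apparent_bytes, allocated_bytes) for one file."""
    st = os.stat(path)
    # On Linux, st_blocks counts 512-byte units whatever the fs block size is.
    return st.st_size, st.st_blocks * 512


def looks_sparse(apparent, allocated):
    """True when a file claims more bytes than are allocated to it."""
    return allocated < apparent
```

Calling `sparse_report()` on the suspect photo would show whether only the size field is huge (few blocks allocated) or whether the inode really claims terabytes of blocks. Either result on a 2MB image points at corruption rather than a legitimate sparse file.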

Make a file image of /home before you do anything else. Well, you can run 

```
fsck -n
```

so it won't repair anything, just tell you what it would like to do. However, the 'repair' can make things worse.

Does (device dm-3) hint at /home being on a fakeraid volume?

----------

## chrismortimore

OK, I'm finding hundreds of these things around this computer.  I have this horrible feeling that the hard drive is knackered...

----------

## NeddySeagoon

chrismortimore,

Reboot with the liveCD and run e2fsck -n on all your ext3 partitions but do not permit any repairs.

It sounds like the drive is OK but the filesystem has got trashed somehow.

----------

## chrismortimore

I did that, it threw up a load of errors about i_nodes having invalid sizes and such like.  Everything is backed up or not important, so I permitted the repairs and it was working fine for a while, then more massive files started appearing.  Like /etc/pam.d/imap was fine, then all of a sudden it grew to 17TB.  So I rebooted, ran e2fsck again, and it threw up errors again.  Any ideas on what could be happening?  All I can think of is either the drive is breaking in a very very strange way, or the ext3 driver in the kernel is really sick.  And I'm running the current stable amd64 kernel.

EDIT: That's a thought: up until yesterday it was running an x86 tree, and last night I did a reinstall to amd64, which is when the errors started coming up.

----------

## i92guboj

Just to rule out a hardware error, I would run smartctl (from smartmontools, also in portage) from a livecd.

Something like 

```
smartctl -H /dev/hda
```

Adapt it for your main drive and run it to see what happens. You might need to enable SMART on your drive first with "-s on". Sometimes this can help. If it does not report anything strange, then we can start to think about other possibilities.

----------

## chrismortimore

I run smartctl every day and check the results, and there is nothing abnormal.  But I've seen hard drives die even when their SMART data say they are in perfect health.

----------

## NeddySeagoon

chrismortimore,

You switched from a 32 bit to a 64 bit install, then it broke.

If you over clock - don't.

It may be PSU. 64 bit operation leads to higher ripple currents than 32 bit. 

Do you have some PSU margin ?

It can also be the Vcore PSU on the motherboard - capacitor failure.

You need to be handy with a soldering iron to fix that.

It's unlikely to be memory - how memory is used is unchanged.

----------

## chrismortimore

All the kit is brand new, with the exception of the hard drive, which had an existing x86 system on it.  I worked out that on a really bad day, at most I'd need a 400Watt, this is the PSU I'm using just now: http://www.ultraproducts.com/product_details.php?cPath=42&pPath=296&productID=296

I reckon it has enough headway.

I haven't overclocked this system, haven't felt the need to.

Whereabouts would this capacitor be if it had failed?  And is there a nice easy way to test it?  I don't really want to pull them all off and test them with the multimeter...  The soldering isn't a problem, I'm an Electronics and Electrical Engineering student and a part time guitar tech, I do muchos soldering  :Wink: 

EDIT:  My mobo (Asus A8V-X) has a VIA VT8251 Southbridge, which I hear still has slightly questionable Linux support.  Could this perhaps be the problem?

----------

## NeddySeagoon

chrismortimore,

There are about 10 capacitors clustered round the CPU.

Domed tops, rubber bungs being pushed out, or liquid contents leaking onto the motherboard are all bad signs.

If you see any of that, return the motherboard since it's new.

You can't test them with a multimeter. If they look OK, they probably are.

To test, you need to measure the ESR and loss angle.

----------

## chrismortimore

They all look ok..  If I put a capacitor into my multimeter, it tells me the capacitance of it, and I take it as read that if I get the right value from the meter, the capacitor works  :Wink: 

What I've done is I've backed up everything, unplugged everything except the system drive, and I'm gonna run the system and see if it keeps happening..  It has behaved for the last hour, so I don't know what is going on with it..

Thanks for your time and help  :Smile: 

----------

## NeddySeagoon

chrismortimore,

That tells you the capacitance under the conditions at which it was measured.

The actual capacitance varies with temperature, and because of the very high ripple current in the Vcore regulator, the caps do run hot, hence the importance of using low-ESR devices. That's partly why multiple devices are used, rather than one bigger one.

----------

## beatryder

This might sound crazy, but have you tried downgrading to an older kernel? I know that for a while I could not use any kernel newer than 2.6.12 as I had "strange" problems that were impossible to diagnose. Once 2.6.16 came out, things got better, now I am running 2.6.17 and it works fine, but before that I always ended up downgrading back to .12

----------

## chrismortimore

@NeddySeagoon: True, but for my budget, my lousy test is enough for me  :Wink: 

@beatryder: I considered that, but I've noticed in the ChangeLog that VT8251 support came in at kernel 2.6.17, likewise for my sound card.  I'll give 2.6.16 a go to see if it helps though, couldn't make it worse.  At least if it can't find the drives it can't mess them up  :Wink: 

----------

## chrismortimore

I think I've figured out the problem, although it makes absolutely no sense to me...  The computer is absolutely fine until I run "rsnapshot", and then loads of massive files appear around the filesystem.  I have no idea why...  I'll run the computer for a week without running rsnapshot and see what happens.

----------

## chrismortimore

After a couple of weeks of working fine, it's happened again.  I found two files that are 17EB each.  I wonder what on earth is causing it...  It's quite strange, they only seem to appear when I read heavily from the partition, and they always seem to be under /home.
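For hunting these down after each heavy read, something like the following Python sketch could walk a mount point and list every regular file that claims to be bigger than the filesystem it lives on. It assumes a Unix system with `os.statvfs`; the function names and the optional `capacity` override are invented for this example, and a genuine sparse file would be flagged too:

```python
import os
import stat


def fs_capacity(root):
    """Total size in bytes of the filesystem that holds `root`."""
    vfs = os.statvfs(root)
    return vfs.f_frsize * vfs.f_blocks


def _device_of(path):
    """Device number of `path`, or None if it cannot be examined."""
    try:
        return os.lstat(path).st_dev
    except OSError:
        return None


def oversized_files(root, capacity=None):
    """Yield (path, size) for regular files under `root` whose claimed
    size is larger than the whole filesystem."""
    if capacity is None:
        capacity = fs_capacity(root)
    root_dev = os.lstat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not cross into other mounted filesystems (like find -xdev).
        dirnames[:] = [d for d in dirnames
                       if _device_of(os.path.join(dirpath, d)) == root_dev]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # unreadable entry; skip it rather than abort
            if stat.S_ISREG(st.st_mode) and st.st_size > capacity:
                yield path, st.st_size
```

Running `for path, size in oversized_files("/home"): print(size, path)` right after a backup, and again after a quiet day, would give a rough record of when the bogus sizes appear.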

----------

## Janne Pikkarainen

Does your computer have any other strange symptoms? Crashes, lockups or segmentation faults under heavy compilation? Odd random program crashes telling you about Signal 11? If I were you, I would test the memory with memtest86 ... maybe your computer works ok until it gets enough memory filled with cache, and then same faulty part of RAM kicks in and messes everything up.

Well, just a guess anyway.

----------

## chrismortimore

 *Janne Pikkarainen wrote:*   

> Does your computer have any other strange symptoms? Crashes, lockups or segmentation faults under heavy compilation? Odd random program crashes telling you about Signal 11? If I were you, I would test the memory with memtest86 ... maybe your computer works ok until it gets enough memory filled with cache, and then same faulty part of RAM kicks in and messes everything up.
> 
> Well, just a guess anyway.

 Computer is rock solid, not a single crash.  The RAM and the hard drives were kept on from the machine this one replaced, but they have been working fine for at least a year with absolutely no problems either.  I'll run memtest86 later and see if that flags anything up, cheers for the suggestion.

Just now, the cache is full and it is running OK.  Literally, the only time it happens is when the hard drive is under heavy load, such as running backups or moving large amounts of files from one partition to another.  Could this be caused by buggy IDE controller code?  I also thought it could be the ribbon cables, but as these ones have been running fine in the old machine, I figure this is unlikely.

----------

## Gentree

Dont make too many assumptions like "it worked fine in the other machine so it must be OK on this board". As an electronics student you must appreciate that the two test beds can hardly be deemed identical.

I'm surprised that you have not run memtest86+ before now. That would have been my first hardware check. Always is the place to start.

Let it run at least an hour in the first instance (give it 24h later if you don't fix it). If that looks clean, try the cpuburn test suite. It lets you stress different parts like the northbridge, CPU and gfx system.

DO make sure you are safe on temperature, temp monitors and BIOS shutdown before doing that; it's not called cpuburn for nothing. Be careful and rigorous.

HTH   :Cool: 

----------

## chrismortimore

 *Gentree wrote:*   

> Dont make too many assumptions like "it worked fine in the other machine so it must be OK on this board". As an electronics student you must appreciate that the two test beds can hardly be deemed identical.

 That is true, but given that these exact components worked fine under x86 for a couple of days, and then started hurling errors the very night I switched over to amd64, I'm inclined to think a hardware problem is unlikely.

Anyway, I shall go about the stress tests.

Thank ye.

----------

## Gentree

That's a fairer comparison, but I guess you're switching to 64 bit for a performance increase, so you should already be assuming that you will be pushing the same hardware harder; that's the whole point.

Hopefully stress testing will show up something; otherwise it's back to looking for a 64-bit bug. At least you will have reduced the variables.

good luck.   :Cool: 

----------

## chrismortimore

I ran "burnK7 || echo $?" for an hour, and it returned 143, any ideas what that means?  The manual isn't very clear on that...

Core temperature maxed at 44C, system temperature maxed at 28C (room stayed a steady 25C) and voltages were stable and normal.  The core was 1C more than it hits when it did an "emerge -e world", so I'm not worried about that.

Gonna run "burnMMX" now and see what it gives, the manual claims it's quite memory intensive so I figure it can't do any harm.  After that, it's memtest86+ time (I'm wary of it because I like being able to watch my sensors for badness...  I'll just keep telling myself it'll be worth it  :Wink: )

Edit: I've actually noticed no real performance increase to be honest.  I switched to amd64 because I wanted to see what all the fuss was about  :Wink:   I also figured that more and more stuff will take advantage of the extra bits now that the processors have become common, so even if it isn't worth it now, it will be fairly shortly.

----------

## Gentree

I'd say memtest is your first stop really. It is not in the same category as the burn suite; it is more of a thorough test than a stress test. From your temps on burn* you should have no worries on memtest. (Reboot and check the BIOS after a couple of mins if you want to see it for yourself.)

If any burn* stops, it failed, they should loop indef. Dont know if theres much point in searching 143, it failed.

Again memtest86+ is where to look first. 

 :Cool: 

----------

## chrismortimore

 *Gentree wrote:*   

> If any burn* stops, it failed, they should loop indef. Dont know if theres much point in searching 143, it failed.

 Actually, I killed it after an hour because I was convinced by it  :Wink:   The manual said on errors it returns a 254 or 25x (I've forgotten what the last digit is..), depending on why it failed, so a 143 seems to just be a meaningless number to everyone except the dev.

burnMMX isn't even pushing the system, so it's memtest time.  Bet this'll take a while, the down side to 2G of ram..  :Sad: 

Cheers.

----------

## Gentree

LOL . you omitted to say you killed it. Yep I would say 1h is solid.

they probably figured if you killed it you would know why it stopped! In that case it's not an error so it did not produce the 25* error values.
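For what it's worth, 143 fits the usual shell convention: when a command is killed by a signal, `$?` reports 128 plus the signal number, and a plain `kill` sends SIGTERM, which is signal 15. A small Python sketch of that arithmetic follows (the function name is made up, and a status above 128 that does not map to a named signal is treated as an ordinary exit code):

```python
import signal


def describe_exit_status(status):
    """Interpret a shell exit status as reported by `$?`."""
    if status > 128:
        try:
            # Shells report 128 + N for a process ended by signal N.
            name = signal.Signals(status - 128).name
            return "likely killed by " + name
        except ValueError:
            pass  # no such signal, so it is a plain exit code
    return "exited with code %d" % status


print(describe_exit_status(143))  # likely killed by SIGTERM
print(describe_exit_status(254))  # exited with code 254
```

So the 254/25x values from the manual would still read as genuine burn* failures, while 143 just records the kill.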

burnMMX will have stressed (a part of) the system. It's true it does not produce as much CPU temp as some of the others; that is consistent, because its function is not to stress the CPU.

In practice, if you have a mem prob it will come up in most cases pretty quickly (15/20m), but to be rigorously tested you really should let it roll all night at some stage.

If you're still here you should try burnBX, that is one that I find trips out first when I'm pushing my o/c tests.

If it makes you happier that one will create some watts as well   :Wink: 

----------

## chrismortimore

I'll give burnBX a whirl if memtest fails to show anything.  35 minutes so far and at 59% *sigh*

At least I still have my laptop to play frozen-bubble on to kill time  :Wink: 

EDIT: memtest ran once and passed with no errors.  I'm outta ideas...

----------

## Gentree

OK , if you're happy that you have given stress tests a fair go come back to software (I suggest pushing ahead with burnBX and an overnight run on memtest86+ before skipping ahead) 

First, are you running RAID on those WDs? Can you move /home to another physical device and see if it still happens?

Second, you said the mobo support is very recent. You could try a different kernel; maybe some more work has been done, or at least some glitch may no longer be present.

I think no-sources is on 2.6.18 now, which may be worth a look.

Third, if it is some oddity affecting the fs, try a different fs: reiserfs just to be different (not a long term recommendation), or reiser4, which you have available if you opt for no-sources. This is very robust now; I've had several power-offs that have not even required an fsck. Worst loss is the last edit on a file.

Well, if you're out of ideas there's some. Hope it helps.

 :Cool: 

----------

## chrismortimore

No RAID anywhere, but every week (or every big change) I rsync the changes to the WDs to backup, and the backup drives are unplugged from the PSU when not in use, so the data is safe and I'm not too worried about losing anything off the working drive.

I shall read recent ChangeLogs and patches and see if any work has been done on my chipsets.  From what I gather (from linuxhq.org), they are up to 2.6.17.9 and 2.6.18-rc4, so I'll take a peek later on.

I guess there is no nice ext32reiser program out there?  :Wink:   I'll give reiser a go if all else fails.  Another idea I had was in /var/log/messages, every time something goes wrong, the journal gets mentioned, like so:

```
Aug 15 12:22:06 marbles kjournald starting.  Commit interval 5 seconds
Aug 15 12:22:43 marbles Aborting journal on device dm-3.
Aug 15 12:22:43 marbles EXT3-fs error (device dm-3): ext3_journal_start_sb: Detected aborted journal
Aug 15 12:23:03 marbles __journal_remove_journal_head: freeing b_committed_data
Aug 15 12:23:05 marbles kjournald starting.  Commit interval 5 seconds
Aug 15 12:23:05 marbles EXT3-fs warning (device dm-3): ext3_clear_journal_err: Filesystem error recorded from previous mount: IO failure
Aug 15 12:23:05 marbles EXT3-fs warning (device dm-3): ext3_clear_journal_err: Marking fs in need of filesystem check.
Aug 15 12:23:05 marbles EXT3 FS on dm-3, internal journal
Aug 15 12:28:32 marbles Aborting journal on device dm-3.
Aug 15 12:28:32 marbles EXT3-fs error (device dm-3): ext3_journal_start_sb: Detected aborted journal
Aug 15 12:29:20 marbles __journal_remove_journal_head: freeing b_committed_data
Aug 15 12:29:20 marbles __journal_remove_journal_head: freeing b_committed_data
Aug 15 12:29:20 marbles __journal_remove_journal_head: freeing b_committed_data
Aug 15 12:29:20 marbles __journal_remove_journal_head: freeing b_committed_data
Aug 15 12:29:24 marbles kjournald starting.  Commit interval 5 seconds
Aug 15 12:29:24 marbles EXT3-fs warning (device dm-3): ext3_clear_journal_err: Filesystem error recorded from previous mount: IO failure
Aug 15 12:29:24 marbles EXT3-fs warning (device dm-3): ext3_clear_journal_err: Marking fs in need of filesystem check.
```

Now, I have very little idea what those errors mean, but I was thinking, is it the journal breaking or saving the data?  One idea I had was switch back to ext2 and see if that solves it.
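To get a feel for how quickly things go wrong after each mount, a rough Python sketch like this one could pair each "kjournald starting" line with the next "Aborting journal" line and print the gap in seconds. It assumes syslog-style lines with English month names; the function name is invented, the `year` default is only a placeholder because syslog omits the year, and since kjournald starts for every ext3 mount the pairing is only approximate on a box with several ext3 partitions:

```python
from datetime import datetime


def seconds_to_abort(lines, year=2006):
    """Seconds from each 'kjournald starting' to the next journal abort."""
    gaps = []
    started = None
    for line in lines:
        parts = line.split(None, 4)  # month, day, time, host, message
        if len(parts) < 5:
            continue  # blank or truncated line
        try:
            stamp = datetime.strptime(
                "%d %s %s %s" % (year, parts[0], parts[1], parts[2]),
                "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a syslog-style line
        message = parts[4]
        if message.startswith("kjournald starting"):
            started = stamp
        elif message.startswith("Aborting journal") and started is not None:
            gaps.append(int((stamp - started).total_seconds()))
            started = None
    return gaps


sample = [
    "Aug 15 12:22:06 marbles kjournald starting.  Commit interval 5 seconds",
    "Aug 15 12:22:43 marbles Aborting journal on device dm-3.",
    "Aug 15 12:23:05 marbles kjournald starting.  Commit interval 5 seconds",
    "Aug 15 12:28:32 marbles Aborting journal on device dm-3.",
]
print(seconds_to_abort(sample))  # [37, 327]
```

On the excerpt above that works out to aborts 37 seconds and about five and a half minutes after the mounts. The log itself records "IO failure" as the error carried over from the previous mount.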

Mucho thank ye for the help.

----------

## Gentree

I guess there is no nice ext32reiser program out there?  :Question: 

Sounds like KDE has lost its marbles.  This is the kind of thing I dislike in a WM and why I have always shied away from KDE. It smacks too much of the windows approach (indeed that's exactly what it aims to be). Far too many daemons lurking around for my taste.

Try reiser4; that will leave it scratching its marbles.

Just watch out on 2.6.18-rc* if you use xfs, there are "issues" on some systems.

you may even find just moving it to reiserfs may avoid this particular daemon.

good luck.  :Cool: 

----------

## olger901

Have you tried simply replacing the hard drives yet, or installing a separate PATA or SATA controller and moving large files around using the third-party controller and/or new HDs?

----------

## Nick C

Going back to possible software fixes, have you tried any of the .18 rc kernels that are in the unsupported software forum? If it's a driver problem that improved from .16 to .17, then it may have been fixed, or gotten better still, in .18.

----------

## Gentree

 *Quote:*   

> I guess there is no nice ext32reiser program out there?

 

I doubt you'll find a tool to convert it piecemeal on the fly without saving the data elsewhere, if that's what you're hoping for.

You seem to have some heavy-duty backup of your system and lots of disc space, so why not tarball it or cp -ax?

I recall you said this seems only to happen on /home , is that a separate partition?

 du -h /home ?

 :Cool: 

----------

## chrismortimore

Sorry, been busy over the weekend and not checked the forums for a while.  Right...

Still to try the .18rc kernels, it will be my next port of call.

It has happened on /home, /usr, /var and /etc so far, / /home /usr /var /opt and /boot are my partitions.  I usually find it when I run backups, but this weekend it didn't happen.  Maybe it fixed itself... We'll see...

I have a spare 80GB drive, so I'll migrate to that one first and copy the data over, and see how that goes under ext3.  If it still plays up, I'll try reiser.

On a side note, I like KDE, but I agree there is too much stuff in there.  I use a minimalist version (the joy of the split ebuilds), and find it quite nippy.  Although I do like XFCE4, I'm gonna give XFCE 4.4 a go once the desktop settles down and behaves.

Cheers again for the ideas

----------

## Gentree

 *Quote:*   

> It has happened on /home, /usr, /var and /etc so far, / /home /usr /var /opt and /boot are my partitions. 

 So it's pretty much system wide. I'd guess /boot is ext2.

You said you have the WDs disconnected most of the time. Does this only happen on the main drive?

Are you doing entire system backups from a liveCD or just selected backups on a running system?

You could try circumventing the fs by doing backups with dd command. That may help decide if it's HD or fs software.
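If you only want to see whether the raw device reads cleanly, without writing an image anywhere, a read-only pass does much the same job as `dd` to /dev/null. Below is a Python stand-in for that idea (it is not dd itself). The function name and the `max_errors` cut-off are made up for the example, and reading a block device such as /dev/hda would need root:

```python
def read_through(path, chunk_size=1024 * 1024, max_errors=100):
    """Read `path` from start to end without writing anything.

    Returns (bytes_read, bad_offsets), where bad_offsets lists the
    positions at which the kernel reported a read error.
    """
    bytes_read = 0
    bad_offsets = []
    offset = 0
    with open(path, "rb", buffering=0) as src:
        while len(bad_offsets) < max_errors:
            try:
                chunk = src.read(chunk_size)
            except OSError:
                # Note the failing position, then skip past that chunk.
                bad_offsets.append(offset)
                offset += chunk_size
                src.seek(offset)
                continue
            if not chunk:
                break  # end of file or device
            bytes_read += len(chunk)
            offset += len(chunk)
    return bytes_read, bad_offsets
```

A clean pass over the whole disk with an empty `bad_offsets` list, while ext3 keeps reporting IO failures on the same disk, would lean towards the filesystem or driver rather than the platters.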

Also if you fancy trying xfce there is a nice tidy overlay to build it from cvs.

http://overlays.gentoo.org/proj/xfce/

 It is pretty stable by all accounts for a while now. I have had no issues other than one pkg failed to build . I posted a bug and it was fixed and in cvs within 15mins.   :Shocked:  I was impressed. Apart from that it has been rock solid.

again , may be a way to eliminate some variables.

 :Cool: 

----------

## chrismortimore

The whole lot is running ext3 actually, my guess is that /boot and /opt get very little usage, and hence have remained normal.

I backup with this command on a live system:

```
rsync --archive --delete --links --progress --stats --verbose \
    --exclude /dev \
    --exclude /home/ccache \
    --exclude /mnt/gentoo32/dev \
    --exclude /mnt/gentoo32/home/chris \
    --exclude /mnt/gentoo32/proc \
    --exclude /mnt/gentoo32/sys \
    --exclude /mnt/gentoo32/tmp \
    --exclude /mnt/gentoo32/usr/portage \
    --exclude /mnt/gentoo32/usr/src \
    --exclude /mnt/gentoo32/var/tmp \
    --exclude /home/portage \
    --exclude /mnt \
    --exclude /proc \
    --exclude /sys \
    --exclude /tmp \
    --exclude /var/tmp \
    --exclude '*lost+found' \
    --include /mnt/gentoo32/ \
    / /mnt/backup/system/marbles/
```

So far, it's only happened on the main drive.  The backup drives are unplugged from the PSU unless I'm actually backing up, and the "multimedia" (music, vids, that kinda thing) drive is mounted readonly normally and gets little use (just either a vid or amarok playing away in the background).  So really only the main drive gets used heavily, which is why I think that it's all occurred on that one so far.

Just now, I'm pretty convinced that it is a software issue.  I have a winxp installation on the main drive (just a little one for Halflife 2 and Far Cry), and it has been absolutely fine, and with those massive levels to load, that'd put the drive under fairly weighty use.  Also, I used dd to make an image of it not long ago (last week some time) and it came out fine (it's a 20G partition by the way).

----------

## Gentree

 *Quote:*   

> Just now, I'm pretty convinced that it is a software issue. 

  certainly looking that way. BTW , despite my instant mistrust of anything that begins with the letter K , kjournald is a kernel daemon , not kde.  :Razz: 

It takes care of ext3 journalling, so there's a high prob that switching fs will get around this problem even if it does not explain it.  :Cool: 

----------

## Drake Mallard

Just a thought, I didn't see this addressed anywhere in the thread, but have you tried replacing your IDE cables and/or using a different IDE channel?  I've had damaged IDE cables cause this kind of unhappiness in the past.

Since the problem started cropping up when you changed hardware and platform trees, perhaps it would make sense (and, if it works, save time) to back up all of your configuration data, home directory, etc., remake the filesystem, rebuild the system from scratch from the new tree, and restore your saved configuration.

Good luck  :Smile: 

----------

## chrismortimore

You know, I never thought of changing the ribbon cables...  I have a spare one sitting around, so I'll give that a go.

Trying a different channel could be hard though, 7 IDE drives and 4 channels does not add up nicely  :Wink: 

And I'm not too hot on the idea of a reinstall, to me it isn't the Linux way.  Excluding making virtual computers under qemu/vmware, I'm proud to say I've only ever had to install Debian 3 times (once on each machine I had it on) and Gentoo 3 times (once on the laptop, once on the desktop as x86, and a reinstall on the desktop as amd64) in almost 6 years.  I lost count ages ago of how many times I had to reinstall Windows in that same time period.

What I have noticed is that these files have been occurring less and less as time has gone on.  Which is just strange, because the machine's usage has been pretty much constant.

OK, so my new plan:

1. Switch cables

2. Try 2.6.18-rc kernels

3. Switch to reiser

4. Cry

But anyway, it's way past my bed time and I need my beauty sleep...

----------

## chrismortimore

I had a thought last night.  Both my laptop and the desktop, back when it ran x86, used kernel 2.6.16 and had no problems, so I've switched back to 2.6.16-gentoo-r13 to see if that helps.  Under amd64, it has only ever run 2.6.17-gentoo-r4.  What was keeping me on 2.6.17 was that my sound card needed it (it's one of these new fancy HD sound thingies), but I found an old Yamaha card from many moons ago, so I've put that in for now.  Let's see if 2.6.16 works any better.

Cheers

----------

