# Ext4 fs corruption

## gtbX

I'm having an odd issue with one of the machines I remotely administer.  Recently, its /home partition started developing filesystem errors that prevent it from being mounted at boot.  Instead, it drops to a login screen, and I have to walk someone through logging in and running fsck -y on the partition.  It seems to need it every time it reboots now.  I tried reformatting the partition with

```
mke2fs -t ext4 -c -c /dev/sda7
```

to scan for bad blocks, but it didn't find any.  I suppose it might be the superblock(?) that's bad, but I would think that would've been detected too.

So I have 2 questions:

1. What could be causing this/how to prevent it?

2. Can the init scripts be configured to keep booting, even if /home fails to mount, so that I can at least ssh into the box?

----------

## eccerr0r

Remember that corruption doesn't necessarily come from the disk.  Just like any other computer, garbage in, garbage out.  Your CPU could be emitting garbage for the disk to write, or perhaps your RAM has amnesia causing your CPU to write bad data to the disk.

I would think that the initscripts should keep on booting without /home, but since ~/.ssh lives on /home for many users, it would still be hard to ssh in (especially if root login is disabled).

----------

## gtbX

I think if it was a kernel or RAM issue, I'd see more problems than just this.  Then again, I first saw this problem shortly after upgrading to gentoo-sources-3.8.13.  I haven't had any issues with the root fs (also ext4), but it gets less I/O.

The init scripts fail at running fsck on /home, and drop to an emergency login:  "Welcome to (none).(none)" or something.  The hostname isn't even set yet.  Conceivably, the network and sshd could be started, and I could login as root (via pubkey of course). 

/etc/fstab:

```
/dev/disk/by-label/ROOT  /               ext4    noatime                 0 1

/dev/disk/by-label/HOME  /home           ext4    noatime                 0 2
```
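On the second question, one thing that might be worth trying (a sketch only, not tested on this box): the sixth fstab field is the fsck pass number, and 0 means the filesystem is not checked at boot, so there would be no fsck failure on /home for the init scripts to stop at.  `nofail` is a standard mount option that suppresses the error when the device is missing; whether the init scripts carry on after a failed mount of a damaged filesystem is an assumption here.  The catch is that /home would then be mounted unchecked, so fsck would have to be run by hand (over ssh) afterwards.

```
# Root is still checked first at boot (pass number 1)
/dev/disk/by-label/ROOT  /               ext4    noatime                 0 1

# Pass number 0: no boot-time fsck for /home; nofail: do not complain if the device is absent
/dev/disk/by-label/HOME  /home           ext4    noatime,nofail          0 0
```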

I'll double check my kernel config, and see if dropping back to gentoo-sources-3.6.11 helps.

----------

## gtbX

It does seem to be kernel-related; I just ran into the same problem on a different machine with the same kernel version.  I rolled back the kernel on the original box to 3.6.11 and the problem seems to have gone away (I'll upgrade the kernel when I get to it in person).  The second box has been upgraded to 3.10.7, so I'll have to see if that helps.

----------

## eccerr0r

Though I had other issues with gentoo-sources-3.8.13, I have not seen the corruption issue on my ext4 machines.

----------

## gtbX

Crud, it happened again, this time on 3.6.11.  Seems to start when there's an unclean shutdown.  Running fsck manually has it remove some deleted inodes - nothing critical yet, but it's only a matter of time until valuable files get lost.  I thought using a journalling fs was supposed to help with that?  Maybe I'm just doing it wrong.

----------

## eccerr0r

Uh... No.  Even with a journalling filesystem, just shutting down the machine abruptly (like cutting power) is not proper.

Journalling filesystems will *help* but do not prevent corruption.  A proper shutdown is still needed.

If you must have a system that can handle this, it can help more if cached writes are flushed to disk as quickly as possible.  It will reduce performance but will help against corruption.
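For what it's worth, a rough sketch of where to look on Linux: the `vm.dirty_expire_centisecs` and `vm.dirty_writeback_centisecs` sysctls control how long dirty data may sit in RAM and how often the kernel's flusher wakes up (typically 30 s and 5 s by default).  The function name `show_dirty_knobs` and the values in the comments are only illustrations, not recommendations.

```shell
#!/bin/sh
# Print the current writeback timers (units: 1/100 of a second).
show_dirty_knobs() {
    for knob in dirty_expire_centisecs dirty_writeback_centisecs; do
        printf 'vm.%s = %s\n' "$knob" "$(cat "/proc/sys/vm/$knob")"
    done
}

show_dirty_knobs

# To flush sooner, root could lower them, e.g. (illustrative values only):
#   sysctl -w vm.dirty_expire_centisecs=500
#   sysctl -w vm.dirty_writeback_centisecs=100
# Running `sync` forces an immediate flush of whatever is currently cached.
```

Lower values mean more frequent, smaller writes, which is where the performance cost comes from.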

----------

## trumee

*eccerr0r wrote:*

> If you must have a system that can handle this, it can help more if cached writes are flushed to disk as quickly as possible.  It will reduce performance but will help against corruption.

How can I do this? It would be useful in situations where power failures are unpredictable.

----------

## kernelOfTruth

*trumee wrote:*

> *eccerr0r wrote:*
>
> > If you must have a system that can handle this, it can help more if cached writes are flushed to disk as quickly as possible.  It will reduce performance but will help against corruption.
>
> How can i do this? It will be useful in situations when power failure is random.

Mount with commit=5 (that should be the default, no? Forcing it nonetheless is safer),

or commit=10 (you could also try 20 to sync every 20 seconds),

or add data=journal as a mount option to force (ext3-like) full journalling mode. Is it deprecated yet, btw?
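As a sketch of what those options could look like in fstab (the HOME label is borrowed from the fstab posted earlier in the thread, and the values are only illustrative): the kernel's ext4 documentation lists `commit=` with a default of 5 seconds and `data=journal` as a supported data mode, so both should apply to ext4 as well as ext3.  Only one of the two lines would be active at a time.

```
# Option 1: state the journal commit interval explicitly (seconds; 5 is the ext4 default)
/dev/disk/by-label/HOME  /home           ext4    noatime,commit=5        0 2

# Option 2: full data journalling (file data goes through the journal too; slower)
#/dev/disk/by-label/HOME  /home           ext4    noatime,data=journal    0 2
```

A larger commit value widens the window of data that can be lost on power failure, so going above 5 trades safety for performance rather than the other way round.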

----------

## trumee

Is the commit option only for ext3?  man mount indicates it as a suboption of ext3. 

At the moment I am running ext4, but I was wondering whether ext3 is a safer choice for sudden power failures?

----------

## eccerr0r

Since shorter commit times are just a hack to help limit the damage, I cannot condone this as a "solution".  Journalling filesystems are already helping the problem a bit as it is (unless you somehow disabled the journal) but it's still not right.

The question that's going through my head: Why is the power going out so frequently that such a thing is needed?

If it's due to laziness, people will need to figure out how to shut down normally.

If it's due to unstable power, a UPS or perhaps a laptop configured to do a clean shutdown is highly recommended; this is a "proper" solution.

How frequent is frequent?  Also, what is the function of the machine? Is it writing stuff to disk constantly?  A disk that's mostly just read should not suffer as much corruption from unclean shutdowns.

Remember, even with these faster commit options, if power goes out while committing, you will suffer problems as well.

----------

