# ext4: Preventing corruption on power loss/crash

## haarp

Hey,

my main Gentoo machines recently had some problems with power management and switched off suddenly. One might assume that modern journaling fs could handle this, but nooo. On the next boot, I am greeted by funny messages such as:

```
ext4_orphan_cleanup: deleting unreferenced inode 123456
```

and indeed, some files are now missing or got emptied. Another machine of mine, Debian this time, experienced uncorrectable RAM ECC errors which triggered an instant reset. Ok, gotta replace the memory module soon, but at least my data is fine. Ooh wait, ext4 screwed up, leaving me with nuked files.

In both cases, standard mount options were used. Kernels were 3.5 and 3.2 series.

Since it is apparently impossible to recover the filenames involved, I have to live with the fact that ext4's suckiness corrupted random files on the system with no way whatsoever for me to restore them from backups (which do exist): it never told me the names of the files, only the inode numbers.

However, I would like to prevent this from happening in the future. ext4 has a ton of different mount options, many with unclear explanations, and hundreds of sites on the Internet provide outdated or conflicting information and different recommendations. It is impossible for me to determine what to do. I just want my fs to be power-loss-proof without having to go to extreme measures (such as mounting with sync; 2MB/s write speed is not acceptable).

And please, don't recommend xfs/jfs/zfs/reiser4/btrfs or other experiments. I'd like to stay with the stable ext series (even though recent events have rather disproven ext4's stability for me).

Thanks!

Last edited by haarp on Thu Oct 17, 2013 9:35 pm; edited 1 time in total

----------

## NeddySeagoon

haarp,

No filesystem will maintain your data in the face of a loss of power.

You need a UPS.

You also need backups.  ECC errors are hardware related.  The drive can no longer read its own writing, at least for the ECC-affected sectors.

If it's really impossible to recover your files from your backups, your backups are not worth the time you spent creating and validating them.

You did validate them, so you know they are good, didn't you?

You may not want to put the effort in to identify and restore affected files, but that's a value judgement only you can make.

----------

## haarp

 *NeddySeagoon wrote:*   

> haarp,
> 
> No filesystem will maintain your data in the face of a loss of power.
> 
> You need a UPS.
> ...

 

Thanks for your answer, but sorry Ned, you seem to misunderstand a few things  :Wink: 

- A UPS will not prevent my laptop from failing to suspend properly, which has been one cause of power loss for me.

- I am referring to memory ECC errors, which happened due to a faulty DIMM that has since been replaced.

- I can restore anything from backup, but I need to know WHAT I need to restore first! Just an inode number is useless to me.

----------

## Hu

Your laptop will fail to resume from suspend properly if:

- The BIOS is buggy.  Bug your vendor.
- The kernel is buggy.  This is usually fixed in short order.
- You run out of power.  See: UPS.
- You hold down the power button and force an unclean halt.  Do not do this.

More generally, your suspend process should leave the filesystem in a recoverable state even in the event of power failure.  As far as I know, all the major techniques force a filesystem sync before halting.  You will need an fsck afterward to clean up orphans, but no valid files should go missing.  If they do, I would consider that a kernel-related filesystem bug.

As regards backups, I suggest you begin by restoring files that are present in the backup and absent on disk.  This may restore files that were intentionally removed after the backup was created, but you can redo the deletion of those later.
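A rough sketch of that comparison, with both paths hypothetical (point them at your backup mount and the tree you are checking):

```shell
BACKUP=/mnt/backup   # hypothetical: where the backup tree is mounted
LIVE=/mnt/root       # hypothetical: the live tree to compare against

# Build sorted lists of regular files on each side, then print the paths
# present in the backup but missing on the live side
# (comm -23 keeps lines unique to the first input).
( cd "$BACKUP" 2>/dev/null && find . -type f | sort ) > /tmp/backup.list
( cd "$LIVE"   2>/dev/null && find . -type f | sort ) > /tmp/live.list
comm -23 /tmp/backup.list /tmp/live.list
```

Feeding that list to rsync or cp --parents would then restore only the missing files.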

----------

## NeddySeagoon

haarp,

You have a laptop with ECC RAM?

That's rare.

----------

## haarp

Come on guys, I know that fixing the cause of unclean shutdowns would be preferable. But as we all know, shit happens.  :Wink:  Making ext4 as durable as possible remains the best realistic option.

 *NeddySeagoon wrote:*   

> haarp,
> 
> You have a laptop with ECC RAM?
> 
> That's rare.

 

No, I have a separate Microserver which uses ECC RAM. Both it and the laptop recently experienced filesystem corruption due to sudden reboots/shutdowns.

----------

## Jaglover

haarp,

when it comes to reliability in the case of power loss versus performance, the two are mutually exclusive.

----------

## Hu

Negative external events cannot be completely prevented, but modern consumer-level equipment can arrange for you to have a clean shutdown in the vast majority of cases.  A battery backup that warns the system of imminent power failure, well-tested code, and a mollyguard should prevent all but the worst scenarios.  If you more clearly articulate the nature of the power problems you experienced, we may be able to provide better advice about how to protect against them.

----------

## zeronullity

Resume from S3 suspend is a hard one to troubleshoot, and it's what I'm currently working on myself.

You can try pm_trace, which might copy useful information to your RTC memory during a kernel panic on resume.

Also, disabling the HDD write cache will help prevent loss of information. I know it's a laptop, but if you have an extra drive in the system, software RAID might not hurt either. Of course, disable hibernate/suspend until you can get it correctly resolved, and make sure you back up your data. A RAID/filesystem rebuild is normal after any power loss; it doesn't mean you lost any data. You could keep a file list with checksums for verification and compare it to backed-up data, if you're paranoid.

----------

## nlsa8z6zoz7lyih3ap

 *Quote:*   

> However, I would like to prevent this from happening in the future. ext4 has a ton of different mount options, with many unclear explanations, hundreds of sites on the Internet providing outdated or conflicting information and different recommendations. It is impossible for me to determine what to do. I just want my fs to be power-loss-proof without having to go to extreme measures (such as mounting sync; 2MB/s write speed is not acceptable) 

 

Although I have not suffered loss such as you have, the question of mount options has intrigued me for some time. Here is a line from my /etc/fstab:

```
/dev/root       /               ext4         noatime,noauto_da_alloc,barrier=1,commit=10        0 1
```

My hope is that this is safer than the default options, but I have no idea if this is true. I would appreciate it if anyone has any comments.
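For what it is worth, the big lever ext4 has here is the data= mount option; a hedged variant of that line with full data journalling turned on (this journals file contents as well as metadata, at a substantial write-performance cost) would be:

```
/dev/root       /               ext4         noatime,data=journal,barrier=1,commit=10        0 1
```

Note that the journalling mode cannot be changed by a remount, so for the root filesystem this may also need rootflags=data=journal on the kernel command line; and data=journal disables delayed allocation, which makes noauto_da_alloc redundant.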

DATA LOSS: Once, while on holidays, we lost (perhaps I accidentally deleted them) all of our digital photo files on my wife's laptop. I immediately unmounted the partition in question and used sys-fs/ext4magic and sys-fs/extundelete to attempt to get them back. We got about 80% of them back, 'tho the sorting into folders was mostly lost.

I think that at least one of these programs reads as much of the journal as it still has, to find the deleted files. Of course this will not likely work on files deleted in the distant past.

Time is of the essence.

COMMENT: I always have 2 working Gentoo root partitions; just before any emerge -uvDN @world, I sync the second root partition from the first.

That way, if I break anything or have problems, I can easily restore from the 2nd to the 1st, or boot to the 2nd to repair the 1st. This is like having a built-in live recovery CD; in fact it is even better because of the high compatibility of the 2 root partitions.

----------

## kernelOfTruth

*subscribes*

might run into this in the future, too / again

so I have a marker / post to look for

afaik with newer kernel versions ext4 is in much better shape,

but corruption/data loss can still happen

@haarp:

you're another fellow HP MicroServer user ?   :Smile: 

(I still have to set up/fully install that beast ^^ )

----------

## haarp

 *kernelOfTruth wrote:*   

> *subscribes*
> 
> @haarp:
> 
> you're another fellow HP MicroServer user ?  
> ...

 

Oh, yeah! N40L here. 16GB ECC RAM and 2x 3TB in RAID0. Not that impressive, but it's a cool machine. Of course, a year later the N54L came out  :Rolling Eyes: . Wish I'd gotten that; the N40L is quite slow with encryption. And I get to RMA the memory modules now, which is a pita.

----------

## ryao

 *haarp wrote:*   

> Hey,
> 
> my main Gentoo machines recently had some problems with power management and switched off suddenly. One might assume that modern journaling fs could handle this, but nooo. On the next boot, I am greeted by funny messages such as:
> 
> ```
> ...

 

What you want is not possible with the current ext4 code. Let me give you a brief explanation of file systems. They typically consist of 4 primary things:

The superblock, which describes the filesystem itself. It contains things like the free block list, free inode list, information about filesystem features, etcetera.

The inode, which describes a file's contents and metadata.

The dentry, which is a list of names and inode numbers.

Direct blocks, which store data; indirect blocks, which store pointers to direct blocks; 2nd-level indirect blocks, which store pointers to indirect blocks; etcetera.

This basically describes UFS and ext2. ext3 added journalling and ext4 added extents, but for our purposes, we can think about filesystems in terms of this. When you create a file, several things need to be done. Specifically, the following:

You need to allocate an inode by removing an entry from the free inode list.

You need to write to the block containing the inode.

You need to link that inode to a directory by updating the dentry for the directory in which the file was created. The dentry itself is a file, which has its own inode. Therefore we need to modify that file.

This is just one operation, but for the purposes of an example, this should suffice. Ideally, these changes are done in both memory and on disk. However, updating disk structures is slow, so what we can do is just update the memory structures and then perform all of the changes needed for the file creation asynchronously. Let's imagine doing this with many different files in parallel and suddenly having a kernel panic. Well, things are in varying states of completion. To make matters worse, the directory entries can span multiple blocks, and we could have had updates in flight that spanned multiple blocks, especially if we were trying to keep things approximately sorted.

If we are clever, we can make this better by implementing a journal. That way, we can record everything we did as part of a journal transaction and replay it to figure out how to make things sane. Unfortunately, doing journalling correctly is slow, so we cut corners by only journalling metadata. That means that we journal changes to inodes and dentries, but not to the direct or indirect blocks. Let's assume that we can implement that perfectly (even though we likely won't). That is great, but the hardware manufacturer wanted things to go fast, so they decided to start doing write buffering and write reordering stuff on us. Therefore, while we think we just wrote something (we got back completion from the underlying device; yay!), we really only moved the information about the change from the system memory to the disk's microcontroller's memory. Since we think we just successfully did a write, the data blocks used to store the journal entries in our perfect journal can now be recycled and we do the next journal transaction. To make things even more fun, the next journal transaction can be written before the previous one. The hardware can also do other neat things, like suffer from bit flips that cause the write that you thought was going to sector X to instead go to sector Y. That is not mentioning uncorrectable errors that SMART detects, in which case a sector suddenly disappears.

Are you beginning to see how things can go wrong? In reality, filesystems have become far more complicated than what I described, making it much harder to get things right. We have ACLs, extended attributes, extents, etcetera. At the same time, demands for performance have increased. We have highly concurrent systems and ever more VFS APIs for trying to get the most concurrency that we can out of storage (e.g. asynchronous IO). Filesystem designers are employed primarily to address business concerns, specifically getting more stuff done in less time. This explains design decisions made in the ext2/ext3/ext4 series of file systems. To cite specific examples: the use of metadata journalling and, according to the Linux kernel documentation, the failure to implement performance tweaks for full journalling (data + metadata) both make sense given the demand for performance.

To provide an answer that is more useful to your question, I will say that you can get a little better reliability from ext4 by enabling full journalling, but this will come at the cost of a significant performance penalty that is even greater than the theoretical penalty, because the tweaks to implement full journalling efficiently have not been done in ext3/ext4. This will not eliminate all potential ills, but it should be better than what you have now. That being said, there is room for further improvement in the area of reliability, even without designing a new filesystem. Specifically, it is possible to implement ext4 in such a way that unclean shutdowns always produce a filesystem that can be immediately mounted, where the only problem is a space leak, and any such leak could be fixed with an online consistency check. This was done with UFS in FreeBSD. It is called UFS Soft Updates:

http://jeffr-tech.livejournal.com/22716.html

The same could be done in Linux with ext4. However, there are multiple reasons why kernel developers would not work on this. To name some:

Implementing soft updates is hard.

Anyone trying to improve ext2/ext3/ext4 will be constrained by filesystem design decisions of the 1970s that have been shown to be flawed.

There are newer filesystems that started from a clean slate, avoided many of the problems that 1970s filesystem designs had, and today are better than any hypothetical improvement to ext2/ext3/ext4, with ZFS being the prime example.

It is difficult to convince a corporation to fund such changes to a legacy code base when replacements exist that are superior to what you expect to obtain after the improvements.

With that said, the orphan cleanup does not necessarily mean that something was damaged; that largely depends on how the inode became an orphan. If it became an orphan because the only directory entry linked to it was damaged by a misdirected write or a sudden power failure, then you just lost data (and you would likely have a directory that looks like it contains garbage). If it became an orphan because the last directory reference to it was unlinked while it had not yet been returned to the free list (which is permissible because open file handles count as links), then that is okay.
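For what it is worth, while an inode still exists, debugfs from e2fsprogs can map an inode number back to a path, which is exactly the piece of information the kernel's orphan messages omit. Once orphan cleanup has already deleted the inode, there is nothing left to look up, so this only helps if you catch it early. A sketch, with a hypothetical device name:

```shell
DEV=/dev/md0    # hypothetical: the filesystem that logged the inode number

# Print any directory entries that point at inode 123456 (read-only).
# This also works on an unmounted filesystem image instead of a device.
if [ -e "$DEV" ]; then
    debugfs -R "ncheck 123456" "$DEV"
fi
```

Running it against a device usually requires root; against an image file it does not.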

----------

## mv

 *ryao wrote:*   

> That is great, but the hardware manufacturer wanted things to go fast, so they decided to start doing write buffering and write reordering stuff on us.

 

That's why barriers exist, which solve this problem if neither software nor hardware is buggy.
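Whether barriers are actually in effect on a given mount can be read straight from the kernel's mount table; ext4 enables them by default, and an explicit barrier=0/nobarrier would show up here. A small sketch:

```shell
# Show the mount options the kernel actually applied to the root filesystem.
# The fourth field of /proc/mounts is the effective option string.
awk '$2 == "/" { print $4 }' /proc/mounts
```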

----------

## ryao

 *mv wrote:*   

>  *ryao wrote:*   That is great, but the hardware manufacturer wanted things to go fast, so they decided to start doing write buffering and write reordering stuff on us. 
> 
> That's why barriers exist which solve this problem if neither software nor hardware is buggy.

 

I suggest that you read this:

https://queue.acm.org/detail.cfm?id=2367378

----------

## Gentree

I used reiser4 for many years precisely because of power unreliability. Currently get one or two brown-outs per day. The R4 atomic write strategy seems quite effective. 

Unfortunately it never got integrated into mainline and is now history. I had to buy a UPS.

----------

## ryao

 *Gentree wrote:*   

> I used reiser4 for many years precisely because of power unreliability. Currently get one or two brown-outs per day. The R4 atomic write strategy seems quite effective. 
> 
> Unfortunately it never got integrated into mainline and is now history. I had to buy a UPS.

 

Reiser4 is still maintained:

http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/

----------

## Gentree

Hey, Shishkin is still active on that , great. I thought he'd had to drop it since being bought up by RH. 

Thanks for the info. Having just converted all my partitions to ext4 in anticipation of this kernel update, I can now change them back again! LOL.

thanks    :Cool: 

----------

## mv

 *ryao wrote:*   

>  *mv wrote:*    *ryao wrote:*   That is great, but the hardware manufacturer wanted things to go fast, so they decided to start doing write buffering and write reordering stuff on us. 
> 
> That's why barriers exist which solve this problem if neither software nor hardware is buggy. 
> 
> I suggest that you read this:
> ...

 

I talked about barriers, not about syncing. Barriers avoid the reordering problems (if software and hardware follow them correctly) and thus avoid all the problems you mentioned, which are only implicit consequences of this.

----------

## CkoTuHa

 *mv wrote:*   

>  *ryao wrote:*    *mv wrote:*    *ryao wrote:*   That is great, but the hardware manufacturer wanted things to go fast, so they decided to start doing write buffering and write reordering stuff on us. 
> 
> That's why barriers exist which solve this problem if neither software nor hardware is buggy. 
> 
> I suggest that you read this:
> ...

 

ffs, can you come down from that radical fundamentalist-dualist point of view of yours to something more practical?

Barriers, two-phase commit, etc are all fine techniques to alleviate the issues that you have heard of.

But what ryao is implying is that you can't really trust your disk controller. Read it this way: if, for instance, you can't trust your hardware when it says "commit done", how are you going to implement something reliable on top of that? The answer is that you can't.

----------

## Hu

The optimizations that ryao described are useful and there is no serious harm in having the hardware report completion as soon as it is ready for the OS to move on.  As I read mv's post, hardware which claims to implement barriers and reports completion before it is truly done is not implementing barriers properly.  I do not believe ryao was asserting that you can never trust the disk controller to get anything right.  Rather, he was describing that if you want the hardware to take an action, you need to be sure to request that action explicitly, rather than relying on the implementation taking that action as a side effect of doing what you requested.  In this case, if you want a strong guarantee that data has been written, ask for it by issuing a barrier.  If you just need the data to be scheduled for eventual write, then you can assume a completion report means the disk will get to it if power does not fail first.

----------

## ryao

 *Hu wrote:*   

> The optimizations that ryao described are useful and there is no serious harm in having the hardware report completion as soon as it is ready for the OS to move on.  As I read mv's post, hardware which claims to implement barriers and reports completion before it is truly done is not implementing barriers properly.  I do not believe ryao was asserting that you can never trust the disk controller to get anything right.  Rather, he was describing that if you want the hardware to take an action, you need to be sure to request that action explicitly, rather than relying on the implementation taking that action as a side effect of doing what you requested.  In this case, if you want a strong guarantee that data has been written, ask for it by issuing a barrier.  If you just need the data to be scheduled for eventual write, then you can assume a completion report means the disk will get to it if power does not fail first.

 

The tricks done by hardware make it very easy for changes made by file system developers to introduce subtle reliability problems that are not easily realized until something goes wrong in front of the developers. The competing demands to produce something that performs as well as possible and does not immediately eat all of the data given to it make this more likely to happen than you would expect. Even if they do everything right, the hardware can still mess up by reading/writing to the wrong sectors or failing to return what was written due to an uncorrectable error caused by the ECC calculations on the low level formatting. ZFS is one of the few filesystems that implements the redundancy and checksums required to handle that.

That is why the original poster's request for mount options that prevent things from going wrong is impossible to satisfy, which was my point.

----------

## mv

 *Hu wrote:*   

> if you want a strong guarantee that data has been written, ask for it by issuing a barrier.

 

AFAIK barriers do not guarantee a write - this probably cannot be guaranteed with current hardware.

What barriers do guarantee is that actions coming after a barrier are not reordered before any action before the barrier.

In particular, if you write to a journal, then issue a barrier, then write the actual data, and then issue a second barrier, you can be sure that the journal is written before the data is committed to disk. In case of a power loss, depending on whether it happens before or after the barrier, you either have a broken journal or a journal that has not yet been completely replayed.
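A userspace analogue of that ordering, sketched with explicit flush points (sync(1) with a file operand needs coreutils 8.24 or newer, and real journalling of course happens below the filesystem API, so this only approximates the idea):

```shell
# Journal-style ordering with explicit flush points: record the intent,
# force it to stable storage, then write the data and flush again.
echo "intent: create datafile" > journal.log
sync journal.log                 # flush the intent record first
echo "payload" > datafile
sync datafile                    # only then flush the data itself
```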

----------

## mv

 *ryao wrote:*   

> Even if they do everything right, the hardware can still mess up by reading/writing to the wrong sectors or failing to return what was written due to an uncorrectable error caused by the ECC calculations on the low level formatting

 

Indeed, I sometimes have a power loss of about 1 second here. Whenever this happens, the system usually does not halt immediately; instead there is enough rubbish in RAM which then gets written to disk, so that afterwards my ext4 filesystem is usually completely broken and can at best be repaired with tricks like using the second copy of the superblock etc. I enabled all the possibilities like journal_checksum etc., but nothing helped.

----------

## ryao

 *mv wrote:*   

>  *Hu wrote:*   if you want a strong guarantee that data has been written, ask for it by issuing a barrier. 
> 
> AFAIK barriers do not guarantee a write - this probably cannot be guaranteed with current hardware.
> 
> What barriers do guarantee is that actions coming after a barrier are not reordered before any action before the barrier.
> ...

 

For what it is worth, Linux's block subsystem does not do barriers anymore:

http://lwn.net/Articles/399715/

----------

## mv

 *ryao wrote:*   

> For what it is worth, Linux's block subsystem does not do barriers anymore:
> 
> http://lwn.net/Articles/399715/

 

Maybe I do not completely understand, but in the corresponding discussion it is mentioned that the patch concerns a case which was used only by reiserfs with "barriers". Since ext4 certainly already supported "barriers" at the time of that discussion, I guess this is about a different mechanism that is also called "barriers".

Edit: I carefully read the whole thread again, and it seems that you are right: they replaced a mechanism in the kernel which was known to work with a presumably faster mechanism which is known to fail on a lot of hardware. What a shame for Linux!

----------

## haarp

```
EXT4-fs (md0): 67 orphan inodes deleted
```

AAAAHHH!!

That's 67 inodes of random untraceable corruption! I need to get away from EXT4. Anything but this...

----------

