# [SOLVED!] Ext-4 Data Corruption Bug Hits Stable Linux Kernels

## BitJam

*Quote:*   

> As a warning for those who are normally quick to upgrade to the latest stable vanilla kernel releases, a serious EXT4 data corruption bug worked its way into the stable Linux 3.4, 3.5, and 3.6 kernel series.

 

Forum member szczerb posted this news in a thread but I think it deserves a thread of its own.  

TL;DR: In recent kernels ext-4 journal playback can in some cases bork your file system.

Edit: fixed

Last edited by BitJam on Wed Oct 31, 2012 6:18 pm; edited 1 time in total

----------

## khayyam

BitJam ...

Note that you'll only get hit if the journal hasn't been wrapped, so give the journal something to work on or don't reboot so often ;) ... hehe. 

I applied Ted's patch to 3.6.3 earlier today but haven't rebooted as yet, anyhow as I've been running an affected kernel (3.6.2) for a week or so without issues I'm not inclined to panic.

best ... khay

----------

## e3k

did not get the phoronix article, which 3.5 kernel is not affected?

----------

## szczerb

This comment https://bugs.gentoo.org/show_bug.cgi?id=439502#c0 seems to suggest that 3.5.x < 3.5.7 should be safe. I just booted the 3.5.7 at work yesterday so I'm waiting things out without rebooting or shutting down for now. Patches seem to be flowing around fast.
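For anyone who wants to script that check, here is a rough Python sketch. It only encodes the rule of thumb from the bug comment above (3.5.x older than 3.5.7 seems safe); the function names are made up for illustration, and it deliberately says nothing about the 3.4 or 3.6 series, where this thread gives no cutoff.

```python
import re

def parse_release(release):
    """Turn a release string such as '3.5.4-hardened-r1' into (3, 5, 4)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    major, minor, patch = match.groups()
    # A missing patch level (e.g. '3.6') is treated as 0.
    return (int(major), int(minor), int(patch or 0))

def looks_safe_3_5(release):
    """True if release is a 3.5.x kernel older than 3.5.7.

    Returns None for any other series, because the rule of thumb
    from the bug comment only covers 3.5.x.
    """
    version = parse_release(release)
    if version[:2] != (3, 5):
        return None
    return version < (3, 5, 7)
```

Feeding it the output of `uname -r` (or `platform.release()` in Python) gives True, False, or None for "no idea".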

EDIT: BitJam, you're right - I should've made it a separate thread. I was rather swarmed at work, so didn't think of it.

----------

## Hu

Linux Weekly News has a free to read article following this.  The situation is evolving.  Ted now believes that journal wrapping may not be involved.  Additionally, nix has now stated that the affected system has some rather unusual shutdown behavior that may cause it to halt without all filesystems finishing their unmount.  If that is what happened to him and if the corruption occurs only on multiple journal replays, then standard systems that gracefully unmount (or remount readonly) all their filesystems are at much lower risk than suggested by early reports.  However, those are substantial qualifiers and there is insufficient evidence to determine whether they are met in the reported cases.

----------

## jimmij

Can someone advise me what is the safest way to switch/downgrade to 3.3.8 if I'm running 3.5.7 on my system for 2 days now without rebooting? If I understood correctly the problem mostly appears during reboot, so maybe it is better to leave the system running and not downgrade at all until next reboot?

----------

## szczerb

 *jimmij wrote:*   

> Can someone advise me what is the safest way to switch/downgrade to 3.3.8 if I'm running 3.5.7 on my system for 2 days now without rebooting? If I understood correctly the problem mostly appears during reboot, so maybe it is better to leave the system running and not downgrade at all until next reboot?

 I'm doing just that - waiting with my system on.

----------

## depontius

So the problem appears to be "failing to wrap the journal" before rebooting.  How much filesystem activity does it take to "wrap the journal"?  Simply waiting may not do the trick, it sounds as if you really need to generate some filesystem activity.  Most likely, "emerge --sync" would do the trick, but obviously not more often than once daily.  Since we're talking about the journal, likely reads wouldn't do spit - it takes writes or updates.  Any idea how big/many writes?

----------

## NoDataFound

I'd like to know what kind of corruption it produces.

Having a bug is bad in itself, although not the end of the world, but it's better if it's recoverable...

----------

## khayyam

 *depontius wrote:*   

> So the problem appears to be "failing to wrap the journal" before rebooting.  How much filesystem activity does it take to "wrap the journal"?  Simply waiting may not do the trick, it sounds as if you really need to generate some filesystem activity.  Most likely, "emerge --sync" would do the trick, but obviously not more often than once daily.  Since we're talking about the journal, likely reads wouldn't do spit - it takes writes or updates.  Any idea how big/many writes?

 

depontius ... the situation seems to have moved on (as Hu noted above), it's no longer thought to be related to wrapping.

Note: "Update:  It now looks like the reproduction involved something very esoteric indeed, involving using umount -l and shutdowns while the file system was still being unmounted --- and the user had nobarrier specified in the mount options as well." Ted Ts'o

So, I don't think there is much reason to panic, if this wasn't a corner case then there would be hundreds of reports of data loss, and the actual reported cases so far are few.

For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane :) ... but for the rest of us it's best not to blow this out of proportion.

best ... khay

----------

## leifbk

 *khayyam wrote:*   

> 
> 
> For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane  ... but for the rest of us it's best not to blow this out of proportion.
> 
> best ... khay

 

I don't think that will help for other filesystems than the one /var/tmp is mounted on. What about /home for instance?

I for one am still running good ol' ext3, and will keep my newly compiled 3.5.7 kernel until a new one hits stable.

----------

## bandreabis

I can't see any visible difference between 3.3.8 and 3.5.7 (freshly compiled) so I remain with the "not hard masked" one.

----------

## khayyam

 *leifbk wrote:*   

>  *khayyam wrote:*   For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane :) ... but for the rest of us it's best not to blow this out of proportion. 
> 
> I don't think that will help for other filesystems than the one /var/tmp is mounted on. What about /home for instance?

 

leifbk ... certainly keeping the machine up would be enough to not encounter the problem, but as for the jollarity ... it wouldn't help at all *regardless* of what filesystems were modified (note that I pointed out that the current understanding is it's not caused by 'wrapping'), but there is no telling the "truly insane" that. So, was that attempt at humour missed?

 *leifbk wrote:*   

> I for one am still running good ol' ext3, and will keep my newly compiled 3.5.7 kernel until a new one hits stable.

 

A serious bug in the linux kernel has caused users to believe that there is a serious bug in the linux kernel. In a post made to the LKML, Linus Torvalds stated "we're not really sure if this is a bug or not, but we can assure everyone we're reading all of the hullaballoo on slashdot and we'll know more as and when news hits critical mass". The bug, code named "worse than y2k, stuxnet, and Windows 98 combined (WTY2KSTUXNET&W98)", is thought to affect at least three users, and more than ten million blogs and news sites. Users, who until recently had thought that the designation "stable" was an acronym for "no need for backups any mo", are lining up to throw themselves under the wheels of this runaway train; as one commentator noted, "it's worse than Fukushima Daiichi and that other thing ... didn't you read my blog post?" :)

best of the bwaaaaa ... khay

----------

## John R. Graham

Goodness. That's more sarcastic than me!

- John

----------

## khayyam

 *John R. Graham wrote:*   

> Goodness. That's more sarcastic than me!

 

John ... the intention was to deflate the rise in panic with some humor. It seems that this serious bug, though no doubt an annoyance to those hit, is most likely a corner case, and so all the "hullaballoo" needs to step down a gear or three. It's already been said that this is reflecting badly on ext4, and some of the reporting has been out of proportion to the actual severity, so I guess my sarcasm reflects this.

best ... khay

----------

## John R. Graham

Never explain sarcasm; it just ruins it.  :Wink: 

- John

----------

## energyman76b

short: don't do anything stupid and you won't hit the bug.

It is really that simple. Phoronix in the meantime is working hard to earn that Moronix moniker.

----------

## Jaglover

 *John R. Graham wrote:*   

> Never explain sarcasm; it just ruins it. 
> 
> - John

 

 :Laughing:  +1

BTW, I'm not the fourth user hit by this.

----------

## leifbk

 *khayyam wrote:*   

>  *leifbk wrote:*    *khayyam wrote:*   For the paranoid there is the option to keep the machine up until a patch or update is provided, or a 'while :; do emerge -e @world ; done' for the truly insane  ... but for the rest of us it's best not to blow this out of proportion. 
> 
> I don't think that will help for other filesystems than the one /var/tmp is mounted on. What about /home for instance? 
> 
> leifbk ... certainly keeping the machine up would be enough to not encounter the problem, but as for the jollarity ... it wouldn't help at all *regardless* of what filesystems were modified (note that I pointed out that the current understanding is it's not caused by 'wrapping'), but there is no telling the "truly insane" that. So, was that attempt at humour missed?
> ...

 

Not quite, but I got carried away by the implications. BTW, you came up with an excellent method for converting a Gentoo box into a fan heater, but you forgot the --keep-going option. It's getting cold here in Norway now.

----------

## Jaglover

... I have three boxes always on ... is that why my central AC unit just kicked in? Or maybe it's because it's 30C out there? Patting myself on the back for moving from Nordic to Tropic.   :Razz: 

----------

## khayyam

 *leifbk wrote:*   

> BTW, you came up with an excellent method for converting a Gentoo box into a fan heater, but you forgot the --keep-going option. It's getting cold here in Norway now.

 

leifbk ... what? and miss the opportunity to discover another bug, no ... and why not do as we more southern europeans do and throw another IKEA poang rocking chair with korndal brown cushion on the fire ... or is that too Swedish for Norwegian sensibilities?

best ... khay

----------

## anyNiXwilldo

Well I hadn't rebooted in several days, but I noticed this morning 3.6.2 was masked. I knew why, from yesterday's articles. The info making the rounds today was saying it's a rather esoteric (hard to reproduce) bug, which probably meant I had nothing to worry about. However, given I run almost 100% stable, except for things like qpdfview, nomacs and the kernel, I felt it best to back the kernel back down to stable from ~amd64. I umounted my data partition after building 3.5.4-hardened-r1-gnu, prior to rebooting with that kernel. Everything seems to be fine. I just know I don't have the nerves to deal with these newer kernels and whatever very scary bugs they might have.

----------

## platojones

Reading the latest updates at the thread below and considering the fact that this isn't showing up but on 2 machines that anybody knows of so far (nobody has been able to independently reproduce yet), I'd say it's looking very anti-climactic:

http://thread.gmane.org/gmane.linux.kernel/1379725/focus=1381772

----------

## leifbk

 *khayyam wrote:*   

>  ... and why not do as we more southern europeans do and throw another IKEA poang rocking chair with korndal brown cushion on the fire ... or is that too Swedish for Norwegian sensibilities?

 

We love to burn cheap Swedish furniture   :Very Happy: 

We still haven't forgiven the Swedes for Karl XII, who was shot through the head during his Norwegian campaign in 1718. Nobody knows for certain if the bullet was Norwegian or Swedish, but we love to claim the credit. This tends to make the Swedes irate.

----------

## ulenrich

 *platojones wrote:*   

> Reading the latest updates at the thread below and considering the fact that this isn't showing up but on 2 machines that anybody knows of so far (nobody has been able to independently reproduce yet), 

 

Yes, but it was not hardware related but setup:

cascading mounts of mixed ext4 and network devices, where it was forcefully configured to be able to reboot very fast: lazy "umount -l" was used so as not to wait for net devices. And a local machine ext4 partition was mounted on top of a net mount??? And "nobarrier" mount option??

In this special case _and_ if additionally there were some crash-induced reboots, then:

there was data loss after the second reboot!

The clean bit was set when there hadn't been a journal cleanup yet (writeback?). A workaround for this setup would have been forcefsck on the boot cmdline. That would have brought the capabilities of a journaled filesystem into play: the missing data would have been written back. But the additional forcefsck would have kept the system from booting up quickly ...

 :Smile:  not a very generally used setup  :Smile: 

This is why Greg Kroah-Hartman doesn't quickly spin a release to fix the issue. At first all of us who observe the stable patchlevel releases felt a panic attack  :Sad:  because we knew there had been an ext4 feature backport for linux-3.6.2. But the jbd2 patch which obviously caused the data loss would have been applied anyway: it was (thought to be) a fix. Greg Kroah-Hartman should serialize such feature backports to reduce our psycho panics.

[edit]Don't take the last sentence as a serious suggestion, but as a tool to self audit (for me at least).

[edit2]Because Greg does it already when possible.

Last edited by ulenrich on Fri Oct 26, 2012 4:19 pm; edited 2 times in total

----------

## khayyam

FOR IMMEDIATE REPRODUCTION:

In a new development to the mysterious WTY2KSTUXNET&W98 bug a user discovered that by replaying comments from one source to another, A = Ad Infinitum, that the bug was able to transfer itself from A => B and that subsequently the user was able to raise the meaning threshold, MSG, from DUH to OMG with an increase in YMMV. One news source is reported to have reported that in terms of reproducibility this may have nothing to do with MSG itself, but advises that factors read from elsewhere might be subsequently reformulated exponentially. A detailed study of the distribution vector suggests there may be no end to this, but shortly thereafter a counter study, using the same data set, came to the same conclusion, which has experts unsure as to which of the current studies is best suited for further research. In a new development a user discovered that by replaying comments from one source to another, A = Ad Infinitum, that there was a transference from A => B and that subsequently the user was able to raise the meaning threshold, MSG, from DUH to OMG with an increase in YMMV, they went on to suggest that Greg Kroah-Hartman should serialize such feature backports to reduce our psycho panics.

YMMV & best ... khay

----------

## leifbk

It requires some weird settings, for sure. To quote "nix" on the latest lwn.net discussion:

 *Quote:*   

> My latest tests indicate that unless you use journal_checksum or journal_async_commit (which implies journal_checksum), neither of which are the default, you appear to be safe. Unless you use nobarrier *too* (which you really shouldn't unless you have suitable hardware, which means battery-backed disk controllers: a PSU isn't good enough), you will see a journal abort and a readonly remount of the fs at next mount. You need both journal_async_commit *and* nobarrier to get no journal abort and silent fs corruption.
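
The three cases nix describes can be written down as a small decision function. This is only a sketch of the quote above, not anything from the kernel or from Ted: the function name and the result strings are invented for illustration, and treating `barrier=0` as equivalent to `nobarrier` is an assumption about how the option may be spelled. The quote does not say what happens with journal_checksum plus nobarrier but without journal_async_commit, so that combination is reported as unclear rather than guessed at.

```python
def classify_ext4_options(options):
    """Map a comma-separated ext4 mount option string to nix's cases."""
    opts = set(options.split(","))
    async_commit = "journal_async_commit" in opts
    # journal_async_commit implies journal_checksum, per the quote.
    checksum = async_commit or "journal_checksum" in opts
    # Assumption: barriers off may be written 'nobarrier' or 'barrier=0'.
    nobarrier = "nobarrier" in opts or "barrier=0" in opts

    if not checksum:
        return "appears safe"
    if not nobarrier:
        return "journal abort and read-only remount at next mount"
    if async_commit:
        return "silent corruption possible"
    # journal_checksum + nobarrier without async commit: not covered.
    return "unclear from the quote"
```

For example, `classify_ext4_options("rw,relatime,data=ordered")` lands in the first branch, which is where a default mount should end up.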

 

----------

## depontius

A while back when I saw "journal_checksum" on LWN and/or Phoronix, I thought that it looked like a nifty thing to improve data integrity.

How ironic that so far it has done just the opposite.

Thing is, I don't know if my one 3.6.2 system at home has it turned on, or not.  I would have been inclined to turn it on, but generally if the kernel developers advise against it in the help text, I don't.  I haven't had time lately to fiddle with that system.

--> I had that system up last night.  "grep EXT4 /usr/src/linux/.config" had only "regular" ext4 kernel options, and nothing about the journal checksum.  I guess I'm as safe as I ever was.  I've never been one to tweak with mount options, beyond safe stuff like "relatime".
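One caveat on that check: going by the nix quote earlier in the thread, journal_checksum and journal_async_commit are mount options, so the kernel .config is not the place they would show up. The mount table is a better place to look. Below is a hedged Python sketch that scans text in /proc/mounts format for ext4 entries and reports which of the two options are set; the function names are hypothetical, and it assumes the running kernel lists these options in /proc/mounts when they are active.

```python
def ext4_mounts(mounts_text):
    """Yield (mount_point, options) for each ext4 line in /proc/mounts text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line reads: device mount_point fstype options dump pass
        if len(fields) >= 4 and fields[2] == "ext4":
            yield fields[1], fields[3].split(",")

def journal_options_in_use(mounts_text):
    """Return {mount_point: [journal options found]} for ext4 mounts."""
    wanted = ("journal_checksum", "journal_async_commit")
    return {
        mount_point: [opt for opt in options if opt in wanted]
        for mount_point, options in ext4_mounts(mounts_text)
    }
```

On a live box something like `journal_options_in_use(open("/proc/mounts").read())` should do; an empty list for every mount point means neither option is in use there.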

----------

## StormTiberius

Well, I had a fresh Gentoo install and decided to put the newest kernel (3.6.2) on it, since heck, I was not going to update the kernel after the initial install anytime soon, if ever. Everything was running fine, and then I read about the kernel ext4 bug. OK, I just decided not to shut down my computer until a new kernel was released. But no... the day after the bug announcement it started to snow, and what do you know, the snow caused a power outage, and now I have a corrupted home partition. Not sure about root, but the home partition is corrupted, no doubt about it. And they claimed this bug is rare and needs rare options. Besides the discard option (because I have an SSD) I had no other special options. Now I am sitting here twiddling my thumbs, waiting for the UPS to arrive so I can reinstall the system without worrying about further corruption.

----------

## khayyam

 *StormTiberius wrote:*   

> [...] and now I have a corrupted home partition. Not sure about root, but the home partition is corrupted, no doubt about it. And they claimed this bug is rare and needs rare options [...]

 

StormTiberius ... and so you are *sure* that this is related to this, so called, "bug"? So far there has been no evidence to show that the bug causes snow, power failures, or that it is the *only* cause of filesystem corruption.

As all the evidence points to this "bug" then you are now the second person to be hit, congratulations.

best ... khay

----------

## anyNiXwilldo

Fixed, but I will continue with the stable kernels regardless. Time to mark the topic 'SOLVED.'

----------

