# [Solved] Filesystem randomly goes read-only

## andmag

Hi,

Since about a week ago I have been experiencing random lock-ups on my home server. It seems to be just a matter of (up)time before it occurs. The last occasion was after about 15 minutes of emerging (gcc, in this case, but that is irrelevant). Emerging ends in an error of the type "Fatal error: can't write ...: Read-only file system". After this, nothing works: "ls" segfaults, the "mount" command is not found, and I cannot reboot or do anything else.

I have tried to check the disks without finding any errors (fsck.ext4 and SMART). Logs are of course empty, since the file system has gone read-only. I have run memory tests and found no problems. After rebooting, the server starts up normally, with just a bit of extra journalling. The file systems are ext4. The affected drive is an SSD. The kernel is 3.1.6.

Anyone out there with the same problem? Can it be traced back to any recent update? I think there were recent updates of e2fsprogs and some others, but to me this goes deeper than an update to utility programs.

Regards,

Andreas

Last edited by andmag on Sat Mar 24, 2012 7:09 pm; edited 1 time in total

----------

## Hu

The filesystem can only be remounted read-only by the kernel or a privileged user program.  I am not aware of any programs that arbitrarily remount read-only like that.  What does dmesg say after the failure?  If dmesg cannot be run, you could try looking at the last lines printed on the system console.

----------

## EatMeerkats

The same thing was happening to me recently when I had a loose SATA cable… check dmesg, if possible, for some messages about ATA resets?

----------

## gentoo_ram

This is most likely the problem.  If there are errors reading from or writing to the devices under the filesystem, the kernel will automatically remount the filesystem read-only to help prevent further corruption.  Either you have a hard drive going bad, or there are issues with the cabling going to the drive, or the kernel driver is having problems talking to the device.
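
To confirm that a filesystem really has been flipped to read-only, one option is to look at the mount options the kernel reports. The following is a minimal sketch, assuming input in the `/proc/mounts` format; the `SAMPLE` text and the helper name `read_only_mounts` are made up for illustration. On ext4, this behaviour is commonly selected with the `errors=remount-ro` mount option.

```python
def read_only_mounts(mounts_text):
    """Return (device, mount_point, fs_type) for every entry mounted read-only."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type, options = fields[:4]
        # Options are comma-separated; match "ro" exactly so that
        # "errors=remount-ro" alone does not count as read-only.
        if "ro" in options.split(","):
            found.append((device, mount_point, fs_type))
    return found


# Hypothetical sample in /proc/mounts format. On a live system one would
# pass open("/proc/mounts").read() instead.
SAMPLE = """\
/dev/sda3 / ext4 ro,relatime,errors=remount-ro 0 0
/dev/sdb1 /mnt/backup ext4 rw,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
"""

for entry in read_only_mounts(SAMPLE):
    print(entry)
```

This only helps while enough of the system is still readable to run a script, which may not be the case once the root device has dropped off the bus.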

----------

## ferreirafm

 *gentoo_ram wrote:*   

> If there are errors reading and writing to devices under the filesystem, the kernel will automatically re-mount the filesystem read-only to help prevent more corruption.

 

That's very likely, and it can also happen when your disk/partition gets full! You can check for disk problems with SMART.

G'Luck

----------

## Hu

 *ferreirafm wrote:*   

>  *gentoo_ram wrote:*   If there are errors reading and writing to devices under the filesystem, the kernel will automatically re-mount the filesystem read-only to help prevent more corruption. 
> 
> That's very likely and also can happen when your disk/partition gets full!

 Could you provide a citation for this claim?  Production quality filesystems should gracefully handle a disk full condition.

----------

## ferreirafm

Hi Hu,

 *Hu wrote:*   

> Could you provide a citation for this claim?

  Not really. However, that was my experience when my BackupPC system automatically remounted the filesystem read-only after the disks got full. Considering that SMART could not point to any disk failure, and that removing older, huge tar files brought things back to normal, this was the only possibility I could think of. But it might not be andmag's problem. I agree, however, that the mount manual does not mention anything about that.

 *Hu wrote:*   

> Production quality filesystems should gracefully handle a disk full condition.

 In my modest opinion, mounting read-only would be a clever way to handle a disk-full situation.

----------

## Hu

 *ferreirafm wrote:*   

>  *Hu wrote:*   Production quality filesystems should gracefully handle a disk full condition.
> 
> In my modest opinion, mount read-only would be a clever way to handle with a disk full situation.

 Unfortunately, this is not a suitable response on a multi-user system.  If quotas are not active, then any user can fill[1] the filesystem if he can write to any directory in it.  An automatic remount read-only on filling the filesystem would then deny service to other users, even if they had sufficient space and only wanted to modify files that had been preallocated.  The better approach is to return an error code to the process which exhausted the final block, so that it can detect the problem.

[1] Some filesystems reserve space to the root user.  However, this can be disabled.  Additionally, it is not clear whether you meant to consider the filesystem full when the unprivileged space is exhausted or to consider it full when there are no unallocated blocks at all.
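
From a program's point of view the two situations are distinguishable by error code. The sketch below is an illustration only (the helper names `describe_write_failure` and `try_write` are hypothetical): a full disk surfaces as `ENOSPC`, whereas the "Read-only file system" message from the original post corresponds to `EROFS`.

```python
import errno


def describe_write_failure(exc):
    """Map an OSError from a failed write to a short diagnosis."""
    if exc.errno == errno.ENOSPC:
        return "no space left on device: free some space and retry"
    if exc.errno == errno.EROFS:
        return "read-only file system: check the kernel log for I/O errors"
    if exc.errno == errno.EIO:
        return "I/O error: the underlying device is failing or unreachable"
    return "other error: " + errno.errorcode.get(exc.errno, "unknown")


def try_write(path, data):
    """Attempt a write; return None on success or a diagnosis on failure."""
    try:
        with open(path, "wb") as handle:
            handle.write(data)
    except OSError as exc:
        return describe_write_failure(exc)
    return None
```

A process that receives `ENOSPC` can clean up and carry on, and other users of the filesystem are unaffected, which is the point made above.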

----------

## andmag

Hi,

I have run some more tests. It does seem like I have some form of transmission errors from the drive. Although I cannot run dmesg after the problem has occurred, I had a look at the system console, as Hu suggested, and there I could see that sda had an "Unhandled error code". After that there is a kernel panic and a call trace is listed.

I have tried to detach and reattach the SATA cable, but this has not helped. The next thing to try is to replace the cable.

It should not be a problem related to the disk getting full, as there is plenty of space left on the device.

I will continue to investigate the problem.

----------

## eccerr0r

Was this a newly assembled system or was this something that popped up?

I've had this happen on many of my machines in the past, and they all pointed to hardware issues.  An easy way to replicate it is to unplug the root hard drive (either power or data cable) while it's running :) (don't do this at home kids, this can cause serious damage).  The kernel notices problems with the disk while writing metadata and force-remounts the afflicted disk read-only, and then gives up when it can't read additional binaries from the disk.  At that point you can't even run dmesg because the disk is offline (unless your shell interpreter has dmesg built in, perhaps if you were running busybox or something like that).  I had something like this happen with a bad SATA controller.  That's another possibility.

I have not had much experience with SSDs, but this does show up with regular magnetic hard drives.

The kernel will not remount readonly if the disk is out of space.  There's no risk of corruption (like inodes/filesystem metadata) in this case, but there will be data loss of course...

----------

## andmag

Hi again,

I have now had time to do some further investigation. It seems the problem is not that random; three times in a row it occurred almost exactly one hour after boot! The first time I could not get any information as to what went wrong, but the other two times I managed to run "dmesg" and redirect its output to another hard drive. Shown below are some lines from "dmesg":

```
[   12.383638] Installing knfsd (copyright (C) 1996 ...).
[ 3611.696051] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[ 3611.696072] ata1.00: cmd 60/00:00:bc:b0:9c/02:00:05:00:00/40 tag 0 ncq 262144 in
[ 3611.696075]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 3611.696092] ata1.00: cmd 60/00:08:bc:b2:9c/02:00:05:00:00/40 tag 1 ncq 262144 in
[ 3611.696095]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 3611.696108] ata1: hard resetting link
[ 3617.050030] ata1: link is slow to respond, please be patient (ready=0)
[ 3621.742032] ata1: COMRESET failed (errno=-16)
[ 3621.742043] ata1: hard resetting link
[ 3627.096028] ata1: link is slow to respond, please be patient (ready=0)
[ 3631.788029] ata1: COMRESET failed (errno=-16)
[ 3631.788040] ata1: hard resetting link
[ 3637.142032] ata1: link is slow to respond, please be patient (ready=0)
[ 3666.824031] ata1: COMRESET failed (errno=-16)
[ 3666.824043] ata1: limiting SATA link speed to 1.5 Gbps
[ 3666.824050] ata1: hard resetting link
[ 3671.872033] ata1: COMRESET failed (errno=-16)
[ 3671.872044] ata1: reset failed, giving up
[ 3671.872051] ata1.00: disabled
[ 3671.872060] ata1.00: device reported invalid CHS sector 0
[ 3671.872068] ata1.00: device reported invalid CHS sector 0
[ 3671.872090] ata1: EH complete
[ 3671.872159] sd 0:0:0:0: [sda] Unhandled error code
[ 3671.872168] sd 0:0:0:0: [sda]  
[ 3671.872180] sd 0:0:0:0: [sda] Unhandled error code
[ 3671.872194] sd 0:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3671.872219] sd 0:0:0:0: [sda] CDB: Read(10)Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3671.872249] sd 0:0:0:0: [sda] CDB: Write(10): 28 00 05 9c b2 bc 00 02 00 00
[ 3671.872296] end_request: I/O error, dev sda, sector 94155452
[ 3671.872312] : 2a 00 03 92 d4 c4 00 00 08 00
[ 3671.872330] end_request: I/O error, dev sda, sector 59954372
[ 3671.872340] Buffer I/O error on device sda3, logical block 6956119
[ 3671.872346] lost page write due to I/O error on sda3
[ 3671.872361] sd 0:0:0:0: [sda] Unhandled error code
[ 3671.872375] sd 0:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3671.872398] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 05 9c b0 bc 00 02 00 00
[ 3671.872441] end_request: I/O error, dev sda, sector 94154940
[ 3671.872511] sd 0:0:0:0: [sda] Unhandled error code
[ 3671.872516] sd 0:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3671.872524] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 02 05 bd f4 00 00 68 00
[ 3671.872543] end_request: I/O error, dev sda, sector 33930740
[ 3671.872569] sd 0:0:0:0: [sda] Unhandled error code
[ 3671.872576] sd 0:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3671.872584] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 05 9c b0 bc 00 00 08 00
[ 3671.872603] end_request: I/O error, dev sda, sector 94154940
[ 3671.872651] sd 0:0:0:0: [sda] Unhandled error code
[ 3671.872658] sd 0:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3671.872665] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 03 85 e4 dc 00 00 08 00
[ 3671.872684] end_request: I/O error, dev sda, sector 59106524
[ 3671.872693] Buffer I/O error on device sda3, logical block 6850138
[ 3671.872705] EXT4-fs warning (device sda3): ext4_end_bio:258: I/O error writing to inode 1714576 (offset 0 size 4096 starting block 7388316)
[ 3671.872764] sd 0:0:0:0: [sda] Unhandled error code
[ 3671.872772] sd 0:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3671.872783] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 03 98 32 a4 00 00 08 00
[ 3671.872807] end_request: I/O error, dev sda, sector 60306084
[ 3671.872819] Buffer I/O error on device sda3, logical block 7000083
[ 3671.872845] EXT4-fs warning (device sda3): ext4_end_bio:258: I/O error writing to inode 1729540 (offset 0 size 4096 starting block 7538261)
[ 3671.872860] sd 0:0:0:0: [sda] Unhandled error code
[ 3671.872868] sd 0:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3671.872878] sd 0:0:0:0: [sda] CDB: 
[ 3671.872883] JBD2: Detected IO errors while flushing file data on sda3-8
[ 3671.872889] Read(10): 28 00 05 9c b0 bc 00 00 08 00
[ 3671.872911] end_request: I/O error, dev sda, sector 94154940
[ 3671.872931] Aborting journal on device sda3-8.
[ 3671.872995] EXT4-fs error (device sda3) in ext4_reserve_inode_write:4126: Journal has aborted
[ 3671.873027] sd 0:0:0:0: [sda] Unhandled error code
[ 3671.873035] sd 0:0:0:0: [sda]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3671.873045] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 02 05 b2 0c 00 00 08 00
*
*
*
```

The error is repeated a significant number of times.
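
For anyone trying to read the `CDB:` lines in the log, here is a minimal sketch of how a 10-byte SCSI Read(10)/Write(10) command block can be decoded. The function name `decode_rw10` is made up for illustration, and the byte count in the comments assumes 512-byte logical sectors.

```python
def decode_rw10(cdb_hex):
    """Decode a 10-byte SCSI Read(10)/Write(10) CDB given as a hex string."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    opcodes = {0x28: "Read(10)", 0x2A: "Write(10)"}
    if cdb[0] not in opcodes:
        raise ValueError("not a Read(10)/Write(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return opcodes[cdb[0]], lba, blocks


# Two CDBs copied from the log above.
print(decode_rw10("28 00 05 9c b2 bc 00 02 00 00"))  # ('Read(10)', 94155452, 512)
print(decode_rw10("2a 00 03 92 d4 c4 00 00 08 00"))  # ('Write(10)', 59954372, 8)
```

The decoded block addresses match the `end_request: I/O error ... sector` lines that follow each CDB, and 512 blocks of 512 bytes is the 262144 shown on the `ncq` lines. Opcode `28` is a read, so the line that labels it `Write(10)` appears to be two kernel messages printed on top of each other rather than a real write.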

Not that I feel much more enlightened, but its non-random nature does point to a software problem. At least to me.

By the way, I did try to exchange the SATA cable, but to no avail.

 *Quote:*   

> Was this a newly assembled system or was this something that popped up?

No, it's been running for about a year.

----------

## windscape

Hi,

One thought is that the flash blocks on the SSD occupied by the root file system could be wearing out. They do have a limited number of write cycles and if all of the spare flash blocks have been re-mapped, then that could explain this issue. 

Perhaps your SSD manufacturer has some sort of utility to read the wear level of the drive, similar to what the OCZ Toolbox can do (unfortunately only available for Windows). Some SSD manufacturers indicate that sort of information via SMART attributes, but smartctl may not be able to interpret them without command-line options specific to your SSD.

----------

## Hu

 *andmag wrote:*   

> Not that I feel much more enlightened, but its non-random nature does point to a software problem. At least to me.

 I interpret this to be a deterministic hardware error that is provoked by a particular software action.  The offending action is scheduled to occur approximately one hour after the system starts.
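
The one-hour figure can be read straight off the log timestamps. The sketch below assumes the bracketed number at the start of each dmesg line is seconds since boot; the helper name `seconds_since_boot` is hypothetical.

```python
import re

# Leading "[ 3611.696051]" style timestamp on a dmesg line.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def seconds_since_boot(line):
    """Return the timestamp of a dmesg line in seconds, or None if absent."""
    match = TIMESTAMP.match(line)
    return float(match.group(1)) if match else None


# Two lines copied from andmag's log.
first_error = seconds_since_boot(
    "[ 3611.696051] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen"
)
gave_up = seconds_since_boot("[ 3671.872044] ata1: reset failed, giving up")

print(f"first ATA exception {first_error / 60:.1f} minutes after boot")
print(f"link resets abandoned {gave_up - first_error:.1f} seconds later")
```

The first exception lands just past the 60-minute mark, and the kernel spends roughly another minute on link resets before it disables the device.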

----------

## andmag

Hi,

I have most likely found the cause of my problem! It turns out to be a firmware-related error.

http://www.tomshardware.com/news/Crucial-m4-Firmware-BSOD,14544.html

My drive has an on-time of 217.7 days, which is too close to the 5184 hours mentioned in the article to be just a coincidence. Now I just have to find a box from which I can update the firmware.

Cheers!
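
For reference, the comparison between the drive's on-time and the threshold from the article works out as follows. This is a quick arithmetic sketch using only the two figures quoted above.

```python
ON_TIME_DAYS = 217.7     # on-time reported for the drive in this post
THRESHOLD_HOURS = 5184   # figure quoted from the linked article

on_time_hours = ON_TIME_DAYS * 24
threshold_days = THRESHOLD_HOURS / 24
hours_past = on_time_hours - THRESHOLD_HOURS

print(f"on-time: {on_time_hours:.1f} hours")
print(f"threshold: {threshold_days:.0f} days")
print(f"past threshold by: {hours_past:.1f} hours")
```

That puts the drive at about 5224.8 hours, roughly 41 power-on hours past the 216-day mark.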

----------

