# [Solved] ReiserFS bug or dying hard drive?

## Crunchy

I've been having problems with one of my drives: an app in X will suddenly freeze for a few minutes, then come back to life.  I also noticed occasional errors being fixed on the drive at boot, so I decided to look into it.

At the GRUB menu I chose the rescue option, which loads BusyBox.  When I ran reiserfsck on the drive I got these errors (I couldn't capture the output to a file, so I wrote it down):

```
ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x380000 action 0x2
ata4.00: (BMDMH stat 0x24)
ata4.00: cmd c8/00: lf:
<snip>
Emask 0x10 (ATA bus error)
ata4.00: soft resetting port
```

This error, or similar ones, would repeat over and over for several minutes, and eventually reiserfsck would complete with "No Corruptions Found".  I repeated the reiserfsck run several times to try to capture the output.  Eventually it gave up, saying that there was probably a hardware fault.  I did capture that output, but it's on a Linux partition so I'll have to add it later.

The weird thing is that if I leave my computer off for a little while the drive will run fine for a while (perhaps days) and then suddenly start producing these problems.

Are there any further tests I can do to make sure it's the drive at fault and not something else, such as ReiserFS?

I'm using the ReiserFS included in the latest stable gentoo-sources.  The other two drives are running ReiserFS with no problems.  I've seen the drive completely disappear from the BIOS startup screen, which seems to support the theory that the drive is failing.  I have a couple of Windows partitions on the problem drive and haven't noticed any major problems there.

Last edited by Crunchy on Sat Dec 08, 2007 5:34 am; edited 1 time in total

----------

## lonrot_m

There should be some test options in your BIOS that can check the "health" of the disk.  There are also several utilities for checking it; I'll post them later.

----------

## Crunchy

Here's the output I've got from reiserfsck a couple of times after running it repeatedly:

```
reiserfsck 3.6.19 (2003 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to reiserfs-list@namesys.com, **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/sdc6
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):

The problem has occurred looks like a hardware problem. If you have
bad blocks, we advise you to get a new hard drive, because once you
get one bad block  that the disk  drive internals  cannot hide from
your sight,the chances of getting more are generally said to become
much higher  (precise statistics are unknown to us), and  this disk
drive is probably not expensive enough  for you to you to risk your
time and  data on it.  If you don't want to follow that follow that
advice then  if you have just a few bad blocks,  try writing to the
bad blocks  and see if the drive remaps  the bad blocks (that means
it takes a block  it has  in reserve  and allocates  it for use for
of that block number).  If it cannot remap the block,  use badblock
option (-B) with  reiserfs utils to handle this block correctly.

bread: Cannot read the block (2): (Input/output error).

Aborted
```

I don't have anything in my BIOS to test disks but I'll download the SeaTools for Windows (it's a Seagate drive) and see if it reports anything.

----------

## jexxie

You can emerge sys-apps/smartmontools to check your disk health.

Here are the commands of note that you'll want (replace /dev/hda with your drive's device node):

```
smartctl -a /dev/hda
smartctl -H /dev/hda
smartctl -t short /dev/hda
smartctl -l selftest /dev/hda
```

----------

## Crunchy

Thanks jexxie, I'll try those later.  I just ran some tests using SeaTools and they passed: Short Drive Self Test, Long DST, Short Generic.  Trying Long Generic now; when they say long, they mean it ;)

----------

## Crunchy

SMART info:

```
smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
Device Model:     ST3200822AS
Serial Number:    3LJ2D7FR
Firmware Version: 3.01
User Capacity:    200,049,647,616 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Sat Nov 24 18:30:59 2007 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 111) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   058   051   006    Pre-fail  Always       -       59580979
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1622
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000f   089   060   030    Pre-fail  Always       -       925394117
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       17394
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1609
194 Temperature_Celsius     0x0022   031   047   000    Old_age   Always       -       31
195 Hardware_ECC_Recovered  0x001a   058   051   000    Old_age   Always       -       59580979
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   125   000    Old_age   Always       -       462
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 626 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 626 occurred at disk power-on lifetime: 17358 hours (723 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 af 1f 80 e7  Error: ICRC, ABRT at LBA = 0x07801faf = 125837231

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 a8 1f 80 e7 00      00:47:34.229  READ DMA
  ca 00 07 19 20 80 e7 00      00:47:34.228  WRITE DMA
  ca 00 01 18 20 80 e7 00      00:47:34.219  WRITE DMA
  25 00 07 d9 1f 3c e0 00      00:47:34.218  READ DMA EXT
  25 00 01 d8 1f 3c e0 00      00:47:34.218  READ DMA EXT

Error 625 occurred at disk power-on lifetime: 17358 hours (723 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 50 23 90 e0  Error: ICRC, ABRT at LBA = 0x00902350 = 9446224

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 80 d1 22 90 e0 00      00:47:32.900  READ DMA EXT
  25 00 01 d0 22 90 e0 00      00:47:32.899  READ DMA EXT
  c8 00 07 09 34 49 eb 00      00:47:32.896  READ DMA
  c8 00 01 08 34 49 eb 00      00:47:32.896  READ DMA
  c8 00 07 d1 1d 49 eb 00      00:47:32.889  READ DMA

Error 624 occurred at disk power-on lifetime: 17358 hours (723 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 77 26 58 ef  Error: ICRC, ABRT at LBA = 0x0f582677 = 257435255

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 07 71 26 58 ef 00      00:47:31.516  READ DMA
  c8 00 01 70 26 58 ef 00      00:47:31.516  READ DMA
  c8 00 07 79 26 58 ef 00      00:47:31.515  READ DMA
  c8 00 01 78 26 58 ef 00      00:47:31.530  READ DMA
  c8 00 07 29 21 58 ef 00      00:47:31.530  READ DMA

Error 623 occurred at disk power-on lifetime: 17358 hours (723 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 a7 24 ec ea  Error: ICRC, ABRT at LBA = 0x0aec24a7 = 183248039

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 3f 69 24 ec ea 00      00:47:29.148  READ DMA
  c8 00 01 68 24 ec ea 00      00:47:29.148  READ DMA
  c8 00 1f 49 24 ec ea 00      00:47:29.139  READ DMA
  c8 00 01 48 24 ec ea 00      00:47:29.138  READ DMA
  c8 00 07 41 24 ec ea 00      00:47:29.129  READ DMA

Error 622 occurred at disk power-on lifetime: 17358 hours (723 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 0f 2c f5 e0  Error: ICRC, ABRT at LBA = 0x00f52c0f = 16067599

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 07 09 2c f5 e0 00      00:47:27.177  READ DMA EXT
  25 00 07 09 2c f5 e0 00      00:47:27.177  READ DMA EXT
  25 00 01 08 2c f5 e0 00      00:47:27.173  READ DMA EXT
  25 00 07 81 0c f5 e0 00      00:47:27.173  READ DMA EXT
  25 00 01 80 0c f5 e0 00      00:47:27.160  READ DMA EXT
  25 00 07 39 ec f4 e0 00      00:47:27.159  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     17364         -
# 2  Short offline       Completed without error       00%     17363         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
```

SMART health:

```
smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
```

SMART short self test:

```
smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     17394         -
# 2  Extended offline    Completed without error       00%     17364         -
# 3  Short offline       Completed without error       00%     17363         -
```

There are errors in the log, but the self-tests pass without problems and the status is shown as healthy.  It's strange that this only happens with this one drive (of three) while the drive reports as healthy.  I've used the drive to play games under Windows for the last couple of days with no problems.  Any ideas what is causing this?
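For anyone who wants to pull the counters out of a saved `smartctl -a` dump instead of reading the table by eye, here is a minimal Python sketch. It assumes the column layout printed by smartctl 5.37 above. The `WATCH` list and the function names are my own choices for illustration, not anything smartmontools defines, and a non-zero raw value is only a hint to look closer, not a verdict on the drive.

```python
import re

# One row of the "Vendor Specific SMART Attributes" table, assuming the
# smartctl 5.37 layout shown above:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)

# Hypothetical watch list (an assumption of this sketch): counters where a
# non-zero raw value is usually worth a second look.
WATCH = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def parse_attributes(text):
    """Return {attribute_name: {"id", "value", "worst", "thresh", "raw"}}."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            fields = match.groupdict()
            name = fields.pop("name")
            attrs[name] = {key: int(val) for key, val in fields.items()}
    return attrs


def nonzero_watched(attrs):
    """Return the raw values of watched attributes that are above zero."""
    return {
        name: attrs[name]["raw"]
        for name in WATCH
        if name in attrs and attrs[name]["raw"] > 0
    }


# Three rows copied from the output above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       1
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   125   000    Old_age   Always       -       462
"""

if __name__ == "__main__":
    print(nonzero_watched(parse_attributes(SAMPLE)))
```

On the three sample rows this reports `Reallocated_Sector_Ct` (1) and `UDMA_CRC_Error_Count` (462), which is the same pair that stands out when reading the full table.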

----------

## Crunchy

This issue reared its ugly head again.  I came back to my PC to find it using 100% CPU.  I think the drive was producing errors and KDE was going nuts because of it.  I killed X and rebooted.

To my horror there were some errors on bootup, but KDM started.  The first time I tried logging in I was kicked back to KDM; the second time I got in, but my session settings had been reset and KTorrent was saying it couldn't find hundreds of files.

I restarted and booted into BusyBox.  I fixed some errors on /home with reiserfsck --fix-fixable and did the same on /var.  However, when I rebooted, /var was still throwing errors.  This time it said to run reiserfsck --rebuild-tree, so I did.  That failed with an out-of-space error!

This was going from bad to worse: /var couldn't be rebuilt because it was full, but I couldn't delete any files until it was rebuilt.  I googled the error and found that the latest reiserfsprogs fixes this.  I then found it was hard masked in Portage :(.  I used my LFS Live CD, as that had the latest version on it, and that rebuilt /var fine.

When I checked /var I found that I had a 2.6GB /var/log/messages!  Holy crap :D.  I'd also lost a lot of files on /home.

I'm left in a very awkward position.  I don't want to replace the drive, as Windows has no problems with it: I run games and store documents on that drive with no problems at all.  I can only think that my nForce 3 motherboard has a buggy SATA controller that either the kernel or ReiserFS can't deal with.  It's odd, though, as it only happens to one of the drives.  I can't run Linux like this; I can't afford to lose files all the time.

I've lost confidence in ReiserFS.  reiserfsck is bugged for sure: one version fails if the filesystem is full, and the latest version is hard masked for other reasons.  I think I'll move to the 2.6.23 kernel and use ext3 instead.

----------

## Crunchy

Hmm, I do seem to have a problem with it in Windows!  I was playing ET:QW when the FPS suddenly dropped badly.  I rebooted into Linux and had problems with the drive's LVM/ext3 volumes.

I just had the framerate problem in ET:QW again and quickly quit before it crashed.  When I checked Task Manager there was 40+% CPU usage, but it said all processes were idle.  That's something I've never seen before: 40% CPU usage but 99% idle ;).

I'll replace the hdd.

----------

## Crunchy

I was looking at this again, and it seems the errors in the log were telling.  OK, the self-tests were passing, but the errors seem to indicate that commands were being aborted because of failed CRC checks (the "ICRC, ABRT" entries in the SMART error log), i.e. the drive was dying.  It's now dead :).
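To show where "ICRC, ABRT" and the LBA come from, here is a rough Python sketch that decodes the ER register and the SN/CL/CH/DH registers from the error log posted earlier. The bit names follow the ATA error register layout for DMA read commands as I understand it from ATA/ATAPI-6. The function names `decode_error` and `lba28` are made up for this example, so treat it as an illustration rather than how smartctl itself does it.

```python
# Error register bits for DMA read commands (assumed from ATA/ATAPI-6).
# ICRC means the data failed its CRC check on the interface during an
# Ultra DMA transfer; ABRT means the command was aborted.
ERROR_BITS = (
    (0x80, "ICRC"),
    (0x40, "UNC"),
    (0x20, "MC"),
    (0x10, "IDNF"),
    (0x08, "MCR"),
    (0x04, "ABRT"),
    (0x02, "NM"),
)


def decode_error(er):
    """Return the names of the bits set in the ER register, high bit first."""
    return [name for mask, name in ERROR_BITS if er & mask]


def lba28(sn, cl, ch, dh):
    """Rebuild the 28-bit LBA from the logged SN, CL, CH and DH registers.

    Only the low four bits of DH belong to the address.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


if __name__ == "__main__":
    # Error 626 above: ER=84, SN=af, CL=1f, CH=80, DH=e7
    print(decode_error(0x84))
    print(lba28(0xAF, 0x1F, 0x80, 0xE7))
```

For error 626 this gives ICRC plus ABRT and LBA 125837231, matching the `Error: ICRC, ABRT at LBA = 0x07801faf = 125837231` line smartctl printed.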

----------

## avnielknight

I had a RAID 0+1 array running on ReiserFS totally die on me last week (I tried fixing it, even rebuilding the superblock, to no avail).  I ran a lot of checks on the drives' health and they came out perfect.

I formatted the drives as ext3 and have not encountered any issues so far.

----------

