# Is my drive dying?

## m27315

Hi,

Whenever I run rsnapshot at night, the hard-drive begins to act up.  rsnapshot and other cron jobs report errors like so:

```
rsync: writefd_unbuffered failed to write 4 bytes [sender]: Broken pipe (32)

rsync: write failed on "/var/www/.snapshots/daily.0/localhost/var/www/mywebsite.org/htdocs/audio/mymovie.mkv": Read-only file system (30)

rsync error: error in file IO (code 11) at receiver.c(298) [receiver=3.0.4]

rsync: connection unexpectedly closed (163226 bytes received so far) [sender]

rsync error: error in rsync protocol data stream (code 12) at io.c(632) [sender=3.0.4]

----------------------------------------------------------------------------

rsnapshot encountered an error! The program was invoked with these options:

/usr/bin/rsnapshot daily 

----------------------------------------------------------------------------

ERROR: /usr/bin/rsync returned 12 while processing /var/www/

Could not open logfile /var/log/rsnapshot for writing

Do you have write permission for this file?

/root/bin/backup_mysql.sh: line 4: /root/mydatabase.sql: Read-only file system

chmod: cannot access `/root/mydatabase.sql': No such file or directory

mv: cannot stat `/root/mydatabase.sql': No such file or directory

chown: changing ownership of `/home/user1/my.sql': Read-only file system

bzip2: Can't create output file /home/user1/my.sql.bz2: Read-only file system.

mv: cannot stat `/home/user1/my.sql.bz2': No such file or directory
```

Afterward, I am able to SSH into the box and look around.  But many commands (like shutdown) fail to execute, and listing the contents of random directories reports an IO error.  A reboot seems to be the only way to recover.

The machine seems to behave normally otherwise.  And, I have let the box do an e2fsck at boot-up, which does find and correct errors.

I am guessing HDD pre-fail.  Any other thoughts?

(BTW, for some reason SMART was not activated on the drive.  I have activated it, and smartmontools is reporting PASS, although it does have a few errors logged.   :Rolling Eyes:  )

As a side thought, I have never used RAID0 for a Linux box before now.  Is it worth it?  How much of a pain is it to set up and swap out disks?  Would you recommend LVM, hardware RAID, or some other method?

Thanks!!!

----------

## NeddySeagoon

m27315,

Raid levels are easy to set up, and swapping out disks is easy too. I recommend you set up a raid in degraded mode, then add the last drive so you get to try out the process. raid0 does not have any redundancy, so the loss of one drive means total data loss. That's why you have backups.

From your post I can't tell whether you have a drive problem; it could also be a motherboard or data cable problem.

You should find more information (about your drive) in dmesg and in smartmontools after the failure events.

----------

## energyman76b

I would replace the cabling - and if that doesn't help, replace the harddisk.

And I wouldn't touch Raid0 with a ten-foot pole. Raid0 is only for people who don't care about their data. Harddisks do fail. The more harddisks, the more likely that one fails. One fails, everything is gone.

If you care about your data: Raid1. It spreads read accesses across the mirrors, so reading is sped up. Or Raid5, which is the best combination of speed and data redundancy.
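
To put rough numbers on "the more harddisks, the more likely that one fails", here is a minimal Python sketch. It assumes every disk fails independently with the same probability over some period; the 5% figure and the function names are made up for illustration, and the Raid1 figure ignores the rebuild window after a first failure.

```python
def raid0_loss_probability(p_disk: float, n_disks: int) -> float:
    """Raid0 loses everything as soon as ANY member disk fails."""
    return 1.0 - (1.0 - p_disk) ** n_disks


def raid1_loss_probability(p_disk: float, n_disks: int) -> float:
    """Raid1 loses data only when ALL mirrors fail (rebuild time ignored)."""
    return p_disk ** n_disks


if __name__ == "__main__":
    p = 0.05  # hypothetical chance that a single disk fails in the period
    for n in (1, 2, 4):
        print(f"{n} disk(s): Raid0 loss probability {raid0_loss_probability(p, n):.4f}")
    print(f"2 disk(s): Raid1 loss probability {raid1_loss_probability(p, 2):.4f}")
```

With that made-up 5%, a two-disk Raid0 comes out at 9.75% while a two-disk Raid1 comes out at 0.25%.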

----------

## m27315

Thanks for the help guys!  (BTW, sorry to hijack the thread.  I intended to start a new thread, but clearly I posted a reply instead of starting a new topic.  Maybe a moderator can split the appropriate posts off, if that's best?)

Yes, I will definitely try RAID1.  I want robustness nowadays.  I am too old to stress over the other.   :Smile: 

I will try swapping the cables next time this happens.  Incidentally, I have disabled rsnapshot in cron, and this still happens within 24 hours with all my cron jobs disabled.  rsnapshot simply triggers the problem immediately, but whatever the root problem is, it causes the system to go bananas in less than 24 hours.

If I try to list the contents of the root directory, I get:

```
ls -latr /

ls: cannot access /lib32: Input/output error

ls: cannot access /mnt: Input/output error

total 68

d?????????  ? ?    ?        ?                ? mnt

d?????????  ? ?    ?        ?                ? lib32

drwx------  2 root root 16384 2007-01-05 01:14 lost+found

drwxr-xr-x 13 root root  4096 2008-06-17 13:51 var

drwxr-xr-x 14 root root  4096 2009-01-03 19:19 usr

drwxr-xr-x  3 root root  4096 2009-01-04 00:21 opt

drwxr-xr-x  2 root root  4096 2009-01-04 01:27 boot

lrwxrwxrwx  1 root root     5 2009-01-04 17:31 lib -> lib64

drwxr-xr-x  9 root root  4096 2009-01-04 17:55 lib64

drwxr-xr-x  2 root root  4096 2009-01-05 20:45 bin

drwxr-xr-x  4 root root  4096 2009-01-08 11:41 home

dr-xr-xr-x 87 root root     0 2009-01-10 15:25 proc

drwxr-xr-x 12 root root     0 2009-01-10 15:25 sys

drwxr-xr-x  2 root root  4096 2009-01-10 20:14 sbin

drwxr-xr-x 13 root root  3480 2009-01-10 21:25 dev

drwxr-xr-x 45 root root  4096 2009-01-10 21:34 etc

drwxr-xr-x 19 root root  4096 2009-01-10 21:34 ..

drwxr-xr-x 19 root root  4096 2009-01-10 21:34 .

drwx------  7 root root  4096 2009-01-10 22:13 root

drwxrwxrwt  4 root root  4096 2009-01-11 05:20 tmp
```

See anything weird?

Here's the SMART data:

```
 smartctl -a /dev/sda

smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen

Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===

Model Family:     Seagate Barracuda 7200.11

Device Model:     ST3500320AS

Serial Number:    5QM01WWW

Firmware Version: SD04

User Capacity:    500,107,862,016 bytes

Device is:        In smartctl database [for details use: -P show]

ATA Version is:   7

ATA Standard is:  Exact ATA specification draft version not indicated

Local Time is:    Sun Jan 11 13:55:45 2009 CST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

                                        was completed without error.

                                        Auto Offline Data Collection: Enabled.

Self-test execution status:      (  25) The self-test routine was aborted by

                                        the host.

Total time to complete Offline 

data collection:                 ( 642) seconds.

Offline data collection

capabilities:                    (0x7b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                        General Purpose Logging supported.

Short self-test routine 

recommended polling time:        (   1) minutes.

Extended self-test routine

recommended polling time:        ( 106) minutes.

Conveyance self-test routine

recommended polling time:        (   2) minutes.

SCT capabilities:              (0x003b) SCT Status supported.

                                        SCT Feature Control supported.

                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000f   112   088   006    Pre-fail  Always       -       218496749

  3 Spin_Up_Time            0x0003   094   092   000    Pre-fail  Always       -       0

  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       475

  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2042

  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       4356327589

  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       4134

 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1

 12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always       -       486

184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0

187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1087

188 Unknown_Attribute       0x0032   098   098   000    Old_age   Always       -       4295032849

189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0

190 Airflow_Temperature_Cel 0x0022   070   050   045    Old_age   Always       -       30 (Lifetime Min/Max 30/30)

194 Temperature_Celsius     0x0022   030   050   000    Old_age   Always       -       30 (0 16 0 0)

195 Hardware_ECC_Recovered  0x001a   047   024   000    Old_age   Always       -       218496749

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       5

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       5

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1

ATA Error Count: 1128 (device log contains only the most recent five errors)

        CR = Command Register [HEX]

        FR = Features Register [HEX]

        SC = Sector Count Register [HEX]

        SN = Sector Number Register [HEX]

        CL = Cylinder Low Register [HEX]

        CH = Cylinder High Register [HEX]

        DH = Device/Head Register [HEX]

        DC = Device Command Register [HEX]

        ER = Error register [HEX]

        ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1128 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  04 71 04 9d 00 32 e0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  a1 00 00 00 00 00 a0 02      07:57:48.684  IDENTIFY PACKET DEVICE

  ec 00 00 00 00 00 a0 02      07:57:48.661  IDENTIFY DEVICE

  00 00 00 00 00 00 00 06      07:57:48.501  NOP [Abort queued commands]

  a1 00 00 00 00 00 a0 02      07:57:43.194  IDENTIFY PACKET DEVICE

  ec 00 00 00 00 00 a0 02      07:57:43.171  IDENTIFY DEVICE

Error 1127 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  04 71 04 9d 00 32 e0

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  ec 00 00 00 00 00 a0 02      07:57:48.661  IDENTIFY DEVICE

  00 00 00 00 00 00 00 06      07:57:48.501  NOP [Abort queued commands]

  a1 00 00 00 00 00 a0 02      07:57:43.194  IDENTIFY PACKET DEVICE

  ec 00 00 00 00 00 a0 02      07:57:43.171  IDENTIFY DEVICE

  00 00 00 00 00 00 00 06      07:57:43.014  NOP [Abort queued commands]

...

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline       Completed without error       00%      4126         -

SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.
```

Here are the references to /dev/sda in dmesg:

```
[    3.270520] scsi 0:0:0:0: Direct-Access     ATA      ST3500320AS      SD04 PQ: 0 ANSI: 5

[    3.271020] sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)

[    3.271184] sd 0:0:0:0: [sda] Write Protect is off

[    3.271334] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00

[    3.271357] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

[    3.271698] sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)

[    3.271861] sd 0:0:0:0: [sda] Write Protect is off

[    3.272011] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00

[    3.272032] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

[    3.272310]  sda: sda1 sda2 sda3 sda4

[    3.277548] sd 0:0:0:0: [sda] Attached SCSI disk

...

[    8.749840] kjournald starting.  Commit interval 5 seconds

[    8.750002] EXT3-fs: sda3: orphan cleanup on readonly fs

[    8.750155] ext3_orphan_cleanup: deleting unreferenced inode 833960

[    8.750181] ext3_orphan_cleanup: deleting unreferenced inode 833959

[    8.750188] ext3_orphan_cleanup: deleting unreferenced inode 833958

[    8.750194] ext3_orphan_cleanup: deleting unreferenced inode 833957

[    8.750200] ext3_orphan_cleanup: deleting unreferenced inode 833956

[    8.750205] EXT3-fs: sda3: 5 orphan inodes deleted

[    8.750355] EXT3-fs: recovery complete.

[    8.755033] EXT3-fs: mounted filesystem with ordered data mode.

[    8.755200] VFS: Mounted root (ext3 filesystem) readonly.

[    8.755366] Freeing unused kernel memory: 480k freed

[    8.755713] Write protecting the kernel read-only data: 6040k

[    9.169288] khelper used greatest stack depth: 5216 bytes left

[    9.695106] stty used greatest stack depth: 4592 bytes left

[   10.095949] udevadm used greatest stack depth: 4520 bytes left

[   10.291137] usb usb2: uevent

[   10.291164] usb 2-0:1.0: uevent

[   10.291270] usb usb1: uevent

[   10.291289] usb 1-0:1.0: uevent

[   11.219301] EXT3 FS on sda3, internal journal

[   11.289742] sort used greatest stack depth: 4504 bytes left

[   12.542691] kjournald starting.  Commit interval 5 seconds

[   12.545857] EXT3 FS on sda4, internal journal

[   12.545861] EXT3-fs: mounted filesystem with ordered data mode.

[   12.643508] Adding 4008208k swap on /dev/sda2.  Priority:-1 extents:1 across:4008208k

[   15.339397] ps used greatest stack depth: 4448 bytes left

...
```

hdparm included for completeness:

```
hdparm /dev/sda

/dev/sda:

 IO_support    =  0 (default) 

 readonly      =  0 (off)

 readahead     = 256 (on)

 geometry      = 60801/255/63, sectors = 976773168, start = 0
```

Any more thoughts?  If I swap out the cables and the failure persists, would you guess HDD or MB?  Unfortunately, I don't have a spare to try in an experiment. ... I would guess HDD, even though I had used it as my main drive in another computer for a year or so with no problems, and it's relatively new, but that doesn't mean much.

Do any of the above errors clearly suggest MB or HDD to you?

Thanks!!!

----------

## energyman76b

5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2042 

That is pretty high, and so is the error count. Try a different cable ASAP (and don't even try fsck with a possibly broken cable; it can make everything much worse). If the harddisk still misbehaves, I would put money on a bad disk.

----------

## energyman76b

example of a good harddisk:

 1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0        

  3 Spin_Up_Time            0x0007   085   085   011    Pre-fail  Always       -       5440     

  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       466      

  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0        

  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0        

  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       9776     

  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       3347     

 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0        

 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0        

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       459      

 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0        

183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0        

184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always       -       0        

187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0        

188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0        

190 Airflow_Temperature_Cel 0x0022   078   067   000    Old_age   Always       -       22 (Lifetime Min/Max 14/22)

194 Temperature_Celsius     0x0022   078   061   000    Old_age   Always       -       22 (Lifetime Min/Max 14/24)

195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       223953                     

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0                          

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0                          

198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0                          

199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0                          

200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0                          

201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

----------

## m27315

Well, it just died again - second time today!  Things are going downhill quickly, it seems.  I swapped out the cable this time.  We'll see what happens.

You know, I have always had questions about the SMART data.  I always thought the "VALUE" column was the real, meaningful value that should be examined, whereas the "RAW_VALUE" column referred to the actual contents of the register, which could be inverted, bit-shifted, offset, etc, and which was basically useless without knowledge of how to interpret the data.  Maybe now is a good time for me to research that ...   :Rolling Eyes:  For example, in this row:

```
SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE 

...

  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2042 
```

Does this mean that I have 100 sectors available for reallocation (good), and when I get down to 36 (bad), the drive is exhausted of available sectors?  Or, does it mean that the drive has already reallocated 2042 sectors (very bad)?  I tend toward the first one, but I need to research this...  I have never known how to really interpret this data.

Thanks again for the help and the comparison data - that is very helpful.

----------

## energyman76b

http://www.t13.org/Documents/UploadedDocuments/docs2005/e05148r0-ACS-SMARTAttributesAnnex.pdf

http://smartmontools.sourceforge.net/faq.html

----------

## NeddySeagoon

m27315,

The drive is dying:-

```
  1 Raw_Read_Error_Rate     0x000f   112   088   006    Pre-fail  Always       -       218496749 

195 Hardware_ECC_Recovered  0x001a   047   024   000    Old_age   Always       -       218496749 
```

It's generating a lot of errors and working hard to correct them - those two numbers are both very high (bad) and identical (good), so it's winning at the moment.

```
Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       4356327589
```

is a bad sign and your 

```
Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2042 
```

is high.

There have been no errors on the interface 

```
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
```

so that's good.

The drive is working hard to provide correct data by using redundant data and retries. It will soon fail.
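
On the VALUE versus RAW_VALUE question, here is a minimal Python sketch of how the columns relate, assuming the 10-column smartctl 5.38 layout posted in this thread. smartctl treats an attribute as failed when the normalized VALUE is at or below THRESH; the RAW_VALUE is vendor-specific, although for attributes 5, 197 and 198 it is usually a plain sector count. The names `SmartAttr`, `parse_attr_line` and `below_threshold` are made up for this example.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized, vendor-scaled; higher is healthier
    worst: int   # lowest normalized value recorded so far
    thresh: int  # attribute counts as failed when value <= thresh
    raw: str     # vendor-specific raw field, kept as text


def parse_attr_line(line: str) -> SmartAttr:
    """Parse one row of the smartctl attribute table (10 columns assumed)."""
    f = line.split(None, 9)  # the 10th field keeps any trailing "(...)" text
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), f[9].strip())


def below_threshold(attr: SmartAttr) -> bool:
    """True when the normalized VALUE has dropped to or below THRESH."""
    return attr.thresh > 0 and attr.value <= attr.thresh


# Rows copied from the smartctl output posted earlier in this thread.
ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2042
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       5
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    for row in ROWS.splitlines():
        a = parse_attr_line(row)
        print(f"{a.name:24} value={a.value} thresh={a.thresh} "
              f"failed={below_threshold(a)} raw={a.raw}")
```

None of those rows is below its threshold, which is why the overall self-assessment still says PASSED; it is the raw counts (2042 reallocated, 5 pending) that give the warning.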

----------

## Sysa

Let me correct you a little:

 *NeddySeagoon wrote:*   

> m27315,
> 
> The drive is dying:-
> 
> ...

 

FYI: That's OK (as long as Raw_Read_Error_Rate == Hardware_ECC_Recovered). Some OEMs (e.g. Seagate) show these counts, while others hide this info.
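
As a rough illustration of why a huge Seagate raw number need not mean a huge number of errors, here is a small Python sketch. It relies on a commonly repeated but unofficial reading of these Seagate attributes, which is an assumption here and not something Seagate documents: the lower 32 bits of the raw value count operations and the bits above them count errors. The function name `split_seagate_raw` is made up for this example.

```python
from typing import Tuple


def split_seagate_raw(raw: int) -> Tuple[int, int]:
    """Split a raw value into (upper bits, lower 32 bits).

    ASSUMPTION (unofficial): upper bits = error count,
    lower 32 bits = number of operations.
    """
    return raw >> 32, raw & 0xFFFFFFFF


if __name__ == "__main__":
    # Raw values copied from the smartctl output posted in this thread.
    for name, raw in (("Raw_Read_Error_Rate", 218496749),
                      ("Seek_Error_Rate", 4356327589)):
        errors, operations = split_seagate_raw(raw)
        print(f"{name}: {errors} error(s) in {operations} operations")
```

Under that assumption, the Seek_Error_Rate raw value in this thread splits into 1 error and 61360293 seeks, and the Raw_Read_Error_Rate value has nothing in the upper bits at all.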

The worst thing is 

```
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       5

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       5 

```

It means the HDD still has 5 bad blocks that it could not read and has not been able to relocate yet, on top of the 2042 sectors it has already relocated...

Anyway, your conclusion (diagnosis  :Wink:  ) is correct:

 *Quote:*   

> ... It will soon fail.

 

----------

## m27315

The HDD died last night!   :Sad: 

Thanks for the help and explanations!  Now I will better understand the warning signs next time.   :Smile: 

----------

## tnt

```
Model Family:     Seagate Barracuda 7200.11 
```

read this:

https://forums.gentoo.org/viewtopic-t-729107.html

----------

