# Data-disk or controller failure

## Progman3K

This computer is about ten years old. The first problem I had with it was a disk failure about six months ago, and now the replacement disk has failed too.

I was using the disk when it became read-only.

If I leave the drive listed in fstab:

/dev/sdb1	/home/progman3k reiserfs noatime 0 1

then the computer will either boot and stay in text mode, asking for the root password for maintenance,

or it will begin to fsck the drives and eventually the screen switches off. Even if I wait for hours, nothing happens.

Removing the disk from fstab and restarting, I can log in as root and try to mount the drive with the disk-utility:

 *Quote:*   

> Error mounting /dev/sdb1 at /run/media/progman3k/Data: Command-line `mount -t "reiserfs" -o "uhelper=udisks2,nodev,nosuid" "/dev/sdb1" "/run/media/progman3k/Data"' exited with non-zero exit status 32: mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
> 
>        missing codepage or helper program, or other error
> 
>        In some cases useful info is found in syslog - try
> ...

 

dmesg reports

 *Quote:*   

> [32105.122423] REISERFS (device sdb1): found reiserfs format "3.6" with standard journal
> 
> [32105.122440] REISERFS (device sdb1): using ordered data mode
> 
> [32105.122442] reiserfs: using flush barriers
> ...

 

What should I do?

If the drive-controller is defective then I worry I'll only make things worse. I'm thinking of putting the drive in a newer computer.

How can I recover the files?

Thanks!

----------

## NeddySeagoon

Progman3K,

Install smartmontools and see what the drive thinks of itself.

If the SMART data is OK, it may be the HDD data cable.

```
[32106.257503] ata4.00: failed command: READ DMA
```

means a read failed.

It may be that the drive left it too late to relocate a failing sector and now can't read it.

Post the SMART data if you need a hand to interpret it.

It's a little early to think about data recovery, but making an image with ddrescue is a good place to start.

Be sure to make the ddrescue log, as ddrescue will use it to resume and do retries.

----------

## Progman3K

```
smartctl -i /dev/sdb
```

 *Quote:*   

> smartctl 6.1 2013-03-16 r3800 [x86_64-linux-3.10.17-gentoo] (local build)
> 
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> === START OF INFORMATION SECTION ===
> ...

 

```
smartctl -H /dev/sdb
```

 *Quote:*   

> smartctl 6.1 2013-03-16 r3800 [x86_64-linux-3.10.17-gentoo] (local build)
> 
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> === START OF READ SMART DATA SECTION ===
> ...

 

What should I do now?

----------

## Jaglover

Look at the output of smartctl -data -A /dev/sdb, and also run a long self-test with smartctl -t long /dev/sdb.

----------

## Progman3K

```
smartctl -data -A /dev/sdb
```

```
smartctl 6.1 2013-03-16 r3800 [x86_64-linux-3.10.17-gentoo] (local build)

Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2390

  3 Spin_Up_Time            0x0027   175   175   021    Pre-fail  Always       -       2233

  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       16

  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       7

  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0

  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4906

 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0

 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16

192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       11

193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       4

194 Temperature_Celsius     0x0022   105   091   000    Old_age   Always       -       38

196 Reallocated_Event_Count 0x0032   193   193   000    Old_age   Always       -       7

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       21

198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       90

200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
```

Is that very worrisome? 

The long test should be finished around 1AM EDT

----------

## Jaglover

I'd say there seems to be a problem. Let's see if the test finishes.

```
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       21 
```

----------

## NeddySeagoon

Progman3K,

```
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE 

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       21 
```

That's bad news. The drive would like to reallocate 21 sectors but can't read them. You may or may not get lucky and get one more read.

Keep an eye on this data.

Run ddrescue to make an image of the drive. That means you need another

```
User Capacity: 1,000,204,886,016 bytes [1.00 TB] 
```

of space plus the ddrescue log file.

If the long test completes, the Current_Pending_Sector count should be zero and Reallocated_Sector_Ct (attribute 5) should have increased by at least 21.

It's unlikely the test will complete, as it will abort at the first error.
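
Keeping an eye on these counters is easier with a small filter. The following is only a sketch: `smart_raw` is a made-up helper name, and it assumes the ten-column attribute table that smartctl printed earlier in this thread, with RAW_VALUE as the tenth field.

```shell
# Print the RAW_VALUE column for one SMART attribute ID.
# Reads `smartctl -A` style output on stdin (assumed 10-column layout).
smart_raw() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# Usage (as root), with /dev/sdb as in this thread:
#   smartctl -A /dev/sdb | smart_raw 197   # Current_Pending_Sector
#   smartctl -A /dev/sdb | smart_raw 5     # Reallocated_Sector_Ct
```

Comparing the two numbers before and after a test run shows whether pending sectors were actually reallocated or are still unreadable.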

----------

## Progman3K

I imagine the test has completed, not sure how to read the results.

```
smartctl -l error /dev/sdb
```

```
smartctl 6.1 2013-03-16 r3800 [x86_64-linux-3.10.17-gentoo] (local build)

Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===

SMART Error Log Version: 1

ATA Error Count: 60 (device log contains only the most recent five errors)

   CR = Command Register [HEX]

   FR = Features Register [HEX]

   SC = Sector Count Register [HEX]

   SN = Sector Number Register [HEX]

   CL = Cylinder Low Register [HEX]

   CH = Cylinder High Register [HEX]

   DH = Device/Head Register [HEX]

   DC = Device Command Register [HEX]

   ER = Error register [HEX]

   ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 60 occurred at disk power-on lifetime: 4906 hours (204 days + 10 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 38 ca e9 00 e0  Error: UNC 56 sectors at LBA = 0x0000e9ca = 59850

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  c8 00 38 ca e9 00 e0 08      00:06:14.436  READ DMA

  c8 00 08 c2 e9 00 e0 08      00:06:14.436  READ DMA

  c8 00 08 ba e9 00 e0 08      00:06:14.436  READ DMA

  c8 00 08 02 ea 00 e0 08      00:06:14.353  READ DMA

  c8 00 08 b2 e9 00 e0 08      00:06:14.353  READ DMA

Error 59 occurred at disk power-on lifetime: 4903 hours (204 days + 7 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 38 ca e9 00 e0  Error: UNC 56 sectors at LBA = 0x0000e9ca = 59850

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  c8 00 38 ca e9 00 e0 08      08:58:42.316  READ DMA

  c8 00 08 c2 e9 00 e0 08      08:58:42.316  READ DMA

  c8 00 08 ba e9 00 e0 08      08:58:42.316  READ DMA

  c8 00 08 02 ea 00 e0 08      08:58:42.316  READ DMA

  c8 00 08 b2 e9 00 e0 08      08:58:42.316  READ DMA

Error 58 occurred at disk power-on lifetime: 4895 hours (203 days + 23 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 38 ca e9 00 e0  Error: UNC 56 sectors at LBA = 0x0000e9ca = 59850

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  c8 00 38 ca e9 00 e0 08      00:35:15.014  READ DMA

  c8 00 08 c2 e9 00 e0 08      00:35:15.014  READ DMA

  c8 00 08 ba e9 00 e0 08      00:35:15.014  READ DMA

  c8 00 08 02 ea 00 e0 08      00:35:14.944  READ DMA

  c8 00 08 b2 e9 00 e0 08      00:35:14.944  READ DMA

Error 57 occurred at disk power-on lifetime: 4894 hours (203 days + 22 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 02 b0 94 00 e0  Error: UNC 2 sectors at LBA = 0x000094b0 = 38064

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  c8 00 02 b0 94 00 e0 08      00:03:13.341  READ DMA

  ec 00 00 00 00 00 a0 08      00:03:13.335  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 08      00:03:13.335  SET FEATURES [Set transfer mode]

Error 56 occurred at disk power-on lifetime: 4894 hours (203 days + 22 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 08 aa 94 00 e0  Error: UNC 8 sectors at LBA = 0x000094aa = 38058

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  c8 00 08 aa 94 00 e0 08      00:03:13.300  READ DMA

  c8 00 20 3a 94 00 e0 08      00:03:13.300  READ DMA

  c8 00 08 32 94 00 e0 08      00:03:13.300  READ DMA

  c8 00 20 4a 92 00 e0 08      00:03:13.300  READ DMA

  c8 00 08 42 92 00 e0 08      00:03:13.299  READ DMA

```

```
smartctl -l selftest /dev/sdb
```

```
smartctl 6.1 2013-03-16 r3800 [x86_64-linux-3.10.17-gentoo] (local build)

Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Completed: read failure       90%      4906         38064

# 2  Short offline       Completed without error       00%      4890         -

# 3  Short offline       Completed: read failure       50%      4890         71512

```

So is it possible to recover some of the data? I understand some might be lost, but I'd settle for some data. What should I do?

----------

## Anon-E-moose

I had some problems a few years back where the disks would go offline (drop the SATA connection) and give disk errors. In two cases the disks had bad spots, and I exchanged them.

In the other cases the culprit was the power supply; it just wasn't putting out enough current to satisfy all the disks. I swapped it out, ran a complete disk check (using the HD manufacturer's utility on another machine), and then reloaded all the filesystem data that I had backups for.

In your case, I would try and recover all the data that you can, keeping in mind some of it may be bad.

If you have another machine you might try running disk software on that drive in the other machine.

If it shows as clean, then it might be the power supply (as they're not really designed to last for much longer than 4-5 years)

Anyway, good luck

Edit to add:

 *Quote:*   

> So is it possible to recover some of the data? I understand some might be lost, but I'd settle for some data.

 

It depends on what is in those unreadable sectors. If it is boot data, partition data, or directory data, then whatever is in that spot will more than likely be lost.

----------

## NeddySeagoon

Progman3K,

```
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Completed: read failure       90%      4906         38064 
```

The test failed, the first problem is at block 38064.
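
For anyone wondering where these block numbers come from, here is a rough sketch of the arithmetic. The function names are invented for illustration, and the 512-byte logical sector size is an assumption (check `smartctl -i` for the real value). The LBA is counted from the start of the whole drive, not from the start of the partition.

```shell
# Rebuild a 28-bit LBA from the SN, CL, CH and DH register bytes
# (hex digits without the 0x prefix), as listed in the SMART error log.
lba28_from_regs() {
    echo $(( ((0x$4 & 0x0f) << 24) | (0x$3 << 16) | (0x$2 << 8) | 0x$1 ))
}

# Byte offset of an LBA; the sector size defaults to 512 (an assumption).
lba_to_bytes() {
    echo $(( $1 * ${2:-512} ))
}

lba28_from_regs ca e9 00 e0   # SN CL CH DH from "Error 60"; prints 59850
lba_to_bytes 38064            # first failing LBA of the long self-test
```

The first call reproduces the `LBA = 0x0000e9ca = 59850` that smartctl already decoded, which is a handy check that the registers are being read in the right order.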

What you can get back depends on the type of data that cannot be read.

If it's a directory, all of that directory and all of its children will be threatened. If it's a video, the video will be lost.

The next step is to make an image of the drive or partition with ddrescue. You must not compress the image as you will use it like a real drive/partition later.

ddrescue tries very hard to read your data. It starts at the beginning, much like dd, but when it finds a problem it jumps to the middle of the unrecovered space. This way it gets all the easy-to-read data first.

Eventually, it attempts to step the head into problem areas from both sides.

You must generate the log file.  ddrescue will read it to know what still needs to be done, so you don't have to run ddrescue in one sitting.

You will read it to know what is lost.

Recovery of files in place is no longer possible, nor would you want to as you have no idea when the 21 unreadable blocks will get worse.

Those are the ones the drive knows about.  ddrescue may well find some more.

Check your warranty; it looks like it expires on 16 May 2016.

WD will send you a replacement before you return the dud. I've had to do it with 2 WD Greens.  They want a credit card number, which they will bill if the return is not received.

I had 30 days, so you get a few weeks with both drives at no cost.

----------

## Progman3K

Do I have to copy to a disk of EXACTLY the same size?

There is one partition on the disk, and that is what I want to recover.

So must I have exactly the same block-size, format and everything for ddrescue to work?

How big will the logfile be? Do I have to have the same amount of space to store it as the source requires?

I've never used ddrescue before. I've taken an example from the documentation: To copy from sdb1 to sdb2

```
ddrescue -f -n /dev/sdb1 /dev/sdb2 logfile

ddrescue -d -f -r3 /dev/sdb1 /dev/sdb2 logfile

fsck -v -f /dev/sdb2

mount -t ext2 -o ro /dev/sdb2 /mnt

```

Is that the proper form for what I want to do?

----------

## NeddySeagoon

Progman3K,

You can copy to an output drive that is the same size or bigger.

The copy need only be temporary, to allow data recovery to take place.

If you will do a drive to drive copy, tell ddrescue to work with whole drives, not partitions, that way you get the partition table too.

Partition to partition works too, as does drive/partition to file.

The output space must be >= the input space.
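
A quick size check before starting can save a wasted run. This is a minimal sketch with made-up function names; it assumes util-linux `blockdev` for devices and GNU `stat` for image files.

```shell
# Size in bytes of a block device or a regular file.
size_in_bytes() {
    if [ -b "$1" ]; then
        blockdev --getsize64 "$1"
    else
        stat -c %s "$1"
    fi
}

# Succeeds (exit 0) when the target is at least as large as the source.
# The body runs in a subshell so the variables do not leak.
target_is_big_enough() (
    src_size=$(size_in_bytes "$1") || exit 2
    dst_size=$(size_in_bytes "$2") || exit 2
    [ "$dst_size" -ge "$src_size" ]
)

# Usage (as root), with hypothetical device names:
#   target_is_big_enough /dev/sdX /dev/sdY && echo "target is large enough"
```

When the output is a plain image file on a mounted filesystem, what matters is the free space there instead, which this sketch does not check.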

ddrescue is a block-level tool. It neither knows nor cares what the data in the blocks it works with represent.

The size of the log depends on how many separate faulty areas you have on your drive. If it got to 1 MB I would be surprised.

Here is one I made earlier.

```
# Rescue Logfile. Created by GNU ddrescue version 1.15

# Command line: ddrescue -b 4096 -r 8 -f /dev/sde3 /dev/null /root/rescue_log.txt

# current_pos  current_status

0x18D786D0000     ?

#      pos        size  status

0x00000000  0x16E4BE9E000  +

0x16E4BE9E000  0x00002000  *

0x16E4BEA0000  0xFD4F9D000  +

0x17E20E3D000  0x00003000  *

0x17E20E40000  0x8FBF8000  +

0x17EB0A38000  0x00008000  *

0x17EB0A40000  0x358CC2000  +

0x18209702000  0x0000E000  *

0x18209710000  0x2DE00000  +

0x18237510000  0x00010000  *

0x18237520000  0x01AC0000  +

0x18238FE0000  0x00010000  *

0x18238FF0000  0x012C1000  +

0x1823A2B1000  0x0000F000  *

0x1823A2C0000  0x2D752000  +

0x18267A12000  0x0000E000  *

0x18267A20000  0x11EDD4000  +

0x183867F4000  0x0000C000  *

0x18386800000  0x9F1ED0000  +

0x18D786D0000  0x4260530000  ?
```
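
A log in this format can be totalled per status with a few lines of shell. This is a sketch only; `summarize_ddrescue_log` is a made-up name. It relies on the status characters ddrescue documents: `+` is rescued, `-` is a bad sector, and `?`, `*` and `/` are areas that still need work.

```shell
# Sum the block sizes in a ddrescue log by status.
# The body runs in a subshell so the counters do not leak.
summarize_ddrescue_log() (
    rescued=0 bad=0 pending=0
    while read -r pos size status rest; do
        # Keep only "pos size status" rows; comments, blank lines and
        # the current_pos line do not have two hex fields.
        case "$pos:$size" in
            0x*:0x*) ;;
            *) continue ;;
        esac
        case "$status" in
            '+') rescued=$(( rescued + $size )) ;;
            '-') bad=$(( bad + $size )) ;;
            '?'|'*'|'/') pending=$(( pending + $size )) ;;
        esac
    done < "$1"
    printf 'rescued=%d bad=%d pending=%d\n' "$rescued" "$bad" "$pending"
)

# Usage:
#   summarize_ddrescue_log /root/rescue_log.txt
```

All three figures are in bytes. The `ddrescuelog` tool shipped with ddrescue reports similar totals if it is installed.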

The command line is worth looking at 

```
ddrescue -b 4096 -r 8 -f /dev/sde3 /dev/null /root/rescue_log.txt
```

Here I was just trying to force bad sectors to be read and relocated, hence /dev/null is the output device.

I was trying to breathe some life into a raid5 that had two drives fail within 15 minutes.

The -b tells ddrescue not to use blocks smaller than 4k. You need that as your drive has a 4k physical block size. That's the smallest chunk of data that it can read or write.

-r 8 says to move on after 8 retries.  Big numbers here make ddrescue have more tries but it takes longer.

-f (--force) always seems to be needed; ddrescue wants it before it will overwrite an output device or partition.

/dev/sde3 /dev/null are the input and output devices. /dev/null is a really bad choice for output, but as my raid set was not readable (at all) I had nothing to lose. Don't you do this.

/root/rescue_log.txt is the full path to the logfile.

The usual way to use ddrescue is with a fairly low number of retries, say, -r 3 then read the log and see what happened.

Then run it again with a more aggressive retry setting and everything else unchanged. 

ddrescue will only work on the areas that have not been recovered and will fill in the 'holes' in the copy. 

-M is useful on second and subsequent runs.  Read man ddrescue.
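
The two-stage approach described above might look like the sketch below. It only prints the commands instead of running them, so nothing is written by accident. `plan_rescue` is a made-up name, /dev/sdX and /dev/sdY are placeholders for the failing and the new drive, and the retry counts are simply the ones mentioned in this post, not a recommendation for every drive.

```shell
# Print a gentle first pass and a more aggressive second pass.
# Source, target and log file stay the same between runs so that
# ddrescue can resume from the log.
plan_rescue() {
    printf 'ddrescue -f -b 4096 -r 3 %s %s %s\n' "$1" "$2" "$3"
    printf 'ddrescue -f -b 4096 -r 8 -M %s %s %s\n' "$1" "$2" "$3"
}

plan_rescue /dev/sdX /dev/sdY /root/rescue_log.txt
```

Read the log between the two runs. If the first pass already recovered everything, the second is not needed.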

You can also help the probability of recovery.   More on that after you post your first log.

Using ddrescue to copy to/from the same drive is not useful.  The drive is suspect - all of it.  Get the data off. 

Do not use fsck on the recovered data - it often makes a bad situation worse. It's a last-ditch thing to try.

----------

## Progman3K

I connected the hard-disk (c1) to a new machine with a new cable, to copy everything over to a new disk (b1)

```
ddrescue -v -f -r512 /dev/sdc1 /dev/sdb1 logfile
```

It's been running for 15 hours:

 *Quote:*   

> GNU ddrescue 1.16
> 
> About to copy 500105 MBytes from /dev/sdc1 to /dev/sdb1
> 
>     Starting positions: infile = 0 B,  outfile = 0 B
> ...

 

Sometimes the current rate shoots up. I don't mind if it takes it a long time as long as it gets the data.

Thank you for your help! I'll post back at the end!

----------

## Progman3K

After months of running the task, the problematic disk no longer functions at all: it is no longer detected, so I cannot try to get any more data from it. This is the result:

```
ddrescuelog --show-status logfile
```

 *Quote:*   

> 
> 
> current pos:   116848 MB,  current status: splitting
> 
> domain size:   500105 MB,  in    1 area(s)
> ...

 

So now what? How do I access the data? Do I simply try to mount the partition?

----------

## krinn

 *Progman3K wrote:*   

> So now what? How do I access the data? Do I simply try to mount the partition?

 

Might also be the perfect time to pray.

----------

## Progman3K

I crossed my fingers, issued the mount command and it mounted.

I admit, I got kind of choked-up for a moment.

Of course there is 500GB of data so it's probable some of it is garbled but at least it is accessible!

Thanks for the help, people!

----------

