# Need help interpreting smartctl errors.

## Budoka

I noticed some wonky behavior with my laptop, particularly when booting into the Windows partition. The disk read light is active 100 percent of the time. The strange thing is that when booting into the Gentoo partition the same thing doesn't happen.

Anyway, I am trying to figure out if my HD is damaged and, if so, whether it is "unrepairable, I shouldn't use it anymore" type damage. But I am having trouble interpreting the smartctl error reports. It also confuses me that it states that there are errors but the "General Health" "Passed". Anyway, thanks as always.

```
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.9.34-gentoo] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Samsung SpinPoint M8 (AF)
Device Model:     ST1000LM024 HN-M101MBB
Serial Number:    S2RQJ9DC508787
LU WWN Device Id: 5 0004cf 2076cdc38
Firmware Version: 2AR10002
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Sep  3 16:25:01 2017 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)   Offline data collection activity
               was never started.
               Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121)   The previous self-test completed having
               the read element of the test failed.
Total time to complete Offline
data collection:       (13080) seconds.
Offline data collection
capabilities:           (0x5b) SMART execute Offline immediate.
               Auto Offline data collection on/off support.
               Suspend Offline collection upon new
               command.
               Offline surface scan supported.
               Self-test supported.
               No Conveyance Self-test supported.
               Selective Self-test supported.
SMART capabilities:            (0x0003)   Saves SMART data before entering
               power-saving mode.
               Supports SMART auto save timer.
Error logging capability:        (0x01)   Error logging supported.
               General Purpose Logging supported.
Short self-test routine
recommended polling time:     (   2) minutes.
Extended self-test routine
recommended polling time:     ( 218) minutes.
SCT capabilities:           (0x003f)   SCT Status supported.
               SCT Error Recovery Control supported.
               SCT Feature Control supported.
               SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       22023
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   089   088   025    Pre-fail  Always       -       3463
  4 Start_Stop_Count        0x0032   083   083   000    Old_age   Always       -       17715
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       17559
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       425
 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       3383
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       225
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   061   039   000    Old_age   Always       -       39 (Min/Max 10/63)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       22
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       21
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   001   001   000    Old_age   Always       -       65716
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       425
225 Load_Cycle_Count        0x0032   094   094   000    Old_age   Always       -       65321

SMART Error Log Version: 1
ATA Error Count: 3
   CR = Command Register [HEX]
   FR = Features Register [HEX]
   SC = Sector Count Register [HEX]
   SN = Sector Number Register [HEX]
   CL = Cylinder Low Register [HEX]
   CH = Cylinder High Register [HEX]
   DH = Device/Head Register [HEX]
   DC = Device Command Register [HEX]
   ER = Error register [HEX]
   ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 17172 hours (715 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 18 14 8b ee  Error: UNC 8 sectors at LBA = 0x0e8b1418 = 243995672

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 18 14 8b ee 00      00:00:17.003  READ DMA
  35 00 c0 40 b5 d9 e0 00      00:00:17.003  WRITE DMA EXT
  35 00 40 00 b0 d9 e0 00      00:00:17.003  WRITE DMA EXT
  35 00 c0 40 ad d9 e0 00      00:00:17.003  WRITE DMA EXT
  35 00 40 00 a8 d9 e0 00      00:00:17.003  WRITE DMA EXT

Error 2 occurred at disk power-on lifetime: 17168 hours (715 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 18 14 8b ee  Error: UNC 8 sectors at LBA = 0x0e8b1418 = 243995672

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 18 14 8b ee 00      00:00:02.845  READ DMA
  c8 00 08 20 fb 8b ee 00      00:00:02.845  READ DMA
  c8 00 08 d0 39 8c ee 00      00:00:02.845  READ DMA
  c8 00 08 d0 1d cc ee 00      00:00:02.845  READ DMA
  c8 00 08 f0 1d cc ee 00      00:00:02.845  READ DMA

Error 1 occurred at disk power-on lifetime: 17168 hours (715 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 18 14 8b ee  Error: UNC 8 sectors at LBA = 0x0e8b1418 = 243995672

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 18 14 8b ee 00      00:00:02.826  READ DMA
  25 00 08 c8 16 4c e0 00      00:00:02.826  READ DMA EXT
  25 00 08 b8 16 4c e0 00      00:00:02.826  READ DMA EXT
  25 00 08 98 16 4c e0 00      00:00:02.826  READ DMA EXT
  25 00 08 90 16 4c e0 00      00:00:02.826  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     17559         53857424
# 2  Extended offline    Completed: read failure       90%     17205         48641664
# 3  Short offline       Completed: read failure       90%     17205         53857424

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Completed_read_failure [90% left] (0-65535)
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
```

Code tags added for easy reading -- NeddySeagoon
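
A side note on the register dump in the error log above: the LBA that smartctl prints looks like it is just the SN, CL and CH registers plus the low four bits of DH glued together, which is how 28-bit addressing (as used by READ DMA) packs the address. A minimal Python sketch to check that reading; the function name is made up for illustration:

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine ATA task-file registers into a 28-bit LBA.

    Bits 0-7 come from SN, 8-15 from CL, 16-23 from CH and
    24-27 from the low nibble of DH.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values from the "40 51 08 18 14 8b ee" line in the error log.
lba = lba28_from_registers(sn=0x18, cl=0x14, ch=0x8B, dh=0xEE)
print(hex(lba), lba)  # 0xe8b1418 243995672
```

All three logged errors point at the same LBA, so they are repeated failures on one spot rather than three separate bad areas.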

----------

## frostschutz

Your drive has read errors - get a new one.

 *Quote:*   

> but the "General Health" "Passed".

 

This is a false friend; it's common for this to say "passed" even on a drive that is a complete goner.

You have to look at the reallocated/pending/uncorrectable sector counts. If they are not zero, the drive has issues.

You even ran self-tests that ended with read failure. Get a new drive ASAP. If you don't have a backup, use ddrescue to try and copy your data over.
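
As a rough illustration of that check, here is a small Python sketch that scans the attribute table for the reallocated/pending/uncorrectable counters and reports any that are not zero. The function name and the choice of IDs 5, 196, 197 and 198 are assumptions for this example, and the parsing assumes the ten-column layout shown in the output above:

```python
# Attribute IDs assumed to matter here: reallocated, pending, uncorrectable.
WATCHED = {5, 196, 197, 198}


def nonzero_sector_counts(smart_table: str) -> dict:
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in smart_table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


# Rows copied from the smartctl output posted above.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       22
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       21
"""
print(nonzero_sector_counts(sample))
# {'Current_Pending_Sector': 22, 'Offline_Uncorrectable': 21}
```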

----------

## NeddySeagoon

Budoka,

The drive is scrap.

```
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       22
```

It has 22 sectors it knows it can't read.  There may be more.

```
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
```

It has failed to reallocate any sectors so far.

What is supposed to happen is that the drive detects when a sector is getting difficult to read and copies the data to a 'spare', so it's not lost.

This hides bad blocks from the operating system.

----------

## Budoka

 *NeddySeagoon wrote:*   

> Budoka,
> 
> The drive is scrap.
> 
> ...

 

Well, that bites. Thanks for the info everyone. This forum has yet to let me down.

----------

## NTU

 *NeddySeagoon wrote:*   

> The drive is scrap.

  I read this as "the drive is crap."

----------

## Hu

Well, he didn't even get to 2 power-on years, so that's not an entirely unreasonable interpretation.  :Smile:   715 power-on days = 1.95 years.

This drive looks like it sees fairly frequent cycles though.  3383 power cycles in ~730 power-on days, with a start-stop-count of 17715.  That suggests it was start/stopped more than once per power-on hour, which is extremely high for my taste.  Some drives may be meant to operate that way, but I prefer to see start-stop-count much closer to power-cycle-count.
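
The arithmetic behind those ratios, as a quick Python sketch using the raw values from the posted SMART table (the variable names are just for this example):

```python
# Raw values copied from the smartctl attribute table in the first post.
power_on_hours = 17559     # attribute 9, Power_On_Hours
start_stop_count = 17715   # attribute 4, Start_Stop_Count
power_cycle_count = 3383   # attribute 12, Power_Cycle_Count

power_on_days = power_on_hours / 24
start_stops_per_hour = start_stop_count / power_on_hours
start_stops_per_power_cycle = start_stop_count / power_cycle_count

print(f"power-on days:              {power_on_days:.0f}")
print(f"start/stops per power-on h: {start_stops_per_hour:.2f}")
print(f"start/stops per power cycle: {start_stops_per_power_cycle:.1f}")
```

The start/stop count comes out slightly above one per power-on hour and a bit over five per power cycle, which is the gap between the two counters being pointed at.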

----------

## Budoka

 *Hu wrote:*   

> Well, he didn't even get to 2 power-on years, so that's not an entirely unreasonable interpretation.   715 power-on days = 1.95 years.
> 
> This drive looks like it sees fairly frequent cycles though.  3383 power cycles in ~730 power-on days, with a start-stop-count of 17715.  That suggests it was start/stopped more than once per power-on hour, which is extremely high for my taste.  Some drives may be meant to operate that way, but I prefer to see start-stop-count much closer to power-cycle-count.

 

Any idea why that would be? The laptop is about 5 years old. It is set up to dual boot but spends 98 percent of the time booted into Gentoo plugged into power.

----------

## Hu

The manufacturer might have configured it to park whenever it was left idle for longer than N seconds, where N is typically chosen by profiling Windows to see what seems acceptable.  Using non-Windows systems with such a disk may cause it to park far more often than you want.  As I said above, your drive might have been designed to tolerate frequent start/stop cycles, in which case that high count should not be cause for concern.  I am not convinced that drives with such aggressive stopping are always built to tolerate frequent cycles.

----------

## Jaglover

Laptop drives probably are designed to work like this, there is sys-apps/idle3-tools for WD drives.

----------

## R0b0t1

 *Jaglover wrote:*   

> Laptop drives probably are designed to work like this, there is sys-apps/idle3-tools for WD drives.

 I think there is some nuance required here. The drives may have been told to spin up and down frequently in software, but the mechanical parts may not have been redesigned to handle the stresses this produces. There are a lot of failures reported with WD "green" products which are ostensibly caused by this distinction.

OP, I would interpret the output of smartctl as follows: 22 sectors are bad but no reallocations have been performed because too much of the disk is silently damaged to perform them. You should probably image your disk with ddrescue and then perform a backup off of that disk image. It is likely your disk might start to experience more issues if accessed in its entirety. Ddrescue will allow you to resume your imaging in the least destructive way possible.

----------

## Jaglover

Never heard of a green laptop drive.    :Rolling Eyes: 

----------

## Budoka

 *Hu wrote:*   

> The manufacturer might have configured it to park whenever it was left idle for longer than N seconds, where N is typically chosen by profiling Windows to see what seems acceptable.  Using non-Windows systems with such a disk may cause it to park far more often than you want.  As I said above, your drive might have been designed to tolerate frequent start/stop cycles, in which case that high count should not be cause for concern.  I am not convinced that drives with such aggressive stopping are always built to tolerate frequent cycles.

 

Would it be safe to say then that Gentoo may not be the best option for laptops or is that a leap?

----------

## Budoka

Thank you again everyone. Couple of quick questions.

1) So the consensus is that ddrescue is the best tool for me to mirror/backup my drive?  Say vs rsync or something like that?

2) Does the drive I back up the mirror to have to be the exact same size as the failing drive? 1TB. Or can it be larger?

3) I am running dual boot (no choice) Win/Gentoo. The Gentoo is LUKS on LVM. Will I be able to mirror the entire drive and then transfer to a new drive without repartitioning/reinstalling either OS?

----------

## NeddySeagoon

Budoka,

ddrescue reads the raw blocks from your drive with no regard to the content. It does not read files nor use the kernel filesystem drivers.

Unlike plain dd, it has some strategies to not stop when it encounters a bad block and even 'sneak up' on a bad block to try to get just one more read.

You need space to put the disk image, which is 1TB, plus the log file.  You must write the log file so that ddrescue can use it to resume.  

The backup can be to a file, it need not be to another disk.

You can use losetup and the loop module to mount the partitions inside the file and copy them to a HDD later.

The simple answer to 3 is maybe.  In theory yes but it all depends on what data is lost in the bad blocks.

If a bad block is in a file, that file is damaged.

If it's in a directory, the directory is damaged and you may lose access to it and all lower level directories.   The data may still be there, but it's no longer accessible using the file system.

If the damage is in filesystem metadata, it gets worse still.

The 22 blocks that the drive knows about as bad are only the ones it has tried to read.

Your long SMART test stopped with 90% remaining so that's where the first error is.   

Run ddrescue to make an image.  Once you have your image, we can talk about how to get more back.  That's why you must save the ddrescue log file.

You will use it several times.

As ddrescue works, you will notice that the numbers in 

```
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       22
```

change.

If the Current_Pending_Sector count reaches zero, your data has been read and relocated.  The drive is still scrap.

You have a drive that can no longer be trusted to read its own writing.

If the drive is still under warranty (it's always worth checking) the smartctl output you posted will be accepted as proof of failure.

----------

## P.Kosunen

 *Budoka wrote:*   

> 
> 
> ```
> 197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       22
> ```
> ...

 

On some drives these pending sector problems can occur after power related failures: the sectors are not broken, but the checksum does not match the data (I don't know whether this is the drive's internal checksum or something else). They go away if you write the whole disk full again; writing all zeros should be enough. If the full write fails, there are real problems and you should replace the drive.

I've had these errors on a 3.5" WD Green in an external USB case with a faulty power supply.

----------

## Budoka

 *NeddySeagoon wrote:*   

> Budoka,
> 
> ddrescue reads the raw blocks from your drive with no regard to the content. It does not read files nor use the kernel filesystem drivers.
> 
> Unlike plain dd, it has some strategies to not stop when it encounters a bad block and even 'sneak up' on a bad block to try to get just one more read.
> ...

 

Thank you for the explanation. I am always learning something on the board and appreciate it. I am reading through the ddrescue documentation now.  I was going to do it with my existing external 1TB backup drive, but it seems that would require me to dump all the existing backup data on it, which I am not comfortable with at this point. So I will pick up some kind of storage media that will allow me to save the 1TB image plus logfile, get to it, and report back.

Great tip about warranty. Will check that out too.

----------

## R0b0t1

 *Budoka wrote:*   

>  *Hu wrote:*   The manufacturer might have configured it to park whenever it was left idle for longer than N seconds, where N is typically chosen by profiling Windows to see what seems acceptable.  Using non-Windows systems with such a disk may cause it to park far more often than you want.  As I said above, your drive might have been designed to tolerate frequent start/stop cycles, in which case that high count should not be cause for concern.  I am not convinced that drives with such aggressive stopping are always built to tolerate frequent cycles.
> 
> Would it be safe to say then that Gentoo may not be the best option for laptops or is that a leap?

 

I think it is safe to say that Linux in general is not the best option for laptops. The vast majority of the power management interface for mobile devices is proprietary and hard to reverse engineer (how do you verify hardware is sleeping in its deepest sleep mode?). Linux is still catching up, though admittedly it has gotten far better. If you use Linux you will get a fraction of the advertised battery life. The one exception might be Chromebooks, which are sold with a version of Linux on them. But while you can technically run Linux on them, the ecosystem does not feel very open. I do not see why the board firmware is unupgradeable, for example.

If you plan on using your laptop in a truly mobile fashion, you may be better off without Linux. You might even need to avoid open source software when using Windows - it's my experience that everything but the Microsoft written applications use far more power than they should. On one hand, Microsoft's programs must be well designed to make use of processor wait and sleep states, but on the other hand, they are almost assuredly leveraging secret information to create their products.

The above and other reasons make me want open hardware very badly, but it does not seem like the market will support it. This makes me sad. I pray almost every day for the creation of open hardware systems, but it looks like I will have to wait until I am in Heaven, if I am good enough to go there.

----------

## Hu

 *Budoka wrote:*   

> Would it be safe to say then that Gentoo may not be the best option for laptops or is that a leap?

 As R0b0t1 says, Linux on laptops is a bit less pleasant than Linux on desktops.  However, I think it is premature to conclude that this particular drive failure is a data point in that larger argument.  We have a general suspicion that high cycle counts are bad for a drive, and widespread anecdotes that Western Digital Green drives did not handle high cycle counts gracefully.  Your drive is not a Western Digital Green drive.  We have no specific evidence that the high cycle count is harmful to your drive.  It's possible that this manufacturer expected the high cycle count and engineered the drive to tolerate it.  It's possible they didn't.  Even if we assume they didn't and that therefore a high cycle count would shorten the drive's lifetime, we also don't have hard evidence that the high cycle count led to the failure that prompted you to start this thread.  Drives have multiple ways to fail, and even if your drive was on its way to an early death from excess cycles, it may have died a different way (bad sectors) first.

I have used Linux on a laptop before, and was quite satisfied with the result.  Your results will likely depend heavily on whether the laptop is a model that provides good Linux support.  Some vendors are very bad about designing their hardware in ways that make it difficult for Linux to fully support the device.

----------

## Budoka

Hello, everyone. Reporting back in and I have a couple of additional questions.

It took 4 days to back up the drive and then run ddrescue on it. Seemed incredibly slow to me, approx. 24 hours on the first pass, but I'm not complaining because now I have a ddrescue image of the drive with logfile.

I will mirror the image to the new drive I purchased but I am curious about a couple of things.

1) Should I run ddrescue again to try to recover as much data as possible? Or is only one pass necessary? By that I mean ddrescue could only save what it could and another pass would yield the same result.

2) The failing drive is LUKS encrypted on LVM. I already formatted the new drive the same way, LUKS on LVM. Do I have to do anything special other than dd or ddrescue the image to the new drive?

Thanks again all.

----------

## Budoka

 *P.Kosunen wrote:*   

>  *Budoka wrote:*   
> 
> ```
> 197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       22
> ```
> ...

 

Hmm. So you think before cracking open the laptop and swapping out the HD I should try this first? Or are the odds that this isn't my problem?

Thanks

----------

## NeddySeagoon

Budoka,

Post the ddrescue log.  It tells what was recovered and what was not.

The smartctl data, now you have run ddrescue, would be useful too.

----------

## Budoka

 *NeddySeagoon wrote:*   

> Budoka,
> 
> Post the ddrescue log.  It tells what was recovered and what was not.
> 
> The smartctl data, now you have run ddrescue, would be useful too.

 

Thanks Neddy. Much appreciated.

 *Quote:*   

> # Rescue Logfile. Created by GNU ddrescue version 1.16
> 
> # Command line: ddrescue -d -f -r3 -v /dev/sda /mnt/3TB/DDRescue/ddrescue_image/sda_rescue.img /mnt/3TB/DDRescue/ddrescue_image/rescue.log
> 
> # current_pos  current_status
> ...

 

----------

## P.Kosunen

 *Budoka wrote:*   

> Hmm. So you think before cracking open the laptop and swapping out the HD I should try this first? Or are the odds that this isn't my problem?

 

In the case of a laptop, the odds might not be that good.

https://superuser.com/questions/979563/reallocate-bad-sector-linux

Writing just the broken sectors should tell you whether that is the case. If the pending sectors become bad sectors, they are really broken.

----------

## NeddySeagoon

Budoka,

The lines ending in + are start address (in hex bytes) and size of recovered areas.

The lines ending in - are start addresses and size of areas not yet recovered.

Thus 

```
0x6A4439000 0x00000200 - 
```

 is a single disk block that cannot be read.

```
0x66B992000 0x00001000 - 
```

is a sequence of eight disk blocks that cannot be read. 
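
To make the arithmetic concrete, a small Python sketch that turns such log lines into a starting sector and a sector count. It assumes the 512-byte logical sector size reported by smartctl earlier in the thread and the three-field `position size status` layout of the lines quoted here; the helper name is hypothetical:

```python
SECTOR = 512  # logical sector size in bytes, as reported by smartctl


def parse_map_line(line: str):
    """Split a 'pos size status' log line into (first_sector, sectors, status)."""
    pos, size, status = line.split()
    # Both numeric fields are byte values written in hex.
    return int(pos, 16) // SECTOR, int(size, 16) // SECTOR, status


for entry in ("0x6A4439000 0x00000200 -", "0x66B992000 0x00001000 -"):
    first_sector, sectors, status = parse_map_line(entry)
    print(f"{entry!r}: {sectors} sector(s) starting at sector {first_sector}")
```

The first entry works out to one sector and the second to eight, matching the description above.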

```
ddrescue -d -f -r3 -v ...
```

-r is the number of retries. Use -r256; I'll explain why later.

Add -M -A so that each failed area is treated as unknown after each retry. 

Otherwise, run the same command again.  ddrescue will read the log and not attempt to read already recovered areas.

As it runs, every 8 retries or so, turn the drive (whole laptop?) onto a different face/edge, whatever.

The idea is to use gravity to help coax just one more read from the unread areas.

When that's complete do it all over again with -R added to the command.  This tells ddrescue to work from the inside of the drive out.

Again, every 8 retries or so, turn the drive (whole laptop) onto a different face/edge whatever, you are still trying for just one more read, so you get your data back.

One depressing thought. Log entries like 0x6A4439000 0x00000200 - mean that one disc block cannot be read.

However, the filesystem block size is probably 4k or 1000 in hex.  That missing disc block means that the entire filesystem block cannot be read. 
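
A quick Python sketch of that effect, assuming a 4 KiB filesystem block size and that filesystem blocks line up with 4 KiB boundaries on the disk (partition, LVM and LUKS offsets could shift this, so treat it as an illustration only):

```python
FS_BLOCK = 0x1000  # assumed filesystem block size: 4 KiB


def enclosing_fs_block(byte_offset: int, block_size: int = FS_BLOCK) -> range:
    """Return the byte range of the aligned block that contains byte_offset."""
    start = byte_offset - (byte_offset % block_size)
    return range(start, start + block_size)


# The single unreadable 512-byte disc block from the log entry above.
blk = enclosing_fs_block(0x6A4439000)
print(hex(blk.start), hex(blk.stop - 1))  # 0x6a4439000 0x6a4439fff
print(len(blk) // 0x200, "disc blocks affected by 1 unreadable one")
```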

It's worth trying several more goes yet, since you don't know what's missing.

LUKS makes it more complex.  I have no idea how it deals with gaps in data.

It's easy to work out which partitions are affected. You can try to mount the partitions in the image and have a look around as long as you use the read-only (-o ro) option.

The filesystems are damaged. Do not write to them.

----------

## NeddySeagoon

P.Kosunen,

Writing the pending sectors loses the data for good.

----------

## Hu

LUKS reserves some space for its own use storing key material near the front of the underlying storage.  As I understand it, data blocks are not specifically tracked as present/absent, so sectors unwritten due to lost data will return the result of decrypting a blank sector from the new drive.  This will almost certainly be garbage.  If you are lucky, it will be garbage in a free area that you do not need.

----------

