# Issues checking a disk [Solved]

## iandoug

Hi

We are suffering loadshedding again, which means frequent shutdowns and reboots.

Yesterday at boot, it found and corrected some errors on sde1.

Today it got stuck on sde1 again, then said UNEXPECTED INCONSISTENCIES, run fsck manually.

Then it just sat there and looked at me... Eventually I pressed the reset button. On this reboot it got stuck there again, eventually printed a message similar to the one below, then finished booting after a while.

After booting, I ran fsck.

```
trooper /home/ian # fsck -y /dev/sde1
fsck from util-linux 2.37.2
e2fsck 1.46.4 (18-Aug-2021)
fsck.ext3: Attempt to read block from filesystem resulted in short read while trying to open /dev/sde1
Could this be a zero-length partition?
```

Is there some software that can fix this? (as in, find the back-up tables and restore) 

Will shut down now and check that it's not a loose cable or something ...

Thanks, Ian

----------

## NeddySeagoon

iandoug,

Is the partition table still intact?

What does 

```
fdisk -l /dev/sde
```

tell you?

What is sde1 used for?

What filesystem type does it contain?

fsck often makes a bad situation worse. If you don't have backups, make a copy of all of sde.

That will be your 'undo' if fsck just digs a deeper hole for you.

----------

## iandoug

Hi Neddy

fdisk finds nothing.

Forgot to mention that during boot it says that that drive  was not registered with DBUS, "despite waiting 1000..000 ms".

Tried to find exact message in dmesg, it's not there, but all this is:

https://pastebin.com/hxi9bqU5

It's a 500GB Data disk (WD, FWIW), EXT4 I think. Don't think there was anything crucial on there. Have no backups, my tape drive is broken...  :Sad: 

I see one of the fans is also not running. But drive is sharing power with another drive which is ok.

Let me see if another SATA channel works ...

----------

## mike155

iandoug, dmesg shows many hardware/bus errors. 

Fix the hardware before running fsck - otherwise it is possible that fsck deletes your data...

----------

## iandoug

 *mike155 wrote:*   

> iandoug, dmesg shows many hardware/bus errors. 
> 
> Fix the hardware before running fsck - otherwise it is possible that fsck deletes your data...

 

Not sure HOW to do that  :Smile: 

Possibly the auto-checks at boot may already have done the deleting for me ...

I reseated both cables to drive but it made no difference... maybe needs a different SATA channel.

Will revert, just doing some daily work first then can fiddle more.

Thanks, Ian

----------

## NeddySeagoon

iandoug,

```
[   52.992983] ata6.00: cmd 60/00:b0:3f:00:00/01:00:00:00:00/40 tag 22 ncq dma 131072 in
                        res 40/00:b4:3f:00:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
```

Can be the SATA cable, the motherboard SATA port, or the interface on the HDD.

Change the SATA data cable as an easy first step.

If you don't have a SATA cable in your box of bits, 'wipe' the connectors at both ends by unplugging and replugging them two or three times.

If that works, it won't work for very long.

```
[   77.036803] ata6.00: failed to IDENTIFY (I/O error, err_mask=0x5)
```

means that the kernel knows nothing about the drive, so no other useful communication with the drive is possible.
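To see which port the errors belong to, the two messages quoted above can be tallied per ATA port. This is only a minimal sketch: it assumes the kernel log text arrives on stdin, the function name `ata_error_summary` is made up for illustration, and it counts nothing beyond the two messages discussed here.

```shell
# Hypothetical helper: count "ATA bus error" and "failed to IDENTIFY"
# messages per ATA port in kernel log text read from stdin.
ata_error_summary() {
    awk '
        {
            # Remember the most recent ataN or ataN.NN prefix, because the
            # "res ..." continuation line carries no prefix of its own.
            for (i = 1; i <= NF; i++)
                if ($i ~ /^ata[0-9]+(\.[0-9]+)?:$/) {
                    port = $i
                    sub(/:$/, "", port)
                }
        }
        /ATA bus error|failed to IDENTIFY/ { if (port != "") count[port]++ }
        END { for (p in count) print p, count[p] }
    ' | sort
}

# Live use (needs permission to read the kernel log):
#   dmesg | ata_error_summary
```

If the same port keeps showing up after a cable swap, that points away from the cable and towards the port or the drive's own interface.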

----------

## figueroa

Also, don't ignore the power connection. Easy quick check is swap connections with other SATA device. A recent disk issue on our school's business office desktop turned up a cracked power connector.

What is loadshedding?

----------

## Hu

 *figueroa wrote:*   

> What is loadshedding?

 The power company has insufficient generation capacity to cover demand, so they deliberately black out some of their customers to get demand down to a level they can cover.  When done fairly, the blackouts rotate so that no one customer stays down for the duration of the imbalance.  However, that also means that powering up electronics without a UPS is a risky affair, since the blackout may rotate back into your area and bring you down again.

----------

## figueroa

Thanks, Hu. When this was first posted, I tried an Internet search for the term and got nothing meaningful. Today, it's all about rolling blackouts, so I may have mistyped it. I have UPSs on every electronic device in the house out of abundance of caution, yet we enjoy just about the most stable power I could ever imagine, and yet it's a rural electric cooperative that buys power primarily from Georgia Power.

ADDED: At the school where I support the network and desktop PCs (500 miles remote from me), we also use UPSs on mission critical computers and network devices. We've been lucky and have not suffered any physical damage that can be tracked to power outages. We do, however, have a lot of seriously old (over 8 years) hard drives that periodically swan dive out of service without warning. Fortunately, backups are automatic and redundant.

----------

## iandoug

Load-shedding is the South African term. 

Overseas they use rolling blackouts or brownouts.

We are currently enduring load-shedding since last Thursday, due to end this Thursday.

We're at "Level 2" but how that is implemented depends on where you are and who you get your power from.

I'm in Cape Town and we have a pumped storage hydro-electric scheme, so municipality pumps water to top dam at night, and then generates during the day, meaning that Cape Town reverts to Level 1 during day while rest of country is at level 2.

I have UPS in office and another for the NAS servers, Fibre box, etc, but they can't stay up for the 2 - 2.5 hours that we shed at a time.

So I need to keep shutting down the computers (except firewall) and restarting, which is a pain when they go off from 2AM to 4AM etc ... I have cron jobs that run in the night/early morning.

I think two of my older PCs suffered damage because neither will boot ... they were in another room sans UPS... it's the power surge when it comes on I think ... as the power network itself stabilises. Been reports of people losing fridges etc as well.

Cheers, Ian

----------

## iandoug

Hi guys

Okay, I transplanted the problematic drive from the Trooper box to the Fractal box. The Trooper box dates from 2013 and has basically been on 24/7 since then.

I no longer get those hardware errors when booting Trooper. Possibly the drives are now running at 6 Gb/s instead of 1.5 Gb/s. The PC was "sluggish" before.

Installing in Fractal meant rearranging the internal structure, which was not without issues....

Anyway, the BIOS/Kernel/Whatever is now playing silly buggers with me... plugging in the drive breaks the boot process... either because it can't find sda3, or can't find the mdadm drives.

After much creative rearranging of cables to sockets, I have given up and decided to switch to UUID syntax in fstab. Why this is necessary is beyond me, drives should be allocated in order of hardware port number, not randomly.

Now I need some advice in fixing fstab so that it still boots and does not leave me looking for a rescue thumbdrive.  :Smile: 

I have:

```
# blkid
/dev/md127: UUID="04d91321-b09c-4a74-ac73-a0c5f108ac96" BLOCK_SIZE="4096" TYPE="ext4"
/dev/sdb1: UUID="0270db76-3666-5b96-d9cb-e1224a65bf2c" UUID_SUB="77b7878f-5f94-8b6a-2c5c-408fe52be6f1" LABEL="fractal:1" TYPE="linux_raid_member" PARTUUID="c9e49a2e-60e2-fa4c-bb85-0b302264b40c"
/dev/sdc1: UUID="0270db76-3666-5b96-d9cb-e1224a65bf2c" UUID_SUB="1bdcbed0-b379-d684-4d50-7ed0cd67ee04" LABEL="fractal:1" TYPE="linux_raid_member" PARTUUID="2e50ed27-0faa-234b-b2ca-674cd0328bf1"
/dev/sda2: UUID="c10d95a5-db1c-473f-87b9-c2d9c108c286" TYPE="swap" PARTUUID="e549871a-9a5d-8c4e-8459-a8dc4702dde9"
/dev/sda3: UUID="14e0225a-638b-49fd-ae9d-d2f3a807fcec" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="6ecb5d48-00d6-ef4e-9ddc-6efbd75e448f"
/dev/sda1: UUID="8569-D7A3" BLOCK_SIZE="512" TYPE="vfat" PARTUUID="b4f42392-c85d-b64c-8919-7166ddf4db17"
```

And fstab is

```
/dev/sda1              /boot            vfat            defaults,noatime  0 2
/dev/sda2              none             swap            sw                0 0
/dev/sda3              /                ext4            noatime           0 1
/dev/md127             /home            ext4            noatime           0 3
```

So would this be correct?

```
# /dev/sda1
UUID=8569-D7A3                                  /boot   vfat    defaults,noatime  0 2
# /dev/sda2
UUID=c10d95a5-db1c-473f-87b9-c2d9c108c286       none    swap    sw                0 0
# /dev/sda3
UUID=14e0225a-638b-49fd-ae9d-d2f3a807fcec       /       ext4    noatime           0 1
# /dev/md127
UUID=04d91321-b09c-4a74-ac73-a0c5f108ac96       /home   ext4    noatime           0 3
```

And I don't need to mention sdb or sdc ?

I still don't follow how this syntax knows which is sda, sdb, sdc, sdd, etc ...

Thanks, Ian

----------

## NeddySeagoon

iandoug,

UUID is a property of a filesystem. If you make a new filesystem over an old one, the UUID is changed.

PARTUUID is a property of a partition. It will not change regardless of what happens to any filesystem that partition may hold, even if the filesystem is changed.

When you give UUID (or PARTUUID) in fstab, mount looks at all known locations until it finds the UUID you specified and returns the kernel major:minor device numbers, and says to the kernel, it's that one.

In your situation, the major:minor device numbers keep changing.

The relationship between sd* and the device numbers is fixed.

```
ls /dev/sd* -l
brw-rw---- 1 root disk 8,   0 Sep 30 07:53 /dev/sda
brw-rw---- 1 root disk 8,   1 Sep 12 12:30 /dev/sda1
brw-rw---- 1 root disk 8,   2 Sep 12 12:30 /dev/sda2
brw-rw---- 1 root disk 8,   3 Sep 30 08:24 /dev/sda3
brw-rw---- 1 root disk 8,  16 Sep 12 12:30 /dev/sdb
brw-rw---- 1 root disk 8,  17 Sep 12 12:30 /dev/sdb1
brw-rw---- 1 root disk 8,  18 Sep 12 12:30 /dev/sdb2
brw-rw---- 1 root disk 8,  19 Sep 12 12:30 /dev/sdb3
```

but the kernel allocates the drives as they are detected. That's not constant. It varies with spin up times, which are temperature and age dependent.

Using UUID discovers the needed information at mount time. It continues to work if you move a drive to a USB enclosure, or add more drives between existing drives.

All in all, UUID is robust.

A warning though. The kernel understands device numbers, the /dev/ names and PARTUUID. 

Using UUID to describe root on the kernel command line requires an initrd as the kernel does not understand UUID.
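For anyone wanting to pull a single identifier out of saved `blkid` output, a small filter does it. This is a minimal sketch, assuming the `blkid` lines are on stdin and that tag values contain no spaces; the function name `blkid_tag` is invented for illustration.

```shell
# Hypothetical helper: print one tag (UUID, PARTUUID, TYPE, ...) for a
# device from blkid-style lines on stdin. Assumes values contain no spaces.
blkid_tag() {
    awk -v dev="$1:" -v tag="$2" '
        $1 == dev {
            for (i = 2; i <= NF; i++) {
                n = index($i, "=")
                if (n > 0 && substr($i, 1, n - 1) == tag) {
                    val = substr($i, n + 1)
                    gsub(/"/, "", val)
                    print val
                }
            }
        }'
}

# Live equivalents from util-linux:
#   blkid -s PARTUUID -o value /dev/sda3
#   ls -l /dev/disk/by-uuid/ /dev/disk/by-partuuid/
```

The tag name is matched exactly, so asking for `UUID` does not return the `PARTUUID` by accident, which is the mix-up that matters for fstab versus the kernel command line.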

----------

## iandoug

Hi Neddy

System is detecting old drive first, in preference to new ones  :Smile: 

So is this better then?

```
# /dev/sda1
PARTUUID=b4f42392-c85d-b64c-8919-7166ddf4db1   /boot            vfat            defaults,noatime  0 2
# /dev/sda2
PARTUUID=e549871a-9a5d-8c4e-8459-a8dc4702dde9       none             swap            sw                0 0
# /dev/sda3
PARTUUID=6ecb5d48-00d6-ef4e-9ddc-6efbd75e448f         /                ext4            noatime           0 1
# /dev/md127
UUID=04d91321-b09c-4a74-ac73-a0c5f108ac96             /home            ext4            noatime           0 3
```

Thanks, Ian

----------

## Hu

For filesystems other than root, you can test this live by unmounting the relevant filesystem, then using mount /home (for example).  mount is then forced to use fstab to discover that you want a UUID based mount, and in turn forced to do the UUID-driven search to find /home.

The PARTUUID for boot looks a character short.  Other than that, the identifiers seem to match with your earlier blkid output, and you are correctly matching uuid to uuid and partuuid to partuuid.  (Giving one where the other is needed would fail, but you got this part correct.)
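A truncated identifier like that is easy to miss by eye. A rough sanity check, assuming the usual 8-4-4-4-12 hex layout of GPT PARTUUIDs (the function name `is_uuid_shaped` is hypothetical), could look like this:

```shell
# Hypothetical helper: succeed only if $1 has the 8-4-4-4-12 hex layout
# that GPT PARTUUIDs (and ext4 filesystem UUIDs) use.
is_uuid_shaped() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9A-Fa-f]{8}-([0-9A-Fa-f]{4}-){3}[0-9A-Fa-f]{12}$'
}

# The first value is from the blkid output, the second from the proposed fstab.
for id in b4f42392-c85d-b64c-8919-7166ddf4db17 \
          b4f42392-c85d-b64c-8919-7166ddf4db1; do
    if is_uuid_shaped "$id"; then
        echo "ok:        $id"
    else
        echo "malformed: $id"
    fi
done
```

Note that this only checks the shape. The vfat UUID `8569-D7A3` is legitimately shorter, and a well-formed but wrong identifier would still pass.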

----------

## NeddySeagoon

iandoug,

In fstab, it makes no difference. To read fstab, root has to be mounted.

-- edit --

You may not umount /home if a normal user is logged in as it will be in use.

----------

## iandoug

Hi Guys

Thanks.

Do I need to fiddle with Grub at all? Or will this "just work" ?

Thanks, Ian

----------

## NeddySeagoon

iandoug,

What is on your kernel command line in grub.cfg?

----------

## iandoug

 *NeddySeagoon wrote:*   

> iandoug,
> 
> What is on your kernel command line in grub.cfg?

 

`# Boot with network interface renaming disabled`

`GRUB_CMDLINE_LINUX="net.ifnames=0"`

?

This be Grub 2 which is new to me, I think what you want to see is here:

```

### BEGIN /etc/grub.d/10_linux ###

menuentry 'Gentoo GNU/Linux' --class gentoo --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-14e0225a-638b-49fd-ae9d-d2f3a807fcec' {

        load_video

        insmod gzio

        insmod part_gpt

        insmod fat

        set root='hd0,gpt1'

        if [ x$feature_platform_search_hint = xy ]; then

          search --no-floppy --fs-uuid --set=root --hint-bios=hd0,gpt1 --hint-efi=hd0,gpt1 --hint-baremetal=ahci0,gpt1  8569-D7A3

        else

          search --no-floppy --fs-uuid --set=root 8569-D7A3

        fi

        echo    'Loading Linux 5.10.27-gentoo ...'

        linux   /vmlinuz-5.10.27-gentoo root=/dev/sda3 ro net.ifnames=0 

}

submenu 'Advanced options for Gentoo GNU/Linux' $menuentry_id_option 'gnulinux-advanced-14e0225a-638b-49fd-ae9d-d2f3a807fcec' {

        menuentry 'Gentoo GNU/Linux, with Linux 5.10.27-gentoo' --class gentoo --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.10.27-gentoo-advanced-14e0225a-638b-49fd-ae9d-d2f3a807fcec' {

                load_video

                insmod gzio

                insmod part_gpt

                insmod fat

                set root='hd0,gpt1'

                if [ x$feature_platform_search_hint = xy ]; then

                  search --no-floppy --fs-uuid --set=root --hint-bios=hd0,gpt1 --hint-efi=hd0,gpt1 --hint-baremetal=ahci0,gpt1  8569-D7A3

                else

                  search --no-floppy --fs-uuid --set=root 8569-D7A3

                fi

                echo    'Loading Linux 5.10.27-gentoo ...'

                linux   /vmlinuz-5.10.27-gentoo root=/dev/sda3 ro net.ifnames=0 

        }

        menuentry 'Gentoo GNU/Linux, with Linux 5.10.27-gentoo (recovery mode)' --class gentoo --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.10.27-gentoo-recovery-14e0225a-638b-49fd-ae9d-d2f3a807fcec' {

                load_video

                insmod gzio

                insmod part_gpt

                insmod fat

                set root='hd0,gpt1'

                if [ x$feature_platform_search_hint = xy ]; then

                  search --no-floppy --fs-uuid --set=root --hint-bios=hd0,gpt1 --hint-efi=hd0,gpt1 --hint-baremetal=ahci0,gpt1  8569-D7A3

                else

                  search --no-floppy --fs-uuid --set=root 8569-D7A3

                fi

                echo    'Loading Linux 5.10.27-gentoo ...'

                linux   /vmlinuz-5.10.27-gentoo root=/dev/sda3 ro single net.ifnames=0

        }

        menuentry 'Gentoo GNU/Linux, with Linux 5.10.27-gentoo.old' --class gentoo --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.10.27-gentoo.old-advanced-14e0225a-638b-49fd-ae9d-d2f3a807fcec' {

                load_video

                if [ "x$grub_platform" = xefi ]; then

                        set gfxpayload=keep

                fi

                insmod gzio

                insmod part_gpt

                insmod fat

                set root='hd0,gpt1'

                if [ x$feature_platform_search_hint = xy ]; then

                  search --no-floppy --fs-uuid --set=root --hint-bios=hd0,gpt1 --hint-efi=hd0,gpt1 --hint-baremetal=ahci0,gpt1  8569-D7A3

                else

                  search --no-floppy --fs-uuid --set=root 8569-D7A3

                fi

                echo    'Loading Linux 5.10.27-gentoo.old ...'

                linux   /vmlinuz-5.10.27-gentoo.old root=/dev/sda3 ro net.ifnames=0 

        }

        menuentry 'Gentoo GNU/Linux, with Linux 5.10.27-gentoo.old (recovery mode)' --class gentoo --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.10.27-gentoo.old-recovery-14e0225a-638b-49fd-ae9d-d2f3a807fcec' {

                load_video

                if [ "x$grub_platform" = xefi ]; then

                        set gfxpayload=keep

                fi

                insmod gzio

                insmod part_gpt

                insmod fat

                set root='hd0,gpt1'

                if [ x$feature_platform_search_hint = xy ]; then

                  search --no-floppy --fs-uuid --set=root --hint-bios=hd0,gpt1 --hint-efi=hd0,gpt1 --hint-baremetal=ahci0,gpt1  8569-D7A3

                else

                  search --no-floppy --fs-uuid --set=root 8569-D7A3

                fi

                echo    'Loading Linux 5.10.27-gentoo.old ...'

                linux   /vmlinuz-5.10.27-gentoo.old root=/dev/sda3 ro single net.ifnames=0

        }

}

```

Thanks, Ian

----------

## NeddySeagoon

iandoug,

```
vmlinuz-5.10.27-gentoo root=/dev/sda3 
```

and the lack of an initrd was what we wanted to see.

You should change root=/dev/sda3 to root=PARTUUID=...

as UUID will not work without an initrd.

Then you can randomise your HDD connections almost however you want and it will still work.

I say 'almost' as there are some extra conditions for root on USB, which you don't need to know about for this.
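Once the config has been changed, one way to confirm what each menu entry really passes as `root=` is to pull it out of the generated grub.cfg. A minimal sketch, assuming the file is fed in on stdin; the helper name `grub_root_args` and the path in the usage comment are assumptions, not something taken from this system.

```shell
# Hypothetical helper: list the distinct root= arguments on the "linux"
# lines of a grub.cfg read from stdin.
grub_root_args() {
    awk '$1 == "linux" {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^root=/) print $i
         }' | sort -u
}

# Live use, assuming the config lives at /boot/grub/grub.cfg:
#   grub_root_args < /boot/grub/grub.cfg
```

If the only line printed starts with `root=PARTUUID=`, every entry has been switched over. Any leftover `root=/dev/sda3` means an entry still depends on detection order.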

----------

## iandoug

Hi Neddy

Thanks. I have been using old grub on all my boxes, this is first time with grub2 and so am a bit lost as to where things get specified. 

Do I change this

```
GRUB_CMDLINE_LINUX="net.ifnames=0"
```

in /etc/default/grub

to 

```
GRUB_CMDLINE_LINUX="root=PARTUUID=6ecb5d48-00d6-ef4e-9ddc-6efbd75e448f net.ifnames=0"
```

I see no mention of which partition to boot in /etc/default/grub. (As a noob, grub2 seems convoluted and confusing... things hidden away)

Thanks, Ian

----------

## iandoug

Sigh.

Remember, ‘GRUB_DISABLE_LINUX_PARTUUID’ and ‘GRUB_DISABLE_LINUX_UUID’ are also considered to be set to ‘false’ when they are unset. 

https://www.gnu.org/software/grub/manual/grub/html_node/Root-Identifcation-Heuristics.html

vs this, which says default is true:

GRUB_DISABLE_LINUX_PARTUUID 	true 	Since version 2.04. If false, and if there is either no initramfs or GRUB_DISABLE_LINUX_UUID is set to true, ${GRUB_DEVICE_PARTUUID} is passed in the root parameter on the kernel command line.

https://wiki.gentoo.org/wiki/GRUB2/Configuration_variables

Or am I reading it wrong?

Thanks, Ian

----------

## iandoug

 *mike155 wrote:*   

> iandoug, dmesg shows many hardware/bus errors. 
> 
> Fix the hardware before running fsck - otherwise it is possible that fsck deletes your data...

 

Okay, long story short... I modified the grub config to have GRUB_DISABLE_LINUX_PARTUUID=false, regenerated and checked and installed the config, pleaded with the deities and rebooted.

So now the boot disk and raid detection are okay.

I added another blank disk as "sanity control", that gets picked up ok.

However when I add the problem disk, the boot process hangs and the raid is not set up properly.

I then plugged it into an older PC (Manjaro), with a simpler setup and known good cables. The OS detects the drive, and fdisk is happy to work with it. Did not see any hardware errors in dmesg.

So I'm running fsck on it, it reports errors on disk, and started scanning.

Now says "Error reading block 99828 (input/output error) while getting next inode from scan. Ignore error<y>?"

I suppose I shall have to answer Yes, unless there is a better option.

I guess this means "bad sectors on disk" ?

Thanks, Ian

----------

## NeddySeagoon

iandoug,

If the drive worked elsewhere, it's not the drive or data cable in use there.

That points to the SATA port on the motherboard (in the original system) being defective, or with a low probability, the power. 

Answer N to fsck. It can be a filesystem destruction tool. It's OK to look, but do not let it make changes.

Run the long test with smartclf for a low level surface scan, without data being exported over the interface.

dd the entire drive to /dev/null to do the same thing but using the SATA interface.

The filesystem is the next layer up. Do not test that unless one or both of the above tests pass.

Was the original drive a member of the RAID set?

It will have an old event count and mdadm won't be happy. It won't be happy with two members of the RAID set in the same slot either.
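The two read-only checks described above, in the order given, might be scripted roughly as follows. This is a sketch, not a tested procedure: `/dev/sdX` is a placeholder, the wrapper names `run` and `read_checks` are invented, and by default the commands are only printed, not executed.

```shell
# Sketch only: commands are printed unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the two read-only checks.
# $1 is the whole-disk device, e.g. /dev/sdX (placeholder).
read_checks() {
    dev=$1
    # 1. Drive-internal surface scan; no data crosses the SATA link.
    run smartctl -t long "$dev"
    # The self-test log shows the result once the test has finished,
    # which can be hours after it was started.
    run smartctl -l selftest "$dev"
    # 2. Read every sector over the interface and discard the data.
    run dd if="$dev" of=/dev/null bs=1M
}

read_checks /dev/sdX
```

Neither step writes to the disk, which is the point: nothing here can make the filesystem worse.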

----------

## iandoug

 *NeddySeagoon wrote:*   

> 
> 
> If the drive worked elsewhere, its not the drive or data cable in use there.
> 
> That points to the SATA port on the motherboard (in the original system) being defective, or with a low probability, the power. 
> ...

 

Motherboard (this box) may be getting faulty, but think it may be drive too... the boot before All Hell Broke Loose also found and fixed errors with the drive.

 *NeddySeagoon wrote:*   

> iandoug,
> 
> Answer N to fsck. It can be a filesystem destruction tool. Its OK to look but not let it make changes.
> 
> 

 

That aborted fsck.

 *NeddySeagoon wrote:*   

> 
> 
> Run the long test with smartclf for a low level surface scan, without data being exported over the interface.
> 
> dd the entire drive to /dev/null to do the same thing but using the SATA interface.
> ...

 

I assume you mean smartctl which I have not used before in this way. Was not even installed on the Manjaro box.

Short test passed, running long test, will take almost 3 hours. A "verbose" option would be nice.

I understand your next step as "copy all contents to nowhere" ... will try that later. Power is going off again shortly after the long test finishes.

 *NeddySeagoon wrote:*   

> 
> 
> Was the original drive a member of the RAID set.
> 
> It will have an old event count and mdadm won't be happy. It won't be happy with two members of the RAID set in the same slot either.

 

No, is stand-alone drive, but its issues somehow confused Raid detection on my new box. 

Thanks, Ian

----------

## NeddySeagoon

iandoug,

fsck is a last resort, after you have validated backups or a ddrescue image of the drive.

While the interface with the drive is under a cloud, you know nothing about the content of the drive with any certainty.

That being the case, it's not safe to let fsck make any changes.

The long test does a "copy all contents to nowhere" but the data never leaves the drive.

If that fails, the drive is probably scrap because it can't read its own writing but see later.

If the long test passes the internals of the drive are OK and the  "copy all contents to nowhere" copies the entire contents of the drive over the interface.

That checks the interface separately from the rest of the drive.

I said above that the drive is probably scrap if it can't read its own writing. That means the failing sector remapping hasn't worked.

The idea is that drives move data from a failing sector to a spare sector when they detect read problems.

This remapping can be forced with a write to the 'failed' sector.

Either the write will succeed to the original sector, or the write will fail to the original sector and the drive will write it to a spare sector.

It appears to succeed in either case.

You can tell from the smart data. Has the Pending Sector Count or the Reallocated Sector Count changed? 

Long shot to explain the odd behaviour: have you ever dd imaged the drive or any of its partitions?

That duplicates UUIDs and having duplicate UUIDs in the same system is a verybadthing.
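To answer the pending/reallocated question without reading the whole table, the raw value of a single attribute can be extracted from the attribute listing. A minimal sketch, assuming `smartctl -A` style text on stdin with RAW_VALUE as the tenth column; the helper name `smart_raw` is made up.

```shell
# Hypothetical helper: print the RAW_VALUE column for one attribute ID
# from `smartctl -A` style text on stdin.
smart_raw() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $10 }'
}

# Live use (assumes smartmontools is installed; /dev/sdX is a placeholder):
#   smartctl -A /dev/sdX | smart_raw 197   # Current_Pending_Sector
#   smartctl -A /dev/sdX | smart_raw 5     # Reallocated_Sector_Ct
```

Running it before and after a test, or before and after a forced write, shows whether either count has moved.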

----------

## iandoug

@Neddy

I checked the results of the long scan after the specified time via smartctl -a but could not see anything... just showed some errors that occurred 335 days after power on, which would likely have been Long Ago.

So maybe it was still busy with the long scan. (Had to shut down for loadshedding)

Running it again with -C option.

Have never dd the drive anywhere. The PARTUUID as shown by blkid is a shorter string than my other drives ... don't know if that is because it is older.

Thanks, Ian

----------

## NeddySeagoon

iandoug,

PARTUUID is faked with MSDOS partition tables. It's the 32 bit volume ID followed by the partition number.

It may not even be a constant for logical partitions. 
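As an illustration of why the string is shorter, here is how such a value is put together. The layout shown (eight hex digits of the disk identifier, a dash, then the partition number as two hex digits) is how the kernel formats it as far as I know; the identifier below is made up and `mbr_partuuid` is a hypothetical name.

```shell
# Hypothetical helper: build the PARTUUID string reported for a partition
# on an MSDOS (MBR) disk: 8 hex digits of the disk identifier, a dash,
# then the partition number as 2 hex digits.
mbr_partuuid() {
    printf '%08x-%02x\n' "$1" "$2"
}

mbr_partuuid 0x1234abcd 1    # made-up disk identifier, partition 1
```

Because the value is derived from the disk identifier, rewriting that identifier changes every PARTUUID on the disk.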

The test results are shown in

```
smartctl -a
```

under the parameter table.

```
SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0

  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0

  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       12368

  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       43

  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0

  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0

  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0

  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       428

 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43

191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0

192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       9

193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       50

194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       27 (Min/Max 12/35)

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0

197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0

220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       101449732

222 Loaded_Hours            0x0032   100   100   000    Old_age   Always       -       350

223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0

224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0

226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       510

240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]
```

----------

## iandoug

Hi Neddy

smartctl -a before long scan and after

before

```

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0

  3 Spin_Up_Time            0x0003   253   169   021    Pre-fail  Always       -       2325

  4 Start_Stop_Count        0x0032   083   083   000    Old_age   Always       -       17349

  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0

  7 Seek_Error_Rate         0x000e   200   097   051    Old_age   Always       -       0

  9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       74872

 10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0

 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0

 12 Power_Cycle_Count       0x0032   093   093   000    Old_age   Always       -       7253

192 Power-Off_Retract_Count 0x0032   186   186   000    Old_age   Always       -       10524

193 Load_Cycle_Count        0x0032   197   197   000    Old_age   Always       -       11321

194 Temperature_Celsius     0x0022   116   092   000    Old_age   Always       -       34

196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0

197 Current_Pending_Sector  0x0012   200   191   000    Old_age   Always       -       29

198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

200 Multi_Zone_Error_Rate   0x0008   200   179   051    Old_age   Offline      -       0

SMART Error Log Version: 1

ATA Error Count: 308 (device log contains only the most recent five errors)

        CR = Command Register [HEX]

        FR = Features Register [HEX]

        SC = Sector Count Register [HEX]

        SN = Sector Number Register [HEX]

        CL = Cylinder Low Register [HEX]

        CH = Cylinder High Register [HEX]

        DH = Device/Head Register [HEX]

        DC = Device Command Register [HEX]

        ER = Error register [HEX]

        ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 308 occurred at disk power-on lifetime: 8059 hours (335 days + 19 hours)

  When the command that caused the error occurred, the device was active or idle.

```

after

```

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0

  3 Spin_Up_Time            0x0003   253   169   021    Pre-fail  Always       -       1500

  4 Start_Stop_Count        0x0032   083   083   000    Old_age   Always       -       17371

  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0

  7 Seek_Error_Rate         0x000e   200   097   051    Old_age   Always       -       0

  9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       74875

 10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0

 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0

 12 Power_Cycle_Count       0x0032   093   093   000    Old_age   Always       -       7275

192 Power-Off_Retract_Count 0x0032   186   186   000    Old_age   Always       -       10546

193 Load_Cycle_Count        0x0032   197   197   000    Old_age   Always       -       11343

194 Temperature_Celsius     0x0022   102   092   000    Old_age   Always       -       48

196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0

197 Current_Pending_Sector  0x0012   200   191   000    Old_age   Always       -       29

198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

200 Multi_Zone_Error_Rate   0x0008   200   179   051    Old_age   Offline      -       0

SMART Error Log Version: 1

ATA Error Count: 308 (device log contains only the most recent five errors)

```

Don't know if that will tell you anything useful.

Thanks, Ian

----------

## NeddySeagoon

iandoug,

```
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       74872
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   191   000    Old_age   Always       -       29
```

At 74,872 running hours, it's an old drive.

It has 29 different sectors that it's tried to read but can't, so it can't reallocate them.

It has not tried to reallocate anything yet, probably because it can't read the sectors that need to be reallocated.

All the SMART values pass, but a non-zero Current_Pending_Sector count is a no-questions-asked warranty return.

Except your warranty will be expired by now.

Those 29 unreadable sectors are a minimum. There may be more that it's not tried yet.

I would expect the long test to fail on that drive.
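
If you want to keep an eye on those counters, something like this pulls the raw values out of `smartctl -A` output. It's only a sketch: `smart_raw` is a made-up helper name, and it assumes the usual ten-column attribute table that smartctl prints for ATA drives, with the raw value in the last column.

```shell
#!/bin/sh
# smart_raw NAME: read `smartctl -A` style output on stdin and print the
# RAW_VALUE (10th column) of the attribute called NAME.
# (smart_raw is a hypothetical helper, not part of smartmontools.)
smart_raw() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# Sample rows copied from the output posted above.
sample='  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   200   191   000    Old_age   Always       -       29
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0'

pending=$(printf '%s\n' "$sample" | smart_raw Current_Pending_Sector)
if [ "$pending" -gt 0 ]; then
    echo "WARNING: $pending sectors pending reallocation"
fi
```

On the real drive you would feed it live data instead of the sample, e.g. `smartctl -A /dev/sde | smart_raw Current_Pending_Sector`.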

Do you need the data on that drive? 

There may be ways and means to make an image and coax one more read out of some or all of the failed sectors if you do.

----------

## iandoug

 *NeddySeagoon wrote:*   

> 
> 
> Do you need the data on that drive? 
> 
> There may be ways and means to make an image and coax one more read out of some or all of the failed sectors if you do.

 

It's a 500GB drive, but was probably mostly empty.

Can't remember exactly what was on there, I had recently moved some stuff from a bigger hard drive there while I cleaned up the bigger one.

If I can get it to mount so that I can see exactly what is on it, then I can make a rational decision.

I have not tried to mount it in the box it's in at the moment, because fsck may kick in.

What happens if I tell fsck to ignore the "can't follow chain" error and just keep going? Saving some is better than nothing. Or at least to see exactly what is on there.

Must see about getting my tape drive fixed ...

Thanks, Ian

----------

## iandoug

@Neddy ... sorry to bump.

 *iandoug wrote:*   

> 
> 
> What happens if I tell fsck to ignore the "can't follow chain" error and just keep going? Saving some is better than nothing. Or at least to see exactly what is on there.
> 
> 

 

Thanks, Ian

----------

## NeddySeagoon

iandoug,

When you tell fsck not to do something, it exits. There is no "don't do this but continue anyway".  

If you have 500G spare somewhere make an image of the failing drive with ddrescue.

The blocks that are easy to read will be read first then ddrescue will try very hard to coax one more read out of the faulty sectors, which is all you need.

It does a lot more than the kernel drivers.

ddrescue is like dd but it deals with errors, which is why it exists. 

You must create the log file. ddrescue can be restarted and uses the log to know what has already been recovered.

Restarts will probably be required as you will want to change the options you give.

You can help the data recovery too but more on that after you share the log file when ddrescue halts the first time.

Once you have as much of an image as you think you are going to get (or are prepared to wait for), we can try mounting the recovered image.

If fsck hasn't done any damage itself, we may well coax that to work. For safety, it will be a read-only mount but you will be able to copy things.

Mount has lots of options to tell it to try harder too.
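
As a rough starting point, a two-pass run could look like the sketch below. The paths are placeholders, the `run` and `rescue` helpers are made up for illustration, and the choice of `-r3` is an assumption you will likely want to change. Newer ddrescue versions call the log file a "mapfile"; it is the third argument either way. The script only prints the commands unless you set `DRY_RUN=0`.

```shell
#!/bin/sh
# Two-pass ddrescue plan. SRC, IMG and MAP are placeholders: point SRC at
# the failing disk and keep IMG and MAP on a different, healthy disk.
SRC=${SRC:-/dev/sdX}
IMG=${IMG:-/mnt/spare/rescue.img}
MAP=${MAP:-/mnt/spare/rescue.map}

# With DRY_RUN=1 (the default here) commands are only printed, not run.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

rescue() {
    # Pass 1: copy everything that reads easily, skip the slow scraping phase (-n).
    run ddrescue -n "$SRC" "$IMG" "$MAP"
    # Pass 2: revisit the bad areas with direct disc access (-d) and
    # three retry passes (-r3). The log/mapfile stops pass 1 being redone.
    run ddrescue -d -r3 "$SRC" "$IMG" "$MAP"
}

rescue
```

Because both passes share the same log file, you can stop after the first pass, post the log, and pick different options for the second.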

----------

## iandoug

 *NeddySeagoon wrote:*   

> iandoug,
> 
> When you tell fsck not to do something, it exits. There is no "don't do this but continue anyway".  
> 
> If you have 500G spare somewhere make an image of the failing drive with ddrescue.
> ...

 

I have a spare 2TB drive.

Will try tomorrow, thanks. Power going off again shortly.

Re fsck, I was thinking along the lines of "ignore error and carry on checking the rest.. " based on the assumption there are a limited number of bad sectors, and maybe not even data on most of them... then I can copy what I can.

Thanks, Ian

----------

## NeddySeagoon

iandoug,

The bad sectors may contain filesystem metadata.

e.g. the inode table or the root directory. You get nothing through the filesystem without them.

They may be data sectors. Then the file they belong to is damaged.

We do know that they are used though, because the drive tried to access them.

If it hadn't, it would not know they were causing a problem. 

A disk image in a file works. The various partitions can be mounted by using losetup to get a loopback device with partitions, then loop mounting those partitions.
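
Roughly, that goes like the sketch below. The image path, mount point and `/dev/loop0` are placeholders, and `attach_image`, `mount_part` and `run` are made-up helper names. The `noload` mount option assumes the filesystem is ext3 or ext4. As written it only prints the commands; set `DRY_RUN=0` and run it as root to do it for real.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default here) commands are only printed, not run.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# attach_image IMAGE: attach IMAGE read-only (-r) to the first free loop
# device (-f), scan its partition table (-P) and print the device (--show).
attach_image() {
    run losetup -f --show -P -r "$1"
}

# mount_part LOOPDEV PARTNUM DIR: mount partition PARTNUM of LOOPDEV on DIR,
# read-only and without replaying the journal (noload, ext3/ext4 only).
mount_part() {
    run mount -o ro,noload "${1}p${2}" "$3"
}

attach_image /mnt/spare/rescue.img
# Use whatever device losetup printed; /dev/loop0 is only an example.
mount_part /dev/loop0 1 /mnt/rescue
```

When you are done, `umount` the directory and detach the loop device with `losetup -d`.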

----------

## iandoug

Hi Neddy

Thanks for ddrescue.

I was eventually able to recover the whole drive. No bad sectors. Instead it seems the electronics were overheating, which caused the read failures.

Took a few experiments in cooling to find the best solution, which was a standing floor fan blowing at max speed to the underside of the drive standing on its edge.

After that I cloned the saved copy to another disk, and let fsck do its thing.  It did fix some errors. Looks like most if not all content was saved, and it was stuff I needed.

Am now going to set up another RAID1 in the box for use as "data disks" and try to keep less used stuff on the NAS boxes.

Thanks for all your help.

----------

## NeddySeagoon

iandoug,

lm-sensors can tell you the hdd temperature. e.g.

```
Adapter: SCSI adapter

temp1:        +27.0°C  (low  =  +5.0°C, high = +55.0°C)

                       (crit low = -40.0°C, crit = +70.0°C)

                       (lowest = +15.0°C, highest = +27.0°C)

drivetemp-scsi-0-0

Adapter: SCSI adapter

temp1:        +28.0°C  (low  =  +5.0°C, high = +55.0°C)

                       (crit low = -40.0°C, crit = +70.0°C)

                       (lowest = +15.0°C, highest = +28.0°C)
```

For some drives it's also in the output of 

```
smartctl -x ...
```
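
If you want a quick warning when a drive runs hot, a filter along these lines would do. It's a sketch that assumes the drivetemp layout shown above (chip name line, then a `temp1:` line); `drive_temps` is a made-up name and the limit is whatever you pass in.

```shell
#!/bin/sh
# drive_temps LIMIT: read `sensors` output on stdin and print one line per
# drivetemp chip: "<chip> <temp> ok" or "<chip> <temp> HOT".
# (drive_temps is a hypothetical helper, not part of lm-sensors.)
drive_temps() {
    awk -v limit="$1" '
        /^drivetemp-/ { chip = $1 }          # remember the current chip name
        /^temp1:/ && chip != "" {
            t = $2
            gsub(/[^-0-9.]/, "", t)          # "+28.0°C" -> "28.0"
            print chip, t, (t + 0 >= limit + 0 ? "HOT" : "ok")
            chip = ""
        }'
}

# Sample block copied from the sensors output above.
sample='drivetemp-scsi-0-0
Adapter: SCSI adapter
temp1:        +28.0°C  (low  =  +5.0°C, high = +55.0°C)'

printf '%s\n' "$sample" | drive_temps 45
```

In real use that would be `sensors | drive_temps 45`, with the limit set somewhere below the drive's own "high" value.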

----------

## figueroa

@iandoug

Thanks for the closure. I have long used cooling fans going back to Commodore 1541 floppy disk drives in the summer.

----------

