# One single partition on SATA Drive giving I/O Errors

## BlastM

Hi Gentooforumers,

I have a particularly odd problem.

Here's the situation: I have an NForce 3 250Gb board and have 2 ATA harddrives and 2 SATA harddrives running off it.

A few weeks ago I replaced the cheapo powersupply with an Antec Smartpower 400W and installed a second SATA drive (both Seagate, the second being 250GB and the older one 200GB). This setup had been rock solid up until a few days ago.

Then suddenly I started getting errors writing to an Ext3 partition on the older SATA drive. "Read Only Filesystem" which was weird, and I tried "mount -o remount,rw" which didn't work, so I rebooted.

The partition, sdb5, wouldn't mount on startup, but the swap partition and the VFAT partition also on the drive mount fine, and continue to do so up to this very point. Doing a manual mount complains about unspecified filesystem, and doing "mount -t ext3" resulted in "Bad superblock" and the following dmesg output:

```
ata2: command 0x25 timeout, stat 0xd0 host_stat 0x21
ata2: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata2: status=0xd0 { Busy }
sd 1:0:0:0: SCSI error: return code = 0x8000002
sdb: Current: sense key: Aborted Command
    Additional sense: Scsi parity error
end_request: I/O error, dev sdb, sector 1073
EXT3-fs: unable to read superblock
```

So I emerged dd-rhelp and tried to dump the entire filesystem to another disk. The dumped image got up to 12MB in size, in the first few seconds, then never got any larger. I left it to run for several hours but the only thing that happened was this sort of stuff in dmesg:

```
ata2: status=0xd0 { Busy }
sd 1:0:0:0: SCSI error: return code = 0x8000002
sdb: Current: sense key: Aborted Command
    Additional sense: Scsi parity error
end_request: I/O error, dev sdb, sector 327681075
Buffer I/O error on device sdb5, logical block 163840002
ata2: command 0x25 timeout, stat 0xd0 host_stat 0x21
ata2: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
```

being repeated over and over.

I had a lot of the data backed up but there is still a lot of data I really want to recover.

I'm not sure whether this is a hardware or filesystem problem, because the VFAT and the swap partition mount fine and can be read and written to, with no output in dmesg. At the moment I'm using an Ubuntu 64-bit LiveCD, but I've also tried Knoppix and also the 32-bit Gentoo install (which still boots fine as it's on another drive) all to the same result. I've also swapped the SATA drives, cables, used one at a time etc, and my next job will be trying the problem drive in another machine.

I'd like to salvage what I can. Your help is appreciated and I apologise for the lengthy post. :D

Thanks,

Hamish.

----------

## NeddySeagoon

BlastM,

It looks like the drive goes off-line after (or because of) the error, which will make things difficult.

You can try to run 

```
e2fsck -n -b 32768 /dev/sdb5
```

That won't fix anything; it will just check the filesystem using the first backup superblock. If it runs to completion, there is a good chance of getting your data back.

Read 

```
man e2fsck
```

before you use the command and understand what it does.

I think mount can use alternate superblocks too but that's not so useful.

Since you don't have a backup, the main thing is not to make things worse by allowing fsck to change the filesystem until you have some idea of the changes it wants to make.

You may want to 

```
emerge smartmontools
```

and attempt to read the drive's internal error log. Your version of libata (in the kernel) needs to support this. There has been a patch around for some time but I don't know if it's in the kernel yet.

----------

## BlastM

Hi NeddySeagoon,

I should have mentioned that I had searched the forums before posting, and had already done a long test using smartctl. I have no idea what the output means so I didn't think much of it, but there were errors in dmesg whilst it was going.

"mkfs.ext3 -n /dev/sdb5" says it would create backup superblocks at 32768... etc up to 23887872. When I try "fsck.ext3 -b 4096000 -n /dev/sdb5" for example, dmesg starts showing the same errors as above and fsck.ext3 would have to be kill -9'd. There is very little harddrive activity when this is going on.

Still, I don't understand this problem. If it's a dying drive, why are the other partitions perfectly fine? (The data on these partitions is replaceable so I'm not going to bother backing up, no I'm not totally stupid ;) ).

This is probably unrelated but worth mentioning: I did an emerge -u world a few days before this happened, and on boot I get a lot of messages like the following:

```
Jul 26 21:36:12 blastbox udevd-event[6615]: run_program: exec of program '/sbin/udev_run_hotplugd' failed
Jul 26 21:36:12 blastbox udevd-event[6616]: run_program: exec of program '/sbin/udev_run_devd' failed
```

repeated over and over, but I don't think this is causing the problem (I'll fix it later).

Thanks for your help so far, it's appreciated! :D

----------

## NeddySeagoon

BlastM,

Silly question time. 4096000 is a strange number for a backup superblock; they are normally powers of 2.

Where did you get that from?

I don't think your udev messages are related - I get them too.

Your disk can go bad in one small area. Total failure is normally caused by the platter bearings failing; you normally notice that, as the drive becomes noisy and 'clatters' or 'clicks' as the heads recalibrate and reseek.

Small areas, like blocks or tracks going bad, are mapped out by using spares. You can see that in the smartmontools output.

A drive that has used all its spare sectors is end of life.

----------

## Keruskerfuerst

1. Check the power cables to the HDDs

2. Check the SATA cables to the drives

3. Repair the superblock with e2fsck -n -b 8193

----------

## Akkara

Not directly relevant to your immediate recovery problem, but something to look out for:

I had had trouble with the SATA cables in the past. It seems certain combinations of drives and cables have a tendency to not make good contact, and after a while I started getting errors. I had gotten quite a few "Scsi parity error" messages in dmesg output, which stopped when I jiggled the cable.

Perhaps a similar issue is what led to your initial corruption?

I eventually fixed it by getting a new cable and by routing it and tying it down with wire ties so as to put a slight force pushing the contacts side of the cable into the drive. I haven't had any troubles since making this fix.

Good luck!

----------

## Keruskerfuerst

The standard SATA cables have an engineering fault.

The data and the handshake wires have a common shield.

This is a big mistake.

The data and the handshake wires must have separate shields.

----------

## BlastM

 *NeddySeagoon wrote:*   

> BlastM,
> 
> Silly question time. 4096000 is a strange number for a backup superblock; they are normally powers of 2.
> 
> Where did you get that from?

 

By running this:

```
blastbox blastm # mkfs.ext2 -n /dev/sdb5
mke2fs 1.38 (30-Jun-2005)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
21790720 inodes, 43567233 blocks
2178361 blocks (5.00%) reserved for the super user
First data block=0
1330 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
        4096000, 7962624, 11239424, 20480000, 23887872
blastbox blastm #
```
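For reference, the numbers in that list follow a pattern. Assuming the filesystem uses the `sparse_super` feature (the usual mke2fs default, though not confirmed for this partition), backup superblocks sit in block group 1 and in groups whose number is a power of 3, 5 or 7. A minimal Python sketch, with a hypothetical helper name `backup_superblock_blocks`, reproduces the list from the group count and group size printed above:

```python
def backup_superblock_blocks(block_groups, blocks_per_group, first_data_block=0):
    """Return block numbers of backup superblocks, assuming sparse_super.

    Under that assumption, backups live in group 1 and in every group
    whose number is a power of 3, 5 or 7 (group 0 holds the primary).
    """
    groups = set()
    if block_groups > 1:
        groups.add(1)
    for base in (3, 5, 7):
        g = base
        while g < block_groups:
            groups.add(g)
            g *= base
    return [g * blocks_per_group + first_data_block for g in sorted(groups)]


# Values taken from the mkfs.ext2 -n output above.
blocks = backup_superblock_blocks(block_groups=1330, blocks_per_group=32768)
print(blocks)
print(4096000 in blocks)  # True: group 125 (5 cubed) times 32768
```

So 4096000 is not a power of 2, but it is a legitimate backup location: it is the start of block group 125.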

 *Quote:*   

> Your disk can go bad in one small area. Total failure is normally caused by the platter bearings failing; you normally notice that, as the drive becomes noisy and 'clatters' or 'clicks' as the heads recalibrate and reseek.
> 
> Small areas, like blocks or tracks going bad, are mapped out by using spares. You can see that in the smartmontools output.
> 
> A drive that has used all its spare sectors is end of life.

 

There are definitely no dying harddrive noises. I'll run smartctl again and post the output.

 *Akkara wrote:*   

> Not directly relevant to your immediate recovery problem, but something to look out for:
> 
> I had had trouble with the SATA cables in the past. It seems certain combinations of drives and cables have a tendency to not make good contact, and after a while I started getting errors. I had gotten quite a few "Scsi parity error" messages in dmesg output, which stopped when I jiggled the cable.
> 
> Perhaps a similar issue is what led to your initial corruption?
> ...

 

I had tried it with the cable for the other drive and plugged into the other SATA port on the motherboard: same problem. I've just turned off my server so I'll put it in there and see what happens.

Cheers everyone!

----------

## Keruskerfuerst

I have bought high quality SATA cables from Revoltec with a separate shield for each handshake and data transfer wire.

My SATA drives do work fine now.

----------

## BlastM

Keruskerfuerst: so you were getting similar errors to me, which went away when you got the fancy cables?

OK so right now the harddrive is in my Dell server, running Knoppix because there's only one SATA cable in there so I had to unplug the other harddrive.

Different motherboard, cable, power supply and operating system. Same error when I run "e2fsck -n -b 8193 /dev/sda5".

```
ata1: command 0xc8 timeout, stat 0xd0 host_stat 0x21
ata1: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata1: status=0xd0 { Busy }
sd 0:0:0:0: SCSI error: return code = 0x8000002
sda: Current: sense key=0xb
    ASC=0x47 ASCQ=0x0
Info fld=0x442f
end_request: I/O error, dev sda, sector 17455
Buffer I/O error on device sda5, logical block 8192
```

The error keeps repeating, with the logical block increasing each time (8193, 8194, 8195, then 16384, 16385, 16386...)
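The sector and logical block numbers in these logs can be cross-checked with a little arithmetic. The sketch below assumes 512-byte sectors, a 1 KiB block size for the kernel's "logical block" numbering, and a partition start at sector 1071. That start value is inferred from the logged numbers, not read from the partition table, so treat it as an assumption.

```python
SECTOR_SIZE = 512        # bytes per sector, as used by the kernel's sector numbers
PARTITION_START = 1071   # ASSUMPTION: inferred from the logs, not from fdisk output
BUFFER_BLOCK = 1024      # ASSUMPTION: block size behind "logical block" in the log


def logical_block(disk_sector, start=PARTITION_START, block=BUFFER_BLOCK):
    """Map an absolute disk sector to a partition-relative logical block."""
    return (disk_sector - start) * SECTOR_SIZE // block


# Sectors taken from the end_request lines quoted in this thread.
for sector in (1073, 17455, 327681075):
    print(sector, "->", logical_block(sector))
```

Under those assumptions both logged pairs agree (17455 gives 8192 and 327681075 gives 163840002), and the failing sector 1073 from the first mount attempt is block 1, which is where the primary ext3 superblock lives (byte offset 1024).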

----------

## Keruskerfuerst

I am running Gentoo 2006.0 and Suse 10.0 64 Bit on an AMD Athlon 64 3200 with an nForce 3 250 mainboard from MSI. There are 2 Samsung SP1614C HDDs connected.

The drives did not return errors with the standard SATA cables, but it seemed that the drives could not read or write some sectors.

After switching to the Revoltec SATA cables (they are transparent: you can see the shield for each wire), the HDDs do work without any problems.

Last edited by Keruskerfuerst on Sat Jul 29, 2006 6:58 am; edited 2 times in total

----------

## BlastM

 *Keruskerfuerst wrote:*   

> I am running Gentoo 2006.0 and Suse 10.0 64 Bit on an AMD Athlon 64 3200 with an nForce 3 250 mainboard from MSI. There are 2 Samsung SP1614C HDDs connected.
> 
> The drives did not return errors with the standard SATA cables, but it seemed that the drives could not read or write some sectors.
> 
> After switching to the Revoltec SATA cables (they are transparent: you can see the shield for each wire), the HDDs do work without any problems.

 

Thanks heaps mate, I'll source these cables immediately. It's pretty much my last resort, at least before I try using Windows XP and a dd-like utility there, or OpenBSD or something.

----------

## gasparov

Hi,

Could it be temperature? It is the biggest reason for new drives failing. In the last two months I had two HDDs (Maxtor) with various I/O errors, and the temperature was almost 55-60 C. After that I went for RAID.

I suggest you try ddrescue; it works well.

----------

## NeddySeagoon

BlastM,

```
e2fsck -n -b 8193 /dev/sda5
```

is not right for you. The 8193 should be any one of the numbers from your earlier post:

```
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872
```

Oh, and -n won't fix it. If e2fsck runs clean when called with the -n -b options and a valid number, you must run it again without the -n to do the repair.

You really do need to know that that's the only change e2fsck is going to make to your partition. It can make things much worse if it wants to do other things.

----------

## Keruskerfuerst

You should run e2fsck -b xxx.

xxx depends on the block size:

1K blocksize: xxx=8193

2K blocksize: xxx=16384

4K blocksize: xxx=32768

----------

## BlastM

I did mention that I tried e2fsck -b $BACKUPSUPERBLOCK -n /dev/sda5 with a number of the superblocks that mkfs.ext3 -n /dev/sda5 said it would create.

There was no feedback from e2fsck and the harddrive activity LED only flashed intermittently, whilst dmesg was reporting all those errors. After 20 mins of this I eventually did a killall -9 e2fsck. I tried this a few times.

However the state of the filesystem structure is not my main concern right now. I just want to dd all the bits off there before it gets worse (if that is possible). I've tried many kernels with libata support which aren't co-operating, so my next step will be trying BSD and then Windows :o simply because they have different implementations of SATA and might be less sensitive to this problem. I'll also buy the cables mentioned above.

Thanks everyone.

----------

## BlastM

On that note, can anyone recommend me a BSD live CD that has support for SATA and something like dd or dd_rescue? I wouldn't have the first clue about what to look for in a BSD distro.

----------

## NeddySeagoon

BlastM,

You want dd_rhelp, which is a wrapper for dd_rescue; it takes away the babysitting that dd_rescue needs.

I'm not aware of a liveCD with it though.

----------

## BlastM

OK so I gave FreeBSD a go (more specifically, the Frenzy 1.0 Live CD).

It saw the partitions and so forth, wouldn't mount the ext3 filesystem but mounted the vfat partition. The CD also had dd_rescue on it, so I was able to download and compile dd_rhelp pretty easily.

The error was pretty much similar to the one in Gentoo:

```
=== COMPUTED VERSION OF LOG :
chunk:
logcontent:
eof:nothing
=== launched via 'dd_rhelp' at 0k, 0 >>> ===
dd_rescue: (info): ipos:     12688.0k, opos:     12688.0k, xferd:     12688.0k
                   errs:      0, errxfer:         0.0k, succxfer:     12688.0k
             +curr.rate:     1083kB/s, avg.rate:     2838kB/s, avg.load:  0.9%
dd_rescue: (info): ipos:     12703.0k, opos:     12703.0k, xferd:     12703.0k
                *  errs:      0, errxfer:         0.0k, succxfer:     12703.0k
             +curr.rate:        4kB/s, avg.rate:     1527kB/s, avg.load:  0.5%
dd_rescue: (warning): /dev/ad6s5 (12703.0k): Input/output error!
dd_rescue: (info): ipos:     12703.5k, opos:     12703.5k, xferd:     12703.5k
                *  errs:      1, errxfer:         0.5k, succxfer:     12703.0k
             +curr.rate:        0kB/s, avg.rate:     1061kB/s, avg.load:  0.3%
dd_rescue: (warning): /dev/ad6s5 (12703.5k): Input/output error!
dd_rescue: (info): ipos:     12717.5k, opos:     12717.5k, xferd:     12717.5k
                *  errs:      2, errxfer:         1.0k, succxfer:     12716.5k
             +curr.rate:        4kB/s, avg.rate:      810kB/s, avg.load:  0.2%
dd_rescue: (warning): /dev/ad6s5 (12717.5k): Input/output error!
dd_rescue: (info): ipos:     12719.5k, opos:     12719.5k, xferd:     12719.5k
                *  errs:      3, errxfer:         1.5k, succxfer:     12718.0k
             +curr.rate:        0kB/s, avg.rate:      409kB/s, avg.load:  0.1%
dd_rescue: (warning): /dev/ad6s5 (12719.5k): Device not configured!
dd_rescue: Last error fatal! Exiting ...
```

The command read the first 12MB, as does dd_rhelp in Gentoo, after which the program bailed (instead of hanging indefinitely) and the entire harddrive disappeared, unable to be accessed at all until the computer was rebooted.

Here are the relevant dmesg excerpts.

```
# Detecting my two SATA drives on boot
ad4: 238475MB <Seagate ST3250820AS 3.AAC> at ata2-master SATA150
ad6: 190781MB <Seagate ST3200822AS 3.01> at ata3-master SATA150
```

```
# Trying to mount the dying partition
WARNING: mount of ad6s5 denied due to unsupported optional features
WARNING: mount of ad6s5 denied due to unsupported optional features
WARNING: mount of ad6s5 denied due to unsupported optional features
WARNING: mount of ad6s5 denied due to unsupported optional features
```

```
# Running dd_rhelp
ad6: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=26415
ad6: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=26477
ad6: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=26478
ad6: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=26506
ad6: FAILURE - device detached
subdisk6: detached
ad6: detached
```
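The dd_rescue positions and the LBAs in the READ_DMA failures appear to describe the same spots on the disk. A rough Python sketch of the conversion, assuming 512-byte sectors and a partition start at sector 1071 (inferred from the numbers in this thread, not read from the partition table):

```python
def lba_for_position(pos_kib, partition_start=1071, sector_size=512):
    """Convert a dd_rescue position (in KiB into the partition) to a disk LBA.

    partition_start=1071 is an ASSUMPTION inferred from the logs.
    """
    return partition_start + int(pos_kib * 1024) // sector_size


# Positions taken from the dd_rescue warnings above.
for pos in (12703.0, 12703.5, 12717.5):
    print(f"{pos}k -> LBA {lba_for_position(pos)}")
```

These come out as 26477, 26478 and 26506, matching three of the four logged LBAs. The first failure (26415) presumably belongs to a larger read that started a little earlier.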

I don't know much about FreeBSD but I Googled those READ_DMA error codes and nothing related to my problem came up.

I think the drive's fucked.
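For anyone puzzling over the same codes: the `stat`/`status` and `error` values are bit fields from the drive's ATA status and error registers. The Python sketch below decodes them using the classic ATA bit names as far as I know them (some bits are obsolete in newer revisions of the standard), so treat the tables as an assumption to check against the ATA specification.

```python
# ASSUMPTION: classic ATA register bit names; verify against the ATA spec.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
              0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "AMNF"}


def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in table.items() if value & bit]


print(decode(0xd0, STATUS_BITS))  # Linux: stat 0xd0
print(decode(0x51, STATUS_BITS))  # FreeBSD: status=51
print(decode(0x40, ERROR_BITS))   # FreeBSD: error=40
```

Read that way, 0xd0 on Linux is a drive stuck busy (BSY, DRDY, DSC), while FreeBSD's 0x51 with error 0x40 is an uncorrectable read error reported by the drive itself, which agrees with the READY,DSC,ERROR and UNCORRECTABLE text FreeBSD prints.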

----------

## NeddySeagoon

BlastM,

The drive goes busy when you get the error and there is no way out except a reboot.

However, dd_rhelp should start from a new place on the drive and read more information to add to what it's already got, on each subsequent run. If it doesn't, you will have to babysit dd_rescue and do the binary search by hand.
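The search-by-hand idea can be sketched in a few lines of Python. This is not dd_rhelp's actual algorithm, just an illustration of the approach: read in large chunks, halve the chunk when a read fails, and record the sectors that never read. `rescue` and `make_flaky_reader` are hypothetical names, and the "device" is simulated in memory so the example runs anywhere.

```python
def rescue(read_at, size, chunk=4096, min_chunk=512):
    """Copy what can be read; return (image, list of bad (offset, length))."""
    image = bytearray(size)   # unreadable areas stay zero-filled
    bad = []
    pending = [(off, min(chunk, size - off)) for off in range(0, size, chunk)]
    pending.reverse()         # so pop() yields ascending offsets
    while pending:
        off, length = pending.pop()
        try:
            data = read_at(off, length)
            image[off:off + len(data)] = data
        except OSError:
            if length <= min_chunk:
                bad.append((off, length))     # give up on this sector
            else:
                half = length // 2            # retry as two smaller reads
                pending.append((off + half, length - half))
                pending.append((off, half))
    return image, bad


def make_flaky_reader(data, bad_sectors, sector=512):
    """Simulated device: any read touching a bad sector raises OSError."""
    def read_at(off, length):
        first, last = off // sector, (off + length - 1) // sector
        if any(s in bad_sectors for s in range(first, last + 1)):
            raise OSError("simulated I/O error")
        return data[off:off + length]
    return read_at


data = bytes(range(1, 256)) * 33          # made-up payload, longer than 8 KiB
data = data[:8192]                        # trim to exactly 16 sectors
reader = make_flaky_reader(data, bad_sectors={3, 9})
image, bad = rescue(reader, len(data))
print(bad)
```

Against a real device, `read_at` would wrap something like `os.pread` on the opened device node. Because this drive drops off the bus after an error, each run would also have to save its progress and resume after a reboot, which the sketch does not attempt.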

----------

## BlastM

Yeah it should, but it doesn't.

I think I'll have to spend the weekend with a pot of coffee and a comfy chair, playing with dd_rescue.

Thanks for your help Neddy. :)

----------

