# [Solved] SATA drives mess

## creaker

I faced with an amazing error. It even isn't an error, I do not know how to classify it.

OK, I have a server that running samba. Box has two sata drives onboard - sda and sdb.

I listened to music over the samba share and suddenly amarok stoped to play. I checked for shares and found that one of two shares disappears, mount doesn't show share mounted. OK, I logged into server over ssh and checked samba - seems it works. Dmesg doesn't show any fault. But running blkid I noticed very strange thing: one of my drives for some strange reason became /dev/sdc... Initially it was  /dev/sda.

```
server ~ # blkid

/dev/sdb1: LABEL="GENTOO" UUID="63731ef5-a607-45c0-8cfb-36b11ad07a1f" TYPE="ext4" PARTUUID="0d310d30-01" 

/dev/sdb2: LABEL="TORR" UUID="31d9a576-545f-41f1-a859-e5482b2a421c" TYPE="ext4" PARTUUID="0d310d30-02" 

/dev/sdc1: LABEL="SHARE" UUID="b53719e3-20c0-468b-95eb-f54b4b7b9844" TYPE="ext4" PARTUUID="805607c7-01"
```

fdisk -l shows the same devices: sdb & sdc instead of sda & sdb.

I rebooted the server and checked it again - it is sda & sdb as it should be:

```
server ~ # blkid

/dev/sda1: LABEL="SHARE" UUID="b53719e3-20c0-468b-95eb-f54b4b7b9844" TYPE="ext4" PARTUUID="805607c7-01" 

/dev/sdb1: LABEL="GENTOO" UUID="63731ef5-a607-45c0-8cfb-36b11ad07a1f" TYPE="ext4" PARTUUID="0d310d30-01" 

/dev/sdb2: LABEL="TORR" UUID="31d9a576-545f-41f1-a859-e5482b2a421c" TYPE="ext4" PARTUUID="0d310d30-02"

```

Last week I got such a joke 5 or 6 times.

Who and why shifts drives enumeration (from sda to sdc)?Last edited by creaker on Sun Sep 21, 2014 5:01 pm; edited 1 time in total

----------

## lagalopex

Have a look at the output of dmesg when this happens.

You may also check your sda/sdc's smart values. (sys-apps/smartmontools => smartctl -a /dev/sda) and perhaps running some tests.

----------

## Anon-E-moose

 *creaker wrote:*   

> I faced with an amazing error. It even isn't an error, I do not know how to classify it
> 
> ...
> 
> Who and why shifts drives enumeration (from sda to sdc)?

 

That can happen when the sata controller quits talking to the device (whatever reason) then reconnects.

There should have been a corresponding sata message in dmesg if that was the case.

I've seen it when the power supply wasn't working properly (I replaced it and problem went away)

or if one is overclocking the bus (sata controllers).

I suppose it could also happen if some component of the mb is going bad (south bridge IIRC)

----------

## creaker

ОК, have to wait for it occurs once more to check dmesg, though I checked dmesg and doesn't notice any message added to it since boot.

----------

## lagalopex

The smart values can be checked at any time, the scores are saved within the disks firmware.

Your sig mentions 250Gb hdd... this size is quite old. Maybe the disks just fail? (=> look at the smart values!)

----------

## creaker

I do not see any errors in smartctl output. Complete smart report:

http://pastebin.com/Ba11HCXp

No any fault or slowness on data accessing/filesystem navigation, nothing unusual.

----------

## NeddySeagoon

creaker,

The failure mechanism is that the drive is lost to the system but udev has not noticed the disconnect before the drive comes back.

As /dev/sda han not been released, the drive is allocated /dev/sdc.  When udev catches up, it removes /dev/sda

I would expect dmesg to contain a similar set of messages as it produces for a USB drive disconnect and reconnect, unless the box rebooted.

Is the uptime correct? 

The smart log looks good.

----------

## creaker

I've caught it. The dmesg tail:

```
[  958.706670] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x380000 action 0x6

[  958.706673] ata1.00: BMDMA stat 0x25

[  958.706676] ata1: SError: { 10B8B Dispar BadCRC }

[  958.706679] ata1.00: failed command: READ DMA

[  958.706684] ata1.00: cmd c8/00:00:88:61:1a/00:00:00:00:00/ef tag 0 dma 131072 in

         res 51/84:f0:98:61:1a/00:00:00:00:00/ef Emask 0x10 (ATA bus error)

[  958.706686] ata1.00: status: { DRDY ERR }

[  958.706688] ata1.00: error: { ICRC ABRT }

[  958.706694] ata1: hard resetting link

[  958.706695] ata1: nv: skipping hardreset on occupied port

[  959.190575] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

[  959.250708] ata1.00: model number mismatch 'ST3160318AS' != '3160318AS                             \xffffff80'

[  959.250712] ata1.00: revalidation failed (errno=-19)

[  959.250716] ata1: limiting SATA link speed to 1.5 Gbps

[  959.250718] ata1.00: limiting speed to UDMA/133:PIO3

[  964.193775] ata1: hard resetting link

[  964.193776] ata1: nv: skipping hardreset on occupied port

[  964.684112] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

[  964.704224] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)

[  964.704228] ata1.00: revalidation failed (errno=-5)

[  964.704229] ata1.00: disabled

[  969.687311] ata1: hard resetting link

[  970.597921] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

[  970.618030] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)

[  975.601123] ata1: hard resetting link

[  981.554949] ata1: link is slow to respond, please be patient (ready=0)

[  985.637577] ata1: SRST failed (errno=-16)

[  985.637580] ata1: hard resetting link

[  991.591413] ata1: link is slow to respond, please be patient (ready=-19)

[  995.674039] ata1: SRST failed (errno=-16)

[  995.674042] ata1: hard resetting link

[ 1001.627872] ata1: link is slow to respond, please be patient (ready=-19)

[ 1030.746621] ata1: SRST failed (errno=-16)

[ 1030.746624] ata1: limiting SATA link speed to 1.5 Gbps

[ 1030.746627] ata1: hard resetting link

[ 1031.627525] ata1: SATA link down (SStatus 100 SControl 310)

[ 1031.627542] sd 0:0:0:0: [sda]  

[ 1031.627543] Result: hostbyte=0x00 driverbyte=0x08

[ 1031.627545] sd 0:0:0:0: [sda]  

[ 1031.627546] Sense Key : 0xb [current] [descriptor]

[ 1031.627549] Descriptor sense data with sense descriptors (in hex):

[ 1031.627550]         72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00 

[ 1031.627555]         0f 1a 61 98 

[ 1031.627558] sd 0:0:0:0: [sda]  

[ 1031.627559] ASC=0x47 ASCQ=0x0

[ 1031.627562] sd 0:0:0:0: [sda] CDB: 

[ 1031.627562] cdb[0]=0x28: 28 00 0f 1a 61 88 00 01 00 00

[ 1031.627568] end_request: I/O error, dev sda, sector 253387144

[ 1031.627579] sd 0:0:0:0: rejecting I/O to offline device

[ 1031.627581] sd 0:0:0:0: killing request

[ 1031.627586] ata1: EH complete

[ 1031.627594] sd 0:0:0:0: rejecting I/O to offline device

[ 1031.627596] sd 0:0:0:0: [sda] killing request

[ 1031.627599] sd 0:0:0:0: rejecting I/O to offline device

[ 1031.627606] sd 0:0:0:0: [sda] Unhandled error code

[ 1031.627608] sd 0:0:0:0: [sda]  

[ 1031.627609] Result: hostbyte=0x01 driverbyte=0x00

[ 1031.627610] sd 0:0:0:0: [sda] CDB: 

[ 1031.627611] cdb[0]=0x28: 28 00 0f 1a 62 88 00 01 00 00

[ 1031.627615] end_request: I/O error, dev sda, sector 253387400

[ 1031.627637] sd 0:0:0:0: rejecting I/O to offline device

[ 1031.627641] sd 0:0:0:0: rejecting I/O to offline device

[ 1031.627645] sd 0:0:0:0: rejecting I/O to offline device

[ 1031.627647] sd 0:0:0:0: [sda] READ CAPACITY(16) failed

[ 1031.627649] sd 0:0:0:0: [sda]  

[ 1031.627650] Result: hostbyte=0x01 driverbyte=0x00

[ 1031.627652] sd 0:0:0:0: [sda] Sense not available.

[ 1031.627655] sd 0:0:0:0: rejecting I/O to offline device

[ 1031.627658] sd 0:0:0:0: rejecting I/O to offline device

[ 1031.627661] sd 0:0:0:0: rejecting I/O to offline device

[ 1031.627663] sd 0:0:0:0: [sda] READ CAPACITY failed

[ 1031.627665] sd 0:0:0:0: [sda]  

[ 1031.627665] Result: hostbyte=0x01 driverbyte=0x00

[ 1031.627667] sd 0:0:0:0: [sda] Sense not available.

[ 1031.627670] sd 0:0:0:0: rejecting I/O to offline device

[ 1031.627673] sd 0:0:0:0: rejecting I/O to offline device

[ 1031.627678] sd 0:0:0:0: rejecting I/O to offline device

[ 1031.627681] sd 0:0:0:0: rejecting I/O to offline device

[ 1031.627684] sd 0:0:0:0: [sda] Asking for cache data failed

[ 1031.627686] sd 0:0:0:0: [sda] Assuming drive cache: write through

[ 1031.627691] sda: detected capacity change from 160041885696 to 0

[ 1031.627715] ata1.00: detaching (SCSI 0:0:0:0)

[ 1031.630278] sd 0:0:0:0: [sda] Stopping disk

[ 1031.630303] sd 0:0:0:0: [sda] START_STOP FAILED

[ 1031.630306] sd 0:0:0:0: [sda]  

[ 1031.630307] Result: hostbyte=0x04 driverbyte=0x00

[ 1091.849488] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0

[ 1091.849936] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0

[ 1091.850304] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0

[ 1091.850687] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0

[ 1091.851056] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0

[ 1091.851434] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0

[ 1091.851789] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7602198: comm smbd: reading directory lblock 0

```

sda drive lost, but was not reassigned as sdc this time. blkid shows only sdb.

I did not reboot machine and trying to get drive recognized again.

----------

## creaker

Scanned for lost drive at all available hosts:

```
server ~ # echo "0 0 0" > /sys/class/scsi_host/host0/scan

server ~ # echo "0 0 0" > /sys/class/scsi_host/host1/scan 

server ~ # echo "0 0 0" > /sys/class/scsi_host/host2/scan 

server ~ # echo "0 0 0" > /sys/class/scsi_host/host3/scan
```

it produced these lines in dmesg:

```
[ 2161.534570] ata1: exception Emask 0x10 SAct 0x0 SErr 0x19d0000 action 0xf

[ 2161.534576] ata1: SError: { PHYRdyChg CommWake 10B8B Dispar LinkSeq TrStaTrns }

[ 2161.534585] ata1: hard resetting link

[ 2162.435243] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

[ 2162.455350] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)

[ 2167.438442] ata1: hard resetting link

[ 2168.349050] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

[ 2168.369152] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)

[ 2168.369157] ata1: limiting SATA link speed to 1.5 Gbps

[ 2173.352248] ata1: hard resetting link

[ 2174.262857] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

[ 2174.282964] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)

[ 2179.266060] ata1: hard resetting link

[ 2180.176662] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

[ 2180.176669] ata1: EH complete

[ 2210.359627] ata2: hard resetting link

[ 2210.359632] ata2: nv: skipping hardreset on occupied port

[ 2210.906448] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)

[ 2210.966826] ata2.00: configured for UDMA/133

[ 2210.966829] ata2: EH complete

[ 2234.335838] ata3: soft resetting link

[ 2234.502662] ata3: EH complete

[ 2249.458048] ata4: port disabled--ignoring

[ 2249.458053] ata4: EH complete
```

Added:

Tried to scan few minutes later, this time scaned successfully, drive found:

```
[ 3181.449101] ata1: hard resetting link

[ 3182.351889] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

[ 3182.372186] ata1.00: ATA-8: ST3160318AS, CC38, max UDMA/133

[ 3182.372190] ata1.00: 312581808 sectors, multi 16: LBA48 NCQ (depth 0/32)

[ 3182.412202] ata1.00: configured for UDMA/133

[ 3182.412207] ata1: EH complete

[ 3182.412304] scsi 0:0:0:0: Direct-Access     ATA      ST3160318AS      CC38 PQ: 0 ANSI: 5

[ 3182.412508] sd 0:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 GB/149 GiB)

[ 3182.412605] sd 0:0:0:0: [sdc] Write Protect is off

[ 3182.412609] sd 0:0:0:0: [sdc] Mode Sense: 00 3a 00 00

[ 3182.412643] sd 0:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

[ 3182.423360]  sdc: sdc1

[ 3182.423645] sd 0:0:0:0: [sdc] Attached SCSI disk
```

----------

## Akkara

I've seen that happen when a SATA cable is loose.  Try jiggling them while watching dmesg

----------

## Anon-E-moose

Everything between the exception and the hard reset means that the sata controller and the sata device lost communication.

As Akkara said it could be a loose cable.

Or it could be the other things I mentioned earlier.

I've swapped out my non-locking sata cables for the ones that lock, in the past.

 *creaker wrote:*   

> I did not reboot machine and trying to get drive recognized again.

 

I know you wanted to see if it would reconnect with a scan but

I recommend a reboot when the disconnect and reconnect happens.

I've had strange errors show up on the filesystems when I've had that happen in the past and not rebooted

especially when the filesystem was mounted when it disconnected/reconnected.

----------

## creaker

OK, I've replaced both power and data cords. Let see will it solve a problem.

Though I doubt it is a cable issue due to two reasons:

1. If it's a data cable failure it should be mentioned in smart output, but seems no data transmission issues there:

```
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE

...

...

199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
```

2. If it's a power cord issue dmesg will not show attempts to reconnect drive.

I've searched on web for 'exception Emask 0x10 SAct 0x0 SErr 0x19d0000 action 0xf' message and found thait may be kernel issue.

Anyway I replaced cables, have to see.

----------

## lagalopex

smartctl also mentions a possible firmware update CC49:

 *Quote:*   

> Firmware Version: CC38
> 
> …
> 
> ==> WARNING: A firmware update for this drive may be available,
> ...

 

you could run a long smart test. (But it should be successfull, as the error behaviour points to somewhere between hdd and kernel.)

----------

## creaker

Server used intensively for the last two days and seems cords replacing solved a problem.

Thanks for helping, guys!

----------

