# SSD drive and occasional failed command: WRITE FPDMA QUEUED

## tv007

I've bought an SSD as a replacement for the SATA drive that served as the root (/) disk in my home workstation. Everything seems to work fine, except that I get some strange NCQ errors about failed commands: either "READ FPDMA QUEUED" or "WRITE FPDMA QUEUED". It looks like this:

```
Jun 19 01:05:43 rimmer kernel: ata6: EH in SWNCQ mode,QC:qc_active 0x1F sactive 0x1F
Jun 19 01:05:43 rimmer kernel: ata6: SWNCQ:qc_active 0x1B defer_bits 0x4 last_issue_tag 0x1
Jun 19 01:05:43 rimmer kernel: dhfis 0x19 dmafis 0x19 sdbfis 0x0
Jun 19 01:05:43 rimmer kernel: ata6: ATA_REG 0x40 ERR_REG 0x0
Jun 19 01:05:43 rimmer kernel: ata6: tag : dhfis dmafis sdbfis sacitve
Jun 19 01:05:43 rimmer kernel: ata6: tag 0x0: 1 1 0 1
Jun 19 01:05:43 rimmer kernel: ata6: tag 0x1: 0 0 0 1
Jun 19 01:05:43 rimmer kernel: ata6: tag 0x3: 1 1 0 1
Jun 19 01:05:43 rimmer kernel: ata6: tag 0x4: 1 1 0 1
Jun 19 01:05:43 rimmer kernel: ata6.00: exception Emask 0x0 SAct 0x1f SErr 0x0 action 0x6 frozen
Jun 19 01:05:43 rimmer kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jun 19 01:05:43 rimmer kernel: ata6.00: cmd 61/10:00:14:d5:c8/00:00:0a:00:00/40 tag 0 ncq 8192 out
Jun 19 01:05:43 rimmer kernel: res 40/00:08:ec:f2:e8/84:00:02:00:00/40 Emask 0x4 (timeout)
Jun 19 01:05:43 rimmer kernel: ata6.00: status: { DRDY }
Jun 19 01:05:43 rimmer kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jun 19 01:05:43 rimmer kernel: ata6.00: cmd 61/38:08:34:d5:c8/00:00:0a:00:00/40 tag 1 ncq 28672 out
Jun 19 01:05:43 rimmer kernel: res 40/00:08:ec:f2:e8/84:00:02:00:00/40 Emask 0x4 (timeout)
Jun 19 01:05:43 rimmer kernel: ata6.00: status: { DRDY }
Jun 19 01:05:43 rimmer kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jun 19 01:05:43 rimmer kernel: ata6.00: cmd 61/08:10:74:d5:c8/00:00:0a:00:00/40 tag 2 ncq 4096 out
Jun 19 01:05:43 rimmer kernel: res 40/00:08:ec:f2:e8/84:00:02:00:00/40 Emask 0x4 (timeout)
Jun 19 01:05:43 rimmer kernel: ata6.00: status: { DRDY }
Jun 19 01:05:43 rimmer kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jun 19 01:05:43 rimmer kernel: ata6.00: cmd 61/08:18:54:d4:c8/00:00:0a:00:00/40 tag 3 ncq 4096 out
Jun 19 01:05:43 rimmer kernel: res 40/00:08:ec:f2:e8/84:00:02:00:00/40 Emask 0x4 (timeout)
Jun 19 01:05:43 rimmer kernel: ata6.00: status: { DRDY }
Jun 19 01:05:43 rimmer kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jun 19 01:05:43 rimmer kernel: ata6.00: cmd 61/08:20:f4:d4:c8/00:00:0a:00:00/40 tag 4 ncq 4096 out
Jun 19 01:05:43 rimmer kernel: res 40/00:08:ec:f2:e8/84:00:02:00:00/40 Emask 0x4 (timeout)
Jun 19 01:05:43 rimmer kernel: ata6.00: status: { DRDY }
Jun 19 01:05:43 rimmer kernel: ata6: hard resetting link
Jun 19 01:05:43 rimmer kernel: ata6: nv: skipping hardreset on occupied port
Jun 19 01:05:43 rimmer kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 19 01:05:43 rimmer kernel: ata6.00: configured for UDMA/133
Jun 19 01:05:43 rimmer kernel: ata6: EH complete

```

ata6 is the SSD. When the failed command is 'READ FPDMA QUEUED', it looks like this: http://pastebin.com/r1EedyuP. In particular, it always reports an invalid CHS sector:

```

Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0
Jun 18 22:47:52 rimmer kernel: ata6.00: device reported invalid CHS sector 0

```

I have no idea why this happens: the device stalls for a few seconds and then everything works fine again. I haven't had time to run fsck on the drive, but the smartctl output looks OK (the full output is here: http://pastebin.com/0Zx64tRs):

```
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
  4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       5
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       6
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       0
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       5
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       3883
226 Load-in_Time            0x0032   100   100   000    Old_age   Always       -       21
227 Torq-amp_Count          0x0032   100   100   000    Old_age   Always       -       0
228 Power-off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       2362
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       3883
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       3452

```

The only recommendations I've found were to use the 'libata.force=noncq' kernel parameter and to run 'hdparm -Q 1' on the drive. I've tried both; nothing changed, except that the number of failed commands logged matches the queue depth set with hdparm: with '-Q 5', the log contains 5 failed 'WRITE FPDMA QUEUED' commands.
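For reference, here is roughly what I tried, sketched out (assuming the SSD is sdb on port ata6, as in the logs above; adjust for your system):

```shell
# Disable NCQ for all ports at boot, via the kernel command line:
#   libata.force=noncq
# or only for the SSD's port:
#   libata.force=6:noncq
# At runtime, dropping the queue depth to 1 has the same effect as
# 'hdparm -Q 1' (needs root):
echo 1 > /sys/block/sdb/device/queue_depth
cat /sys/block/sdb/device/queue_depth   # should now report 1
```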

How can I fix this? Why is it happening? Basic system info:

 kernel: 2.6.36.1 (vanilla, but I've initially seen exactly the same problems with the current livecd)

 motherboard: Asus M2N-E (nvidia nforce-570 chipset)

 SSD: Intel 320 (120GB version)

 filesystem: reiserfs 3.6

I've checked that all the SATA cables are OK, and the original SATA drive worked just fine on the very same cable for several years. I've simply replaced it with the SSD.

I originally copied the data to the SSD using dd (the drives are exactly the same size), and IIRC there was no such error back then. Might be a coincidence, but it's kind of suspicious.

Any ideas what causes this and how to fix it?

----------

## tv007

I'm just wondering: when moving the data from the old HDD to the SSD, I copied the whole device using "dd" (the drives are exactly the same size). Could this be the problem? I know SSD drives need properly aligned partitions for optimal results (because of the 512 kB erase blocks), but I doubt that alone could cause such problems. Or might this be the real cause?

----------

## Hu

 *tv007 wrote:*   

> I've copied the whole device using "dd" (the drives are exactly of the same size). Could this be the problem?

 That was probably bad for your drive, whether or not it caused the problem you reported.  SSDs work much better when they know which areas contain useful data and which do not.  By writing to every sector via dd, you have convinced the SSD that it is "full", so now it will preserve every sector.  If your drive supports the TRIM command, you may be able to mitigate the damage that way.

----------

## tv007

 *Hu wrote:*   

> That was probably bad for your drive, whether or not it caused the problem you reported.  SSDs work much better when they know which areas contain useful data and which do not.  By writing to every sector via dd, you have convinced the SSD that it is "full", so now it will preserve every sector.  If your drive supports the TRIM command, you may be able to mitigate the damage that way.

 

Yes, I'm used to copying partitions like this, and I realized too late that it might be a problem for an SSD. Anyway, I don't think that should cause the I/O errors I've described. What should I do to fix it? I plan to add 'discard' to the mount options and rewrite the free space with zeroes (cat /dev/zero > file.tmp && rm file.tmp). That should do the trick, I guess?
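One caveat (my understanding, not something I've verified on this exact drive): writing zeroes does not by itself tell the SSD the space is free; the useful part is the TRIM issued when the filler file is deleted on a filesystem mounted with 'discard'. With a recent util-linux, `fstrim` does the same in one batch without all the writes. A sketch, assuming the filesystem is mounted at /mnt:

```shell
# Fill the free space, then delete the filler; with the 'discard'
# mount option the delete sends TRIM for all of the file's blocks.
dd if=/dev/zero of=/mnt/filler.tmp bs=1M   # runs until the fs is full
sync
rm /mnt/filler.tmp
# Alternative (util-linux 2.19+): batch-trim all free space directly.
fstrim -v /mnt
```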

I plan to repartition the drive to get proper alignment, and I'm thinking about a fresh install (I'm still on 32 bits and considering switching to 64 bits).

Anyway, I haven't seen the I/O errors for about two days; not sure what changed. Yesterday I flashed the BIOS on the motherboard, replaced the SATA cable, moved the drive to a separate power line (all the other drives are on the other one), changed the elevator to noop, etc. So far everything seems fine (and I hope it stays that way).

----------

## Hu

Some filesystems will automatically issue a TRIM when they are created.  If you use one of those, explicit clearing should not be necessary.

----------

## tv007

 *Hu wrote:*   

> Some filesystems will automatically issue a TRIM when they are created.  If you use one of those, explicit clearing should not be necessary.

 

Yes, I know (now). I found a nice article describing how to partition an SSD. I copied the data to another drive, repartitioned the SSD for proper partition alignment, and then created an ext4 filesystem, so now I have this:

```
$ fdisk -S 32 -H 32 /dev/sdb

Command (m for help): p

Disk /dev/sdb: 120.0 GB, 120034123776 bytes
32 heads, 32 sectors/track, 228946 cylinders, total 234441648 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x7c257c25

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048      133119       65536   83  Linux
/dev/sdb2          133120     8521727     4194304   82  Linux swap / Solaris
/dev/sdb3         8521728   234441647   112959960   83  Linux
```

Not sure why the "boot" partition (sdb1) starts at sector 2048 (1024 sectors, i.e. 512 kB, would have been enough), but otherwise the partitions are nicely aligned to 512 kB boundaries.
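A quick sanity check on those numbers (plain arithmetic, nothing read from the drive): a 512 kB boundary is 1024 sectors of 512 bytes, so each start sector from the fdisk output should be divisible by 1024.

```shell
# Check that each partition's start sector is a multiple of
# 1024 sectors = 512 kB. Start sectors taken from the fdisk output.
for start in 2048 133120 8521728; do
    if [ $((start % 1024)) -eq 0 ]; then
        echo "sector $start: aligned to 512 kB"
    else
        echo "sector $start: NOT aligned"
    fi
done
# all three report 'aligned to 512 kB'
```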

The ext4 filesystem was created like this:

```
mke2fs -t ext4 -E stripe-size=128 /dev/sdb3
```

so it should be nicely aligned too (128 x 4 kB blocks = 512 kB). AFAIK ext4 clears all the blocks when it's created, and I've mounted it with 'discard', so that part should be covered as well.

Hopefully this will make all those strange I/O errors go away ...

----------

## tv007

So, no luck: I just got a bunch of "WRITE FPDMA QUEUED" errors :(

The full dmesg output (including the I/O errors) is available here : http://pastebin.com/7pkreUCA

I really wonder how this can happen, because I've set the I/O scheduler to noop for the SSD, yet the errors still come from SWNCQ:

```
EXT4-fs (sdb3): re-mounted. Opts: discard,commit=0
ata6: EH in SWNCQ mode,QC:qc_active 0x7FFFFFFF sactive 0x7FFFFFFF
ata6: SWNCQ:qc_active 0x1E031 defer_bits 0x7FFE1FCE last_issue_tag 0x10
  dhfis 0xE031 dmafis 0x6010 sdbfis 0x0
ata6: ATA_REG 0x40 ERR_REG 0x0
ata6: tag : dhfis dmafis sdbfis sacitve
ata6: tag 0x0: 1 0 0 1
ata6: tag 0x4: 1 1 0 1
ata6: tag 0x5: 1 0 0 1
ata6: tag 0xd: 1 1 0 1
ata6: tag 0xe: 1 1 0 1
ata6: tag 0xf: 1 0 0 1
ata6: tag 0x10: 0 0 0 1
ata6.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
ata6.00: failed command: WRITE FPDMA QUEUED
ata6.00: cmd 61/10:00:10:d7:f0/00:00:05:00:00/40 tag 0 ncq 8192 out
...
```

```
rimmer ~ # cat /sys/block/sdb/queue/scheduler
[noop] deadline cfq
```
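For what it's worth, the elevator and NCQ are independent mechanisms: the scheduler reorders requests inside the kernel, while NCQ queues them inside the drive, so noop alone would not be expected to avoid SWNCQ. Both settings can be read from sysfs (sdb assumed, as elsewhere in this thread):

```shell
# The active scheduler is the one in brackets:
cat /sys/block/sdb/queue/scheduler
# The NCQ queue depth; 31 means full NCQ, 1 means effectively off:
cat /sys/block/sdb/device/queue_depth
```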

I have no idea what's wrong. It could be a hardware problem (e.g. a motherboard issue), but it was very reliable until today.
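Since the errors all come from the sata_nv driver's SWNCQ path, one more knob might be worth a try (an assumption on my part; check `modinfo sata_nv` on your kernel first): the driver has a `swncq` module parameter that disables its software-NCQ mode entirely.

```shell
# Hedged sketch: turn off sata_nv's SWNCQ mode.
# If the driver is built into the kernel, add to the kernel command line:
#   sata_nv.swncq=0
# If it is a module, create /etc/modprobe.d/sata_nv.conf containing:
#   options sata_nv swncq=0
# Verify the parameter exists on your kernel before relying on it:
modinfo sata_nv | grep -i swncq
```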

----------

## gorkypl

Could you try with the latest kernel?

----------

## tv007

 *gorkypl wrote:*   

> Could you try with the latest kernel?

 

I already upgraded to 2.6.38-gentoo-r6 (the latest stable version) two days ago. The problem is still there. Some additional info:

dmesg : http://pastebin.com/uHvTVmss

.config : http://pastebin.com/PYeLKaBL

lspci : http://pastebin.com/nQPS0rxU

smartctl : http://pastebin.com/DwJfxdTK

I've started a new thread on the lkml mailing list, https://lkml.org/lkml/2011/6/22/476, no reply yet.

It seems this might be a SATA chipset glitch (not sure why it never failed with the traditional HDD; maybe the SSD is fast enough to trigger a race condition). I have an unused Promise FastTrak TX4 controller, so I'll try using it instead of the onboard Nvidia MCP55 chipset.

----------

