# SCSI parity errors on 3 SATA disks.  Hardware? Kernel?

## Gilbo

The system is an NForce4 Ultra chipset.  I've got (1) SCSI disk support and (2) the low-level AMD/NVidia SATA driver configured in the kernel.

Here's the relevant crankiness from dmesg:

```
ata4: command 0x35 timeout, stat 0xd0 host_stat 0x21

ata4: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00

ata4: status=0xd0 { Busy }

sd 3:0:0:0: SCSI error: return code = 0x8000002

sdd: Current: sense key: Aborted Command

    Additional sense: Scsi parity error

end_request: I/O error, dev sdd, sector 490234559

ATA: abnormal status 0xD0 on port 0x967

ATA: abnormal status 0xD0 on port 0x967

ATA: abnormal status 0xD0 on port 0x967

ata2: command 0x35 timeout, stat 0xd0 host_stat 0x21

ata2: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00

ata2: status=0xd0 { Busy }

sd 1:0:0:0: SCSI error: return code = 0x8000002

sdb: Current: sense key: Aborted Command

    Additional sense: Scsi parity error

end_request: I/O error, dev sdb, sector 490223295

ATA: abnormal status 0xD0 on port 0x977

ATA: abnormal status 0xD0 on port 0x977

ATA: abnormal status 0xD0 on port 0x977

ata1: command 0x35 timeout, stat 0xd0 host_stat 0x21

ata1: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00

ata1: status=0xd0 { Busy }

sd 0:0:0:0: SCSI error: return code = 0x8000002

sda: Current: sense key: Aborted Command

    Additional sense: Scsi parity error

end_request: I/O error, dev sda, sector 490234559

ATA: abnormal status 0xD0 on port 0x9F7

ATA: abnormal status 0xD0 on port 0x9F7

ATA: abnormal status 0xD0 on port 0x9F7

ata4: command 0xea timeout, stat 0xd0 host_stat 0x0

ata4: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00

ata4: status=0xd0 { Busy }

raid5: Disk failure on sdd1, disabling device. Operation continuing on 4 devices

ata2: command 0xea timeout, stat 0xd0 host_stat 0x0

ata2: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00

ata2: status=0xd0 { Busy }

raid5: Disk failure on sdb1, disabling device. Operation continuing on 3 devices

ata1: command 0xea timeout, stat 0xd0 host_stat 0x0

ata1: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00

ata1: status=0xd0 { Busy }

raid5: Disk failure on sda1, disabling device. Operation continuing on 2 devices
```

As you can see sda, sdb, & sdd are all returning these parity errors.  They're part of a new RAID 5 array I was trying to create.

I'm leaning towards hardware (what are the odds of 3/4 new disks being bad?), but I don't know if there could be a software reason for this.  Cables?  Motherboard (one SATA disk seems okay)?

----------

## NeddySeagoon

Gilbo,

The nForce4 provides fakeraid, so you have a choice of dmraid or kernel RAID.

Which one have you chosen?

The error comes just after 233 GiB (251 GB) into sda and sdd.

If your drives are not that big, you are trying to read off the end of the drives.
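As a quick check of that arithmetic, here is a minimal Python sketch that converts the failing sector numbers from the dmesg output into byte offsets. The 512-byte sector size is an assumption at this point in the thread, and the function names are made up for illustration.

```python
SECTOR_BYTES = 512  # assumed logical sector size


def sector_to_offsets(sector):
    """Return the byte offset of a sector, plus the same offset in GB and GiB."""
    byte_offset = sector * SECTOR_BYTES
    return {
        "bytes": byte_offset,
        "GB": byte_offset / 10**9,   # decimal gigabytes, as drive vendors count
        "GiB": byte_offset / 2**30,  # binary gibibytes
    }


def reads_past_end(sector, total_sectors):
    """True if the sector number lies beyond the last addressable sector."""
    return sector >= total_sectors


# Failing sectors taken from the end_request lines in the dmesg output.
for dev, sector in [("sda", 490234559), ("sdb", 490223295), ("sdd", 490234559)]:
    off = sector_to_offsets(sector)
    print(f"{dev}: sector {sector} = {off['GB']:.2f} GB = {off['GiB']:.2f} GiB")
```

To decide whether a read really runs off the end, `reads_past_end` needs the drive's total sector count, which `fdisk -lu` reports for each disk.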

----------

## Gilbo

I chose kernel RAID.

I should have looked at the sector indication --I didn't even notice it.  I think you're right about reading off the end of the disks.  The 4 SATA disks are the same make (250GB Maxtors), but two have slightly different sizes.  The first disk in the RAID is a 300GB ATA Maxtor, with the first partition set to the size of the smallest 250GB SATA disk.  I thought I got everything exactly right, but it sounds like 3 partitions are set slightly too small.

However, you can see the partitioning scheme below.  One of the failing partitions (sdb1) has the exact same number of sectors as the good disks.  The other two failures (sda1 & sdd1) both have more sectors than necessary.  I must be missing something.  I'm going to try some repartitioning.

Why would the kernel try to read off the end of the disks?

```
Server01 gilbo # fdisk -lu /dev/hda

Disk /dev/hda: 300.0 GB, 300090728448 bytes

255 heads, 63 sectors/track, 36483 cylinders, total 586114704 sectors

Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System

/dev/hda1              63   490223474   245111706   fd  Linux raid autodetect

/dev/hda2       490223475   492223475     1000000+  82  Linux swap / Solaris

/dev/hda3   *   492223476   492295859       36192   83  Linux

/dev/hda4       492295860   586099394    46901767+  83  Linux

Server01 gilbo # fdisk -lu /dev/sda

Disk /dev/sda: 251.0 GB, 251000193024 bytes

255 heads, 63 sectors/track, 30515 cylinders, total 490234752 sectors

Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System

/dev/sda1              63   490234751   245117344+  fd  Linux raid autodetect

Server01 gilbo # fdisk -lu /dev/sdb

Disk /dev/sdb: 251.0 GB, 251000193024 bytes

255 heads, 63 sectors/track, 30515 cylinders, total 490234752 sectors

Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System

/dev/sdb1              63   490223474   245111706   fd  Linux raid autodetect

Server01 gilbo # fdisk -lu /dev/sdc

Disk /dev/sdc: 251.0 GB, 251000193024 bytes

255 heads, 63 sectors/track, 30515 cylinders, total 490234752 sectors

Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System

/dev/sdc1              63   490223474   245111706   fd  Linux raid autodetect

Server01 gilbo # fdisk -lu /dev/sdd

Disk /dev/sdd: 251.0 GB, 251000193024 bytes

255 heads, 63 sectors/track, 30515 cylinders, total 490234752 sectors

Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System

/dev/sdd1              63   490234751   245117344+  fd  Linux raid autodetect

```
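To compare the partitions without eyeballing the columns, a small Python sketch like the one below computes each partition's sector count from the Start and End values in the fdisk output above. The dictionary is just a hand-copied transcription of those values, and the helper name is hypothetical.

```python
# (start, end) sectors copied by hand from the fdisk -lu output above.
partitions = {
    "hda1": (63, 490223474),
    "sda1": (63, 490234751),
    "sdb1": (63, 490223474),
    "sdc1": (63, 490223474),
    "sdd1": (63, 490234751),
}


def sector_count(start, end):
    """Number of sectors in a partition; both endpoints are inclusive."""
    return end - start + 1


sizes = {name: sector_count(*span) for name, span in partitions.items()}
smallest = min(sizes.values())

# Partitions that are larger than the smallest member, and by how many sectors.
oversized = {name: n - smallest for name, n in sizes.items() if n > smallest}

for name, n in sizes.items():
    print(f"{name}: {n} sectors (+{n - smallest} over the smallest)")
```

With these numbers, hda1, sdb1 and sdc1 come out identical, while sda1 and sdd1 are each 11277 sectors larger.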

Last edited by Gilbo on Fri Jun 09, 2006 6:46 pm; edited 1 time in total

----------

## Gilbo

While trying to alter the partition tables of the damaged disks I discovered that I do not appear to be able to access them.

I can read the table once, but when I try to write a new partition table it takes much longer than usual, then says it works, but doesn't.  Attempting to reread the partition table returns "Unable to read /dev/sda" after a long time.  

dmesg:

```
ata1: command 0x35 timeout, stat 0xd0 host_stat 0x21

ata1: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00

ata1: status=0xd0 { Busy }

sd 0:0:0:0: SCSI error: return code = 0x8000002

sda: Current: sense key=0xb

    ASC=0x47 ASCQ=0x0

end_request: I/O error, dev sda, sector 0

Buffer I/O error on device sda, logical block 0

lost page write due to I/O error on sda

SCSI device sda: 490234752 512-byte hdwr sectors (251000 MB)

sda: Write Protect is off

sda: Mode Sense: 00 3a 00 00

SCSI device sda: drive cache: write back

 sda:<4>ATA: abnormal status 0xD0 on port 0x9F7

ATA: abnormal status 0xD0 on port 0x9F7

ATA: abnormal status 0xD0 on port 0x9F7

ata1: command 0x25 timeout, stat 0xd0 host_stat 0x21

ata1: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00

ata1: status=0xd0 { Busy }

sd 0:0:0:0: SCSI error: return code = 0x8000002

sda: Current: sense key=0xb

    ASC=0x47 ASCQ=0x0

end_request: I/O error, dev sda, sector 0

Buffer I/O error on device sda, logical block 0

ATA: abnormal status 0xD0 on port 0x9F7

ATA: abnormal status 0xD0 on port 0x9F7

ATA: abnormal status 0xD0 on port 0x9F7

```

I was hoping this was just a misconfigured partition problem, but now I'm not so sure.  Hardware?  I'm going to check cables.

----------

## Gilbo

Yet more information.  I shut down the system and unplugged and replugged all the cables to ensure they were firmly seated.  They all were.  When I rebooted I was able to partition the disks without any errors.  This is disturbing; I'd be happier with consistency.

(NOTE: shutting down the system & powering off the PSU without unplugging the cables does not reset the disks and allow them to be written to.  How can that make a difference when there's no power?  How do they know I unplugged the SATA cable?  Crazy...)

Question:

The kernel was trying to read past the end of the disks.

1.  Why would it do that?

2.  How can I stop it from doing that when I attempt to set up the array again?

Last edited by Gilbo on Fri Jun 09, 2006 8:08 pm; edited 1 time in total

----------

## NeddySeagoon

Gilbo,

The kernel was very close to the end of the disk but not trying to read off the end.

To do the sums, you need to know that 1 block = 2 sectors, and that the error reports absolute disk sectors while partition blocks are counted from the start of the partition.
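Those sums can be sketched in a few lines of Python. This is only an illustration using the sector numbers posted earlier in the thread; the function names are hypothetical, and it assumes 512-byte sectors and the 1 KiB blocks that fdisk prints.

```python
SECTORS_PER_BLOCK = 2  # fdisk "Blocks" are 1 KiB, sectors are 512 bytes


def blocks(start, end):
    """Whole blocks in a partition plus the leftover sector (fdisk prints '+')."""
    sectors = end - start + 1
    return sectors // SECTORS_PER_BLOCK, sectors % SECTORS_PER_BLOCK


def relative_sector(absolute_sector, part_start):
    """Convert an absolute disk sector to a sector offset within a partition."""
    return absolute_sector - part_start


# (partition start, partition end, failing absolute sector) from the thread.
cases = {
    "sda1": (63, 490234751, 490234559),
    "sdb1": (63, 490223474, 490223295),
}

for name, (start, end, bad) in cases.items():
    whole, odd = blocks(start, end)
    rel = relative_sector(bad, start)
    print(f"{name}: {whole}{'+' if odd else ''} blocks, "
          f"error at partition sector {rel} (block {rel // SECTORS_PER_BLOCK}), "
          f"{end - bad} sectors before the partition's last sector")
```

On these figures the failing sector is 192 sectors short of the last sector of sda1 and 179 sectors short of the last sector of sdb1, so both errors fall inside the partitions.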

There are some rare circumstances where the kernel does not reread the partition table correctly after fdisk quits.

A reboot ensures the partition table is read, since it's a part of the startup process, so I like to reboot after making partitions.

----------

## Gilbo

Thank you very much for your help.  I've managed to get the disks to work again (the cable trick, strange as it is), and I've written new partition tables so that all the partitions involved match exactly in sector count.  I don't know if that'll prevent the error from recurring, but hopefully it will.

I've created a new array that is syncing right now.  It'll take about 3.5 days to finish the process, so I'll be back then to report success, or to start all over again then.

----------

