# DMA Trouble - Too many drives?

## mr-simon

Okay, I have a server which had an array of seven IDE drives in it. Three on the on-motherboard IDE controller, and four on an attached siimage controller. It was working fine, all raid5 and stuff. -- I read about the fact that we can now (>2.6.17, and recent mdadm) add drives to a raid5 array and grow it. I couldn't resist!

So, I installed a promise ata controller, and two more drives. (I tried this with a siimage controller too, same results...)

All is going according to plan until it becomes time to reshape the array. This obviously takes a lot of IO bandwidth. I start getting all sorts of nasty DMA-related errors in dmesg. (DriveReady SeekComplete, DMA timeout, other such things -- I don't have network access to the box right now so it's hard to paste here)

This quickly results in getting DMA disabled for one or more disks. Trying to reenable DMA for the disks doesn't last long, and sometimes locks the box completely.

If I restart the box, and try again, the same thing happens. BUT! With different drives each time. It seems pretty random which drives get DMA disabled on them by the kernel.

It's important to note at this point that I am sure that all the drives are okay. I've tested them thoroughly with smartctl, and no errors are reported. It looks very much like a dma/ide controller issue.

If I turn the DMA level down to udma0 (hdparm -X64 -d1 /dev/all_drives) it lasts a fair bit longer before the problem arises, and it tends to be just two drives that start reporting errors, rather than lots all over each other.

Unfortunately, I can't leave it to reconstruct with DMA off. It will take several weeks, a month maybe. (It's a 1.5TB array full of multimedia files, being resized to 2TB -- no, obviously I don't have backups!)

I've tried with the self-configured 2.6.21 gentoo-sources kernel on the box, and I've also tried booting from the latest gentoo livecd, which has 2.6.19... Both behave exactly the same.

I've just started the process again with -X34 (mdma2, instead of udma) set on all the drives, and fingers crossed it's working so far (10 minutes or so) which reduces the reshaping time down to 24 hours, which is an improvement. But, I'd rather not have to set this on all my drives forever more.
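For anyone puzzling over the numbers passed to `hdparm -X` in this thread: as far as the hdparm man page describes it, the value is a base plus the mode number (8 for PIO, 32 for multiword DMA, 64 for UDMA). A minimal sketch of that mapping follows; the helper names `xfer_value` and `xfer_name` are made up for illustration and are not part of hdparm.

```python
# Transfer-mode values for `hdparm -X`, as described in the hdparm man page:
# PIO mode n -> 8 + n, multiword DMA mode n -> 32 + n, UDMA mode n -> 64 + n.
BASES = {"pio": 8, "mdma": 32, "udma": 64}


def xfer_value(mode):
    """Return the numeric -X argument for a mode name such as 'udma5'."""
    for prefix, base in BASES.items():
        number = mode[len(prefix):]
        if mode.startswith(prefix) and number.isdigit():
            return base + int(number)
    raise ValueError("unknown transfer mode: %r" % mode)


def xfer_name(value):
    """Inverse of xfer_value, e.g. 34 -> 'mdma2'."""
    # Check the largest base first so 64+ maps to UDMA, 32+ to MDMA, 8+ to PIO.
    for prefix, base in sorted(BASES.items(), key=lambda kv: -kv[1]):
        if value >= base:
            return "%s%d" % (prefix, value - base)
    raise ValueError("value too small to be a transfer mode: %r" % value)


if __name__ == "__main__":
    for mode in ("udma0", "mdma2", "udma5"):
        print(mode, "->", "-X%d" % xfer_value(mode))
```

So `-X64` is udma0, `-X34` is mdma2, and udma5 would be `-X69`.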

So... Any ideas what might cause this? Do I simply have too many IDE drives in my computer? Controllers? Is there a limit on the bandwidth they can sustain? It *looks* like they're trying to do too much at the same time, and tripping over each other as a result. Maybe there's some PCI-related BIOS tweakage that can help? I'm clutching at straws here.

----------

## NeddySeagoon

mr-simon,

It could be a broken APIC.  Look in /proc/interrupts for entries like

```
            CPU0
  0:         62   IO-APIC-edge      timer
```

which show the APIC in use. To test, reboot with noapic added to the kernel parameters on the kernel line in grub. This will force the use of XT-PIC mode.
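If it helps, the same check can be scripted. This is only a rough sketch: it assumes the column layout shown above (IRQ number, one count per CPU, then the controller name), and the function name `interrupt_controllers` is invented for the example.

```python
def interrupt_controllers(text):
    """Return the set of interrupt controller families seen in /proc/interrupts text."""
    kinds = set()
    for line in text.splitlines():
        fields = line.split()
        # Only numbered IRQ rows carry a controller name; NMI/LOC/ERR rows do not.
        if len(fields) < 3 or not fields[0].rstrip(":").isdigit():
            continue
        # Skip the per-CPU counters; the first non-numeric field is the controller.
        for field in fields[1:]:
            if not field.isdigit():
                if field.startswith("IO-APIC"):
                    kinds.add("IO-APIC")
                elif field.startswith("XT-PIC"):
                    kinds.add("XT-PIC")
                else:
                    kinds.add(field)
                break
    return kinds


SAMPLE = """\
            CPU0
  0:         62   IO-APIC-edge      timer
 18:     171298   IO-APIC-fasteoi   ide0, ide1
NMI:          0
"""

if __name__ == "__main__":
    print(interrupt_controllers(SAMPLE))
```

On a live box you would feed it `open("/proc/interrupts").read()` instead of `SAMPLE`. After a successful `noapic` boot the result should contain only `XT-PIC`.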

----------

## eccerr0r

I have a Promise Ultra66 PATA PCI offboard controller.  It gives me all sorts of different problems when I have more than 1 disk per channel, on certain disks.  I think some disks and some controllers are just designed poorly and can't handle more than one disk per channel, and then I get the dreaded CRC errors and DMA errors.

You might want to check cables as well, make sure they're in good shape.  Might have to swap to different board manufacturers, but having that many DMA channels on the machine should not be a problem at all.

----------

## mr-simon

```
rumplestiltskin simon # cat /proc/interrupts
           CPU0
  0:         63   IO-APIC-edge      timer
  1:         10   IO-APIC-edge      i8042
  9:          0   IO-APIC-fasteoi   acpi
 14:      84495   IO-APIC-edge      ide4
 15:      40627   IO-APIC-edge      ide5
 16:      36948   IO-APIC-fasteoi   eth0
 17:     141597   IO-APIC-fasteoi   eth1
 18:     171298   IO-APIC-fasteoi   ide0, ide1
 19:      85719   IO-APIC-fasteoi   ide3
 21:          0   IO-APIC-fasteoi   VIA8237
NMI:          0
LOC:    9466240
ERR:          0
MIS:          0
```

I don't really know what that means, but booting with noapic didn't seem to make any difference.

Update: I managed to complete the reshaping operation, by setting "hdparm -X34" on all the disks. However, it still seems there are DMA issues. Halfway through the reshape I tried to get hold of a file on there, so I mounted the partition and started scp'ing files off it. This was obviously too much and I hit DMA errors, and the array went down -- two drives apparently "failed". After a reboot, and an mdadm --assemble --force, the reshaping continued and my data is intact.

I've swapped out all the cables (I have loads), and I've tried replacing the promise with a siimage card and back again. I don't have a spare to try replacing the other siimage with though.

I don't have enough controllers (or PCI slots!) to go down the 1-drive-per-channel route.

It's worth noting again that it's not just the new drives that are causing problems. If I don't disable udma (hdparm -X34) on all drives, when I cp a big file the array will most likely fail. When it does, two drives are always listed as having failed ( [U_UUUU_UU] ) -- which drives fail, and which controller they are connected to, are random. If it were a bad cable I'd have guessed it would always be the same drives.
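For reference, in the md status string each `U` is a member that is up and each `_` is a missing or failed one, and raid5 can only run with at most one member missing. A small sketch of reading such a string; `failed_slots` and `raid5_usable` are names made up here, not anything mdadm provides.

```python
def failed_slots(status):
    """Return the positions marked '_' in an md status string like '[U_UUUU_UU]'."""
    flags = status.strip().strip("[]")
    return [i for i, flag in enumerate(flags) if flag == "_"]


def raid5_usable(status):
    """raid5 keeps running with at most one member missing."""
    return len(failed_slots(status)) <= 1


if __name__ == "__main__":
    status = "[U_UUUU_UU]"
    print("failed slots:", failed_slots(status))
    print("raid5 still usable:", raid5_usable(status))
```

With two slots down out of nine, the array is past what raid5 tolerates, which is why it stops until it is force-assembled.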

It's only serving media over a 10mbit connection, so I'm not that worried about the loss of performance by setting -X34, but as it's possible to get DMA errors even with this setting, I'd like to get it cleared up. I don't think all this forced reassembly of my array is doing it any long-term good, in the data integrity department.

----------

## energyman76b

a lot of mainboards have crappy pci (especially everything with nvidia and via chipsets) so .. there is always the chance that you hit one of their many bugs  :Wink: 

but.. have you made sure, that it is not the cables? IDE cables are f*ing fragile, and break easily when bent even a little while installing new drives or cards ...

I had a lot of problems with those cables - one reason I don't touch them anymore after the first installation....

----------

## mr-simon

 *energyman76b wrote:*   

> have you made sure, that it is not the cables?

 

 *mr-simon wrote:*   

> I've swapped out all the cables (I have loads)

 

 :Smile: 

I might try using a PCI IDE controller instead of the on-motherboard one. It's one thing I haven't removed from the equation. I have one free PCI slot left.

----------

## RaceTM

 *energyman76b wrote:*   

> a lot of mainboards have crappy pci (especially everything with nvidia and via chipsets) so .. there is always the chance that you hit one of their many bugs 
> 
> but.. have you made sure, that it is not the cables? IDE cables are f*ing fragile, and break easily when bent even a little while installing new drives or cards ...
> 
> I had a lot of problems with those cables - one reason I don't touch them anymore after the first installation....

 

If you think IDE cables are fragile, I hope you haven't encountered SATA cables yet

 :Very Happy: 

----------

## energyman76b

 *RaceTM wrote:*   

>  *energyman76b wrote:*   a lot of mainboards have crappy pci (especially everything with nvidia and via chipsets) so .. there is always the chance that you hit one of their many bugs 
> 
> but.. have you made sure, that it is not the cables? IDE cables are f*ing fragile, and break easily when bent even a little while installing new drives or cards ...
> 
> I had a lot of problems with those cables - one reason I don't touch them anymore after the first installation....
> ...

 

oh I have.. with one of them, my harddisk is not found, with the other one I had to set 'driving strength' to strong - and they come loose if you just look at them. I hate them already.

----------

## mr-simon

But, to get the thread back on topic ..  :Wink: 

Here is some (brief) example output from dmesg, that I created by booting with noapic and pci=noacpi appended to my kernel, and setting the drives to the fastest possible mode (udma5):

```
hdd: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdd: dma_intr: error=0x10 { SectorIdNotFound }, LBAsect=542239555710, high=32319, low=16711806, sector=264863337
ide: failed opcode was: unknown
hdd: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdd: dma_intr: error=0x10 { SectorIdNotFound }, LBAsect=924491645054, high=55103, low=16711806, sector=264886097
ide: failed opcode was: unknown
hdi: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
hdi: drive_cmd: error=0x04 { DriveStatusError }
ide: failed opcode was: 0xb0
```
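In case the hex values are unfamiliar: the names in braces are just the set bits of the drive's status and error registers spelled out. Here is a rough sketch of that decoding. The status bit names not seen in the log above are filled in from the usual ATA status register layout and should be treated as an assumption, and the error table only covers the two bits that actually appear in the output.

```python
# Status register bits, most significant first. Names follow the labels
# printed in dmesg above; the ones not present in the log are assumed
# from the standard ATA status register layout.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# Error register: only the two bits seen in the log, nothing else is claimed.
ERROR_BITS = [
    (0x10, "SectorIdNotFound"),
    (0x04, "DriveStatusError"),
]


def decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


if __name__ == "__main__":
    print("status=0x51:", " ".join(decode(0x51, STATUS_BITS)))
    print("error=0x10:", " ".join(decode(0x10, ERROR_BITS)))
    print("error=0x04:", " ".join(decode(0x04, ERROR_BITS)))
```

So 0x51 is 0x40 + 0x10 + 0x01, which matches the `DriveReady SeekComplete Error` text the kernel printed.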

Note that:

* It is not a cabling issue. I've swapped them around, and it's different drives every time.

* It's not a specific controller issue. hdd is on the siimage controller, hdi connected to the mainboard. This is just an example. It's not always those drives

* It's not a problem with a specific kernel version. It shows with 2.6.19 (from the livecd), 2.6.14, and 2.6.21

* It's not an apic/acpi issue. I've switched them off.

See:

```
rumplestiltskin simon # cat /proc/interrupts
           CPU0
  0:         53    XT-PIC-XT        timer
  1:       1464    XT-PIC-XT        i8042
  2:          0    XT-PIC-XT        cascade
  4:      23328    XT-PIC-XT        ide0, ide1
  5:          0    XT-PIC-XT        VIA8237
  9:          0    XT-PIC-XT        acpi
 10:      10716    XT-PIC-XT        ide3
 11:       3729    XT-PIC-XT        eth1
 12:     154813    XT-PIC-XT        eth0
 14:      11538    XT-PIC-XT        ide4
 15:       6052    XT-PIC-XT        ide5
NMI:        653
LOC:     308447
ERR:         96
MIS:          0
```

Yes, I'm sharing irq4. No, switching acpi back on doesn't help.

Switching to mdma2 is the only thing that seems to help, but it's not perfect.  :Sad:  An occasional error still appears, and my system isn't exactly speedy.

Maybe the following information will help?

```
rumplestiltskin .var # lspci
00:00.0 Host bridge: VIA Technologies, Inc. VT8377 [KT400/KT600 AGP] Host Bridge (rev 80)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8237 PCI Bridge
00:0a.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
00:0c.0 RAID bus controller: Silicon Image, Inc. PCI0680 Ultra ATA-133 Host Controller (rev 02)
00:0e.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02)
00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)
00:0f.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8237 ISA bridge [KT600/K8T800/K8T890 South]
00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237 AC97 Audio Controller (rev 60)
00:11.6 Communication controller: VIA Technologies, Inc. AC'97 Modem Controller(rev 80)
00:13.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 43)
01:00.0 VGA compatible controller: nVidia Corporation NV5M64 [RIVA TNT2 Model 64/Model 64 Pro] (rev 15)
```

I switched off all unnecessary hardware in the PC (USB, serial/parallel ports, for example) to free up IRQs, to help with trying it out with apic disabled.

Ho, hum. The only other thing I can think of to try is what I mentioned before, unplugging the drives from the mobo and using another PCI controller. I had a brief go at this, but had trouble making it boot. I'll give it another go later I guess, but I'm not convinced it will help.

----------

## NeddySeagoon

mr-simon,

Your /proc/interrupts shows that your apic is active. When you booted with noapic the text IO-APIC should have changed.

How many wires are there in your IDE ribbon cables ?

40 or 80 ?

You must use 80 wire ribbons for UDMA modes.

If you put a 40 wire and 80 wire ribbon side by side, you will clearly see the difference in wire pitch - the ribbons themselves are the same overall width. If you are not sure (because you only have one sort) compare with a floppy cable. They are the same wire pitch as 40 wire IDE ribbons.

----------

## mr-simon

 *NeddySeagoon wrote:*   

> Your /proc/interrupts shows that your apic is active. When you booted with noapic the text IO-APIC should have changed.

 

Please see my more recent post. It says XT-PIC-XT. The other, earlier 'cat /proc/interrupts' was taken before I booted with noapic. Sorry for the confusion.

 *NeddySeagoon wrote:*   

> How many wires are there in your IDE ribbon cables ?
> 
> 40 or 80 ?

 

They're 80-wire cables.

----------

## eccerr0r

Are all these disks using the same power supply? How many watts is it?  Is it a good quality supply or a no-name?

Can you try with more than 1 power supply and see if it will still fail? (use the short green to ground "trick" to power up the second p/s, google for details)

----------

## NeddySeagoon

mr-simon,

I just picked up on the block numbers ..

```
LBAsect=542239555710
LBAsect=924491645054
```

Block 542,239,555,710 is at byte 277,626,652,523,520 (with 512-byte sectors), or about 277TB; block 924,491,645,054 is even further down the drive.

It's unlikely your disk is that big.

From that I conclude that commands and data are being scrambled between the controller and the drive.
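The arithmetic is easy to check. The sketch below assumes 512-byte sectors, and it also shows that the `LBAsect` values in the log are exactly `high` shifted left by 24 bits plus `low`. That relation is read off the two log lines, not taken from kernel documentation, and the function names are invented for the example.

```python
SECTOR_BYTES = 512  # assumption: classic 512-byte sectors


def combine(high, low):
    """Rebuild LBAsect from the high/low fields printed on the same log line."""
    return (high << 24) | low


def byte_offset(lba):
    """Byte position on the disk of a given sector number."""
    return lba * SECTOR_BYTES


if __name__ == "__main__":
    for high, low in ((32319, 16711806), (55103, 16711806)):
        lba = combine(high, low)
        offset = byte_offset(lba)
        print("LBAsect=%d -> byte %d (about %.0f TB)" % (lba, offset, offset / 1e12))
    # For comparison, the sector= field from the first log line:
    print("sector=264863337 -> about %.0f GB" % (byte_offset(264863337) / 1e9))
```

The two `LBAsect` values come out at roughly 278 TB and 473 TB. The `sector=` field from the same line, for comparison, lands at about 136 GB.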

Are your 80 wire ribbon cables fitted the right way round ?

The two ends are different (electrically).

If you have only one drive on a cable, is it at the end, not in the middle ?

----------

