# DMA disabled after heavy disk load

## Kreuzader

I have an old nforce2 motherboard and a 120GB IDE drive; DMA gets enabled properly on boot by the kernel (2.6.15; confirmed with hdparm), but after a few days, especially during heavy load, DMA gets disabled:

Feb 18 11:39:39 [kernel] hda: DMA disabled

Feb 18 11:39:39 [kernel] hda: drive not ready for command

Feb 18 11:39:39 [kernel] ide0: reset: success

Moreover, if I try to manually re-enable DMA via hdparm -d1, the system immediately locks up. Once rebooted, the system returns to normal. Has anyone ever seen behavior similar to this? Is it indicative of a dying HDD?

Thanks!
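
For anyone wanting to track how often this happens, here is a rough sketch that counts the drops per drive. It assumes the syslog line format shown in the excerpt above; `count_dma_drops` is a made-up helper name, and the log path depends on which logger you run.

```shell
#!/bin/sh
# Count "DMA disabled" kernel messages per drive in a syslog-style file.
# Assumes lines shaped like the ones in this thread:
#   Feb 18 11:39:39 [kernel] hda: DMA disabled
count_dma_drops() {
    # $1 = path to the log file (hypothetical; depends on your logger)
    awk '/\[kernel\] hd[a-z]: DMA disabled/ {
             drive = $5
             sub(/:$/, "", drive)   # strip the trailing colon from "hda:"
             n[drive]++
         }
         END { for (d in n) print d, n[d] }' "$1" | sort
}
```

Called as `count_dma_drops /var/log/messages` (path is an assumption), it prints one line per drive, for example `hda 2`.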

----------

## NeddySeagoon

Kreuzader,

Do you have unmaskirq on? I think that's the name.

It will be shown by 

```
hdparm /dev/hda
```

If so, turn it off (hdparm -u0 /dev/hda). It allows disk IRQs to be interrupted and can cause DMA timeouts.

```
emerge smartmontools
```

and read the drive's internal error log (smartctl -l error /dev/hda).

----------

## Kreuzader

It was set to on, but disabling it hasn't helped yet unfortunately.

 *Quote:*   

> Feb 21 23:39:42 [kernel] hda: DMA disabled
> 
> Feb 21 23:39:42 [kernel] hda: drive not ready for command
> 
> Feb 21 23:39:42 [kernel] ide0: reset: success

 

Moreover, smartmontools hasn't logged any errors yet on that drive either  :Sad: 

 *Quote:*   

> SMART Error Log Version: 1
> 
> No Errors Logged
> 
> 

 

If anyone has any other suggestions, I'm open to them. I've even tried replacing the IDE cable.

It wouldn't be that annoying if I could re-enable DMA with hdparm without the system locking up almost immediately.

----------

## bollucks

Try setting a lower UDMA setting before it fails. Perhaps the timing is borderline.

----------

## mbar

This also happens under Windows. If the HDD controller detects too many CRC errors during transfer (I think), it turns off DMA mode and falls back to PIO. It may be caused by a defective HDD cable, a cable that is too long, poor-quality hardware (mobo/HDD/cable), a bug in the HDD firmware, etc. The HDD may also be failing.

Try another HDD cable, and this time choose one that is shorter than the one you have now.

EDIT: Also, check the power cable, try to use other molex  :Smile: 

----------

## NeddySeagoon

Kreuzader,

Do you have an 80 conductor IDE cable with the spare connector (if there is one) in the middle of the cable?

Both are essential for high data rates. As another poster suggested, shorter cables are more reliable than long ones.

Also is the cable the right way round?

One conductor will have about 10mm missing at one end - that connector must go to the primary drive and not the motherboard.

These IDE cables are often reversible.

----------

## Kreuzader

 *bollucks wrote:*   

> Try setting a lower UDMA setting before it fails. Perhaps the timing is borderline.

 

Still happens with UDMA4 or 3 unfortunately, although it seems rarer on those settings. This did help me another way though - I had been trying to re-enable DMA with hdparm -d1 /dev/hda, which was locking up the machine after a few seconds. However, explicitly setting the DMA rate with -X works fine - is this expected behavior?
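
For reference, a small sketch of how the `-X` values map to UDMA modes. hdparm encodes UltraDMA mode N as 64 + N (so UDMA2 is `-X66`). The helper name `udma_cmd` is made up for illustration, and it only prints the command rather than running it, since changing transfer modes on a live drive is risky.

```shell
#!/bin/sh
# Print (not run) the hdparm command that selects UltraDMA mode N.
# hdparm encodes UltraDMA mode N as -X value 64 + N, e.g. UDMA2 -> -X66.
udma_cmd() {
    mode=$1
    dev=${2:-/dev/hda}   # default device taken from this thread
    case $mode in
        [0-6]) ;;
        *) echo "UDMA mode must be 0-6" >&2; return 1 ;;
    esac
    echo "hdparm -X$((64 + mode)) $dev"
}
```

`udma_cmd 4` prints `hdparm -X68 /dev/hda`, which can then be run as root once you are happy with it.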

 *mbar wrote:*   

> This also happens under Windows. If the HDD controller detects too many CRC errors during transfer (I think), it turns off DMA mode and falls back to PIO. It may be caused by a defective HDD cable, a cable that is too long, poor-quality hardware (mobo/HDD/cable), a bug in the HDD firmware, etc. The HDD may also be failing.
> 
> Try another HDD cable, and this time choose one that is shorter than the one you have now.
> 
> EDIT: Also, check the power cable, try to use other molex 

 

I did think it was the IDE cable at first, and replacing it seemed to cut down on the instances (every few days instead of every 20 hours or so).

 *NeddySeagoon wrote:*   

> Kreuzader,
> 
> Do you have an 80 conductor IDE cable with the spare connector (if there is one) in the middle of the cable?
> 
> Both are essential for high data rates. As another poster suggested, shorter cables are more reliable than long ones.
> ...

 

Yep, the cable is correctly (and securely) attached to the motherboard and drives (there's a slave DVD-ROM).

Now that I can re-enable DMA safely when the kernel kills it, this problem isn't as annoying I guess - I'll try living with it until I build a new machine.

Thanks everyone for the suggestions!

----------

## bollucks

 *Kreuzader wrote:*   

>  *bollucks wrote:*   Try setting a lower UDMA setting before it fails. Perhaps the timing is borderline. 
> 
> Still happens with UDMA4 or 3 unfortunately, although it seems rarer on those settings. This did help me another way though - I had been trying to re-enable DMA with hdparm -d1 /dev/hda, which was locking up the machine after a few seconds. However, explicitly setting the DMA rate with -X works fine - is this expected behavior?

 

Not really...

I assume you tried UDMA2 as well? For most workloads it's actually unlikely you'll notice a difference.

Is there anything else unusual about your hardware? cpu speed running ok or clocked differently? Voltage variation? Heat problems? Magnetic interference from somewhere? Interference from aliens encircling the rings of Saturn?

----------

## truekaiser

Is the chipset running hot?

----------

## NeddySeagoon

Kreuzader, 

The CD/DVD drive should not be slave to a UDMA drive at all.

It's not capable of UDMA speeds and can cause the entire IDE interface to run at the CD's maximum speed.

It may even be the cause of your increased error rate, which is the root cause of your DMA dropping out.

----------

## Kreuzader

 *bollucks wrote:*   

> I assume you tried UDMA2 as well? For most workloads it's actually unlikely you'll notice a difference.
> 
> Is there anything else unusual about your hardware? cpu speed running ok or clocked differently? Voltage variation? Heat problems? Magnetic interference from somewhere? Interference from aliens encircling the rings of Saturn?

 

I went back to UDMA5 after noticing that I could switch DMA back on safely, so I never tried anything below UDMA3. No overclocking here, although the kernel has logged spurious IRQ interrupts every so often, which I understand can be caused by extraneous system noise, no?

 *truekaiser wrote:*   

> is the chipset running hot?

 

Nope  :Sad: 

 *NeddySeagoon wrote:*   

> Kreuzader, 
> 
> The CD/DVD drive should not be slave to a UDMA drive at all.
> 
> It's not capable of UDMA speeds and can cause the entire IDE interface to run at the CD's maximum speed.
> ...

 

That I didn't know - I built this box back in 2003 or so, and didn't start having this problem until last November, so I was racking my brain to figure out what had changed. I'll try using a separate cable for the DVD drive.

----------

## bollucks

 *Kreuzader wrote:*   

>  *bollucks wrote:*   I assume you tried UDMA2 as well? For most workloads it's actually unlikely you'll notice a difference.
> 
> Is there anything else unusual about your hardware? cpu speed running ok or clocked differently? Voltage variation? Heat problems? Magnetic interference from somewhere? Interference from aliens encircling the rings of Saturn? 
> 
> I went back to UDMA5 after noticing that I could switch DMA back on safely, so I never tried anything below UDMA3. No overclocking here, although the kernel has logged spurious IRQ interrupts every so often, which I understand can be caused by extraneous system noise, no?
> ...

 

No, the spurious IRQ interrupts are probably the key here. They should not happen and often cause things to go haywire. It means a hardware interrupt was generated somewhere in your machine, was seen by the CPU, and the kernel had no idea what to do with it. This is possibly an interrupt from your IDE driver that is being lost somehow. Are you using ACPI or APIC? Recent hardware works best with both enabled.

----------

## Kreuzader

 *NeddySeagoon wrote:*   

> Kreuzader, 
> 
> The CD/DVD drive should not be slave to a UDMA drive at all.
> 
> It's not capable of UDMA speeds and can cause the entire IDE interface to run at the CD's maximum speed.
> ...

 

I've put the DVD drive on the other IDE channel and still get the behavior unfortunately  :Sad: 

 *bollucks wrote:*   

> No, the spurious IRQ interrupts are probably the key here. They should not happen and often cause things to go haywire. It means a hardware interrupt was generated somewhere in your machine, was seen by the CPU, and the kernel had no idea what to do with it. This is possibly an interrupt from your IDE driver that is being lost somehow. Are you using ACPI or APIC? Recent hardware works best with both enabled.

 

ACPI was not enabled as there were problems with power management in the kernel and the A7N8X when I built this box 3 years ago - I've enabled it now and still get the lockups/IDE resets.

 *Quote:*   

> # ACPI (Advanced Configuration and Power Interface) Support
> 
> #
> 
> CONFIG_ACPI=y
> ...

 

Another data point: the IDE channels will always reset after a cold boot in the middle of the booting process during the partition check - locking up the machine. After a warm reboot (or two), it can boot as normal, but I'm wondering if it's related to environmental factors (i.e., spinning up from a powered off state helps trigger it).

----------

## Kreuzader

Since I'm able to repro the issue from a cold boot consistently, I decided to try and reproduce it after the machine/HDD had been idle for a while. Sure enough, if I let it sit at the command prompt overnight, it'll lock up in the middle of launching X.

Has anyone ever heard of an issue with an IDE channel being reset leading to machine hang after letting the HDD idle for a good while?

edit: went through disk checks again, still no filesystem problems (reiserfsck saw nothing) and smartmontools has no complaints:

```
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   222   151   021    Pre-fail  Always       -       1933
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       127
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       22637
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       124
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
```
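
A note on reading that table: SMART flags an attribute as failing when the normalized VALUE drops to or below THRESH, while RAW_VALUE is vendor-specific and not directly comparable between drives. Below is a rough sketch that prints how much headroom each attribute has; `smart_headroom` is a made-up helper name, and it assumes the `smartctl -A` column layout shown above.

```shell
#!/bin/sh
# Read `smartctl -A` style rows on stdin and print, per attribute, how far the
# normalized VALUE sits above THRESH. Attributes with THRESH 0 are skipped,
# since they can never trip.
smart_headroom() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
             value = $4 + 0     # normalized value, e.g. "069" -> 69
             thresh = $6 + 0
             if (thresh > 0) print $2, value - thresh
         }'
}
```

Something like `smartctl -A /dev/hda | smart_headroom` would feed it. For the table above, Spin_Up_Time works out to 222 - 21 = 201, so that attribute is nowhere near its threshold.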

----------

## Etherealflaim

 *Kreuzader wrote:*   

> edit: went through disk checks again, still no filesystem problems (reiserfsck saw nothing) and smartmontools has no complaints:
> 
> ```
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> 
> ...
> ```

Those two lines seem worrisome. I don't remember offhand how the Spin_Up_Time THRESH relates to the RAW_VALUE, but that seems a bit high. For me, it's 127.

----------

## Kreuzader

Thanks - I'll keep an eye on disk behavior in case the drive starts to fail in more obvious ways.

The problem seems to have been alleviated by removing ACPI support from the kernel, disabling APIC when it loads on boot, and passing the pci=routeirq argument to enable legacy interrupt behavior. I can no longer reproduce the issue with the previous methods, although I have had one lockup that seemed to happen randomly (i.e., no obvious process trigger).
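
A sketch of what those boot arguments might look like in a GRUB legacy `grub.conf` entry, with the kernel itself built without CONFIG_ACPI. Only `pci=routeirq` comes from the description above. The title, kernel image name, and root partition are placeholders, and `noapic` is an assumption about which APIC was switched off (`nolapic` would disable the local APIC instead).

```
# /boot/grub/grub.conf (excerpt; title, kernel image and root device are placeholders)
title Gentoo Linux (no APIC, routed IRQs)
root (hd0,0)
kernel /boot/kernel-2.6.15 root=/dev/hda3 noapic pci=routeirq
```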

----------

