# Strange lockup [amd74xx ide resume SOLVED]

## pgolik

I finally got suspend to ram working on my desktop system. It's an amd64 with nForce3 MB, nvidia GF5200 with the binary driver, kernel 2.6.16-r9 (stable gentoo sources) and Xorg 7.0. I use the hibernate script with a pretty default configuration for suspending to RAM (only commented out nvidia in blacklisted-modules). 

It suspends perfectly and wakes up correctly most of the time, with video, X and all that, but sometimes it locks up hard about 3 minutes after resuming. Immediately after resuming it works normally and lets me open programs, and then after a few minutes it locks up. The display freezes (but isn't garbled in any way), the CapsLock and NumLock LEDs start flashing, the keyboard and mouse become totally unresponsive (no magic SysRq) and the HDD LED stays on (but I can't hear the disk spin). I can't ssh into the machine and it doesn't even answer pings, so it looks like a kernel-level lockup. It usually happens after an overnight suspend, not when I wake it up after only a few minutes. It may coincide with cron starting some disk activity.

The only unusual thing I found in the system logs (but only once) was this line:

```
Jul  4 11:40:25 [kernel] hda: dma_timer_expiry: dma status == 0x21
```

My primary Linux drive is on SATA (/dev/sda), hda is a PATA drive I use for windows and some secondary storage on FAT32.

I realize this isn't enough information, but I don't even know where to look. Any ideas?

Last edited by pgolik on Fri Nov 10, 2006 8:41 pm; edited 4 times in total

----------

## Raffi

Unless I misread what you wrote, I think you answered your own question. The nvidia module was probably blacklisted for a good reason.

Is it possible for you to try a different supported card and see if it still locks up?

----------

## pgolik

Many users report that the newest binary drivers from nvidia support hibernation (I'm using the 8762 release). They were blacklisted as earlier driver releases didn't work. Meanwhile I enabled SMART checking on all my drives, but they report perfect health. 

Perhaps it's the new modular Xorg that locks up like that, but it's still strange that it happens not immediately after wakeup but after a few minutes of seemingly normal operation.

----------

## Raffi

You might try the following experiment. Reboot your machine so that X will not come up (this will also keep the nvidia stuff from loading even once). Try to hibernate and restore. Try everything except X stuff and see if you stay up. If so, start X and see if you stay up.

This way you should be able to either eliminate X as the problem or prove that it is probably X related.

----------

## pgolik

It restores correctly, X and all, about 9 out of 10 tries. Another "interesting" development: this time it resumed, albeit slowly, and spewed lots of complaints about not being able to write files. All the filesystems were read-only after resuming! So I'm leaning towards thinking that it's some problem with the disk subsystem. Most of the time it wakes up correctly, sometimes it brings back all filesystems as read-only, and sometimes it locks up hard.

I also found this in the logs:

```
Jul  4 18:42:29 [kernel] ata1: error=0x04 { DriveStatusError }
Jul  4 18:42:29 [kernel] ata1: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
Jul  4 18:42:29 [kernel] ata1: status=0x51 { DriveReady SeekComplete Error }
Jul  4 18:42:29 [kernel] ata1: error=0x04 { DriveStatusError }
```

repeated. I'm beginning to think that SATA and suspend don't mix well on my hardware.
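As a reading aid for the `status=0x..` values in these logs, here is a minimal sketch that decodes an ATA status byte into the bit labels the kernel prints. The table and the function name `decode_status` are illustrative and not taken from the kernel source; the bit labels mirror the ones visible in the log lines above.

```python
# Sketch: decode an ATA status register byte into the labels seen in the logs.
# The mapping follows the standard ATA status register layout; the names are
# the ones the 2.6-era kernel prints (e.g. "DriveReady SeekComplete Error").
ATA_STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}


def decode_status(value):
    """Return the names of all status bits set in value, highest bit first."""
    return [name for bit, name in ATA_STATUS_BITS.items() if value & bit]


if __name__ == "__main__":
    # 0x51 is the value reported for ata1 in the log above.
    print(hex(0x51), decode_status(0x51))
    # 0x80 means the drive is still busy and has not answered at all.
    print(hex(0x80), decode_status(0x80))
```

A status of `0x51` therefore means the drive responded and flagged an error, which is a different situation from a drive that stays at `0x80` (busy) and never responds.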

  Pawel

----------

## Raffi

I agree with your assessment, but don't know what to suggest.

----------

## pgolik

Today it happened hours after resuming, so probably it's not even related to suspend. It seems to happen when cron starts some disk-intensive task (like updatedb). And I got another

```
[kernel] hda: dma_timer_expiry: dma status == 0x21
```

message in the log seconds before the lockup. The strange thing is that I haven't changed anything in the hardware recently (disk, chipset, cables) and it's been stable for a year. I'll try recompiling the kernel. Any suggestions as to where to look? (No, I don't want to disable DMA.)
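For reference, the `dma status == 0x21` value can be decoded under the assumption that it is the standard bus-master IDE status register (the layout most PCI IDE controllers share). The following is a rough sketch; the bit names and the function `decode_bm_status` are made up for illustration.

```python
# Sketch: decode a bus-master IDE status byte, assuming the common
# PCI bus-master IDE register layout. Bits 3 and 4 are reserved.
BM_STATUS_BITS = {
    0x01: "dma_active",          # a DMA transfer is still in progress
    0x02: "dma_error",           # the controller flagged a transfer error
    0x04: "interrupt",           # the drive raised its interrupt
    0x20: "drive0_dma_capable",  # master drive configured for DMA
    0x40: "drive1_dma_capable",  # slave drive configured for DMA
    0x80: "simplex_only",        # only one channel may do DMA at a time
}


def decode_bm_status(value):
    """Return the names of all bus-master status bits set in value."""
    return [name for bit, name in BM_STATUS_BITS.items() if value & bit]


if __name__ == "__main__":
    print(hex(0x21), decode_bm_status(0x21))
```

Read this way, `0x21` says the transfer was still marked active with neither the error nor the interrupt bit set, which would fit a drive that simply never finished the request. That reading is an interpretation, not something the log itself confirms.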

----------

## Raffi

I agree you don't want to disable dma.

Are you running smartd? The drive could be failing.

----------

## pgolik

 *Raffi wrote:*   

> Are you running smartd? 

 

I am. No problems reported. Someone on suspend2 list suggests re-applying hdparm settings on each resume. I added RestartServices hdparm to the hibernate script. Will see if it helps.

----------

## pgolik

Tried restarting hdparm upon resume; it didn't help. I added the DisableWriteCacheOn option for all my drives. Perhaps the lockups are less frequent, but they still happen. Here is a log from the last one:

```
Jul  8 22:55:33 [kernel] ATA: abnormal status 0x80 on port 0x9F7
                - Last output repeated twice -
Jul  8 22:55:33 [kernel] Restarting tasks... done
Jul  8 22:55:33 [kernel] input: PS2++ Logitech MX Mouse as /class/input/input3
Jul  8 22:55:33 [kernel] ata1: command 0x35 timeout, stat 0x80 host_stat 0x1
Jul  8 22:55:33 [kernel] ata1: translated ATA stat/err 0x80/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Jul  8 22:55:33 [kernel] ata1: status=0x80 { Busy }
Jul  8 22:55:33 [kernel] sd 0:0:0:0: SCSI error: return code = 0x8000002
Jul  8 22:55:33 [kernel] sda: Current: sense key=0xb
Jul  8 22:55:33 [kernel]     ASC=0x47 ASCQ=0x0
Jul  8 22:55:33 [kernel] end_request: I/O error, dev sda, sector 498143
Jul  8 22:55:33 [kernel] Buffer I/O error on device sda2, logical block 16
Jul  8 22:55:33 [kernel] lost page write due to I/O error on sda2
Jul  8 22:55:33 [kernel] ATA: abnormal status 0x80 on port 0x9F7
                - Last output repeated 2 times -
Jul  8 22:55:33 [kernel] nv_sata: Primary device added
Jul  8 22:55:33 [kernel] nv_sata: Primary device removed
Jul  8 22:55:33 [kernel] nv_sata: Secondary device removed
```

So it appears that both SATA and PATA sometimes fail to resume correctly. I have no idea what it depends on, though. They resume correctly most of the time. I saw some posts on the LKML list describing a similar problem, so at least I know I'm not alone.
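To see at a glance which devices are involved, a log like the one above can be tallied per device. The following is only a sketch: the keyword list, the device-name pattern and the function `count_errors` are assumptions chosen to match the messages quoted in this thread, not a general syslog parser.

```python
import re
from collections import Counter

# Keywords that mark the problem lines quoted in this thread (an assumption,
# not an exhaustive list of kernel disk errors).
PROBLEM_RE = re.compile(r"error|timeout|dma_timer_expiry", re.IGNORECASE)

# Device names as they appear here: hda/hdc (IDE), sda (SCSI/SATA disk),
# ata1 (libata port). Partition numbers such as the "2" in sda2 are ignored.
DEVICE_RE = re.compile(r"\b(hd[a-z]|sd[a-z]|ata\d+)")


def count_errors(lines):
    """Count problem lines per device name."""
    counts = Counter()
    for line in lines:
        if not PROBLEM_RE.search(line):
            continue
        match = DEVICE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Jul  4 11:40:25 [kernel] hda: dma_timer_expiry: dma status == 0x21",
        "Jul  8 22:55:33 [kernel] Restarting tasks... done",
        "Jul  8 22:55:33 [kernel] ata1: command 0x35 timeout, stat 0x80 host_stat 0x1",
        "Jul  8 22:55:33 [kernel] end_request: I/O error, dev sda, sector 498143",
        "Jul  8 22:55:33 [kernel] Buffer I/O error on device sda2, logical block 16",
    ]
    for device, count in sorted(count_errors(sample).items()):
        print(device, count)
```

In practice the lines would come from the system log file rather than a hard-coded list; the sample here just reuses messages already quoted in this thread.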

[Edit]

It appears the problem is known to the kernel hackers on LKML and is still present as of the 2.6.17 kernel release: if something tries to access the disk immediately upon resume, it may time out and give errors like the ones I observed. Several patches exist, but none that deserve integrating into the kernel yet. One person reported that a patch from Andrew Morton's mm kernel is supposed to solve it. I found a libata_resume_fix patch, and it applies cleanly to the current stable gentoo-sources 2.6.16-r12. I will try it for a couple of days and keep y'all updated.

----------

## devsk

Please make sure you report whether that works. I am seeing similar issues with both ATA and SATA drivers after resume.

----------

## devsk

 *pgolik wrote:*   

> Today it happened hours after resuming, so probably it's not even related to suspend. It seems to happen when cron starts some disk-intensive task (like updatedb). And I got another
> 
> ```
> [kernel] hda: dma_timer_expiry: dma status == 0x21
> ```
> ...

This can be worked around by setting the correct UDMA mode just after resuming and just before mounting the device. If this is your root device, you are out of luck.
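A minimal sketch of that workaround, assuming `hdparm` is installed and the script runs as root from a resume hook. The device name and UDMA mode in the usage line are placeholders, not values from this thread, and the helper names are hypothetical. `hdparm -d1` enables DMA, and `-X` takes the transfer mode, where UDMA mode n is written as 64 + n.

```python
import subprocess


def hdparm_resume_command(device, udma_mode):
    """Build the hdparm call that re-enables DMA and sets a UDMA mode."""
    if not 0 <= udma_mode <= 6:
        raise ValueError("UDMA mode must be between 0 and 6")
    # -d1 turns DMA on; -X<64+n> selects UDMA mode n.
    return ["hdparm", "-d1", f"-X{64 + udma_mode}", device]


def reapply_dma(device, udma_mode, dry_run=True):
    """Print the command, and run it only when dry_run is False."""
    cmd = hdparm_resume_command(device, udma_mode)
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Placeholder values: adjust the device and mode to your own hardware.
    reapply_dma("/dev/hda", 5, dry_run=True)
```

The dry-run default is deliberate: forcing a transfer mode the drive or cable does not support can itself cause I/O errors, so it is worth checking the printed command first.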

----------

## pgolik

It works - sort of. I'm not getting the hard lockups as before, but often after resuming I can't use my DVDR drive - k3b just sits doing nothing and I get a message about DMA timeout on device in the logs. Restarting the hdparm service (which sets dma again) works, but not all the time (sometimes it locks up hard trying to re-set DMA mode).

Now I'm trying the libata_resume patch together with the ide.c patch attached to Bug 2039 in the kernel. Will report after a couple of tries - due to the somewhat random character of the lockup I cannot fully confirm that the patch worked until I've used it for at least a couple of days. 

So far it seems that the libata_resume patch helped with the problem with my SATA drive (which is my root device), but the problem with the ATA devices (secondary HD and DVDR) remains. 

I can confirm that neither patch had any negative effect on my system, so why not try them yourself - two testcases are better than one.

I've also heard that there is ongoing work in the kernel community to address power management issues with (S)ATA in the 2.6.18 release.

----------

## pgolik

Update: the SATA drive (which is the root device) resumes fine. All the problems I've reported are related to PATA (IDE) devices (secondary HD and DVDR), and these are not solved by any of the patches or solutions I've tried. The lockups are caused by cron starting updatedb on the secondary HD (I've disabled it for the time being). Restarting hdparm does reset the DMA flag on the DVDR, but it does not prevent the timeout errors. Here's a relevant log; the message about the ATAPI reset appeared after I reset DMA with hdparm.

```
Jul 15 00:40:17 [kernel] hdc: DMA disabled
Jul 15 00:40:17 [kernel] hdc: ide_intr: huh? expected NULL handler on exit
Jul 15 00:40:17 [kernel] hdc: ATAPI reset complete
Jul 15 00:41:53 [kernel] hdc: cdrom_decode_status: status=0x51 { DriveReady SeekComplete Error }
Jul 15 00:41:53 [kernel] hdc: cdrom_decode_status: error=0x44 { AbortedCommand LastFailedSense=0x04 }
Jul 15 00:41:53 [kernel] ide: failed opcode was: unknown
Jul 15 00:41:59 [kernel] hdc: cdrom_decode_status: status=0x51 { DriveReady SeekComplete Error }
Jul 15 00:41:59 [kernel] hdc: cdrom_decode_status: error=0x44 { AbortedCommand LastFailedSense=0x04 }
Jul 15 00:41:59 [kernel] ide: failed opcode was: unknown
Jul 15 00:42:06 [kernel] hdc: cdrom_decode_status: status=0x51 { DriveReady SeekComplete Error }
Jul 15 00:42:06 [kernel] hdc: cdrom_decode_status: error=0x44 { AbortedCommand LastFailedSense=0x04 }
Jul 15 00:42:06 [kernel] ide: failed opcode was: unknown
Jul 15 00:42:12 [kernel] hdc: cdrom_decode_status: status=0x51 { DriveReady SeekComplete Error }
Jul 15 00:42:12 [kernel] hdc: cdrom_decode_status: error=0x44 { AbortedCommand LastFailedSense=0x04 }
Jul 15 00:42:12 [kernel] ide: failed opcode was: unknown
Jul 15 00:42:12 [kernel] hdc: DMA disabled
Jul 15 00:42:12 [kernel] hdc: ide_intr: huh? expected NULL handler on exit
Jul 15 00:42:12 [kernel] hdc: ATAPI reset complete
Jul 15 00:42:12 [kernel] ISO 9660 Extensions: Microsoft Joliet Level 3
Jul 15 00:42:12 [kernel] ISO 9660 Extensions: RRIP_1991A
Jul 15 00:42:42 [kernel] hdc: tray open
Jul 15 00:42:42 [kernel] end_request: I/O error, dev hdc, sector 64
Jul 15 00:42:42 [kernel] Buffer I/O error on device hdc, logical block 8
Jul 15 00:42:42 [kernel] hdc: tray open
Jul 15 00:42:42 [kernel] end_request: I/O error, dev hdc, sector 64
Jul 15 00:42:42 [kernel] Buffer I/O error on device hdc, logical block 8
Jul 15 00:42:42 [kernel] hdc: tray open
Jul 15 00:42:42 [kernel] end_request: I/O error, dev hdc, sector 64
Jul 15 00:42:42 [kernel] Buffer I/O error on device hdc, logical block 8
Jul 15 00:42:42 [kernel] hdc: tray open
Jul 15 00:42:42 [kernel] end_request: I/O error, dev hdc, sector 64
Jul 15 00:42:42 [kernel] Buffer I/O error on device hdc, logical block 8
```

To summarize: IDE (but not SATA) devices do not resume correctly, and none of the solutions I've found works. As my root and boot partitions are on SATA, I'll try to build all of the IDE subsystem as modules and unload it before hibernating. That's not a solution, just a workaround (if it works at all).

----------

## pgolik

Finally there is a fix. This patch fixes the IDE resume problem on my hardware and I have a fully functional suspend to RAM. So far no negative side effects of the patch. The patch was made for 2.6.18 but it patches gentoo-sources-2.6.17 without any problems.

----------

