# Irregular complete system freeze

## Chopstix

I've been experiencing this issue probably ever since I got this high-end laptop, and certainly for the last two years.

Every now and then, especially when running something that uses the GPU, such as Xonotic (a game) or Autopano Giga (a panorama stitcher like Hugin that uses the GPU for preview and rendering), the image on screen freezes along with the audio. Either it stays frozen (rebooting via the Magic SysRq key works), or, if I Alt+Tab, the system sometimes (but usually not) comes back to life after a few seconds, only to freeze completely some 20 seconds later, so I use those few seconds to reboot cleanly. Other times, instead of the image freezing, the whole screen turns grey and flickers or flashes in an alarming way, reminiscent of an epileptic seizure. This is accompanied by two lights above my keyboard flashing (they have padlock signs and I have no idea what they mean; I will try to get a picture of them next time it happens). In that case the Magic SysRq keys don't work and the only option is to power off.

It has also happened, though rarely, while I was just doing something in KDE4, though I guess that was using the GPU for compositing. It happens very often when I use 3D programs or games. I could say the system has a half-life of an hour: half the time the freeze happens within the first hour.

My GPU is an nVidia GTX 285M using proprietary nvidia-drivers, currently 331.20 but this has been happening since 2xx.

Yesterday I tried using my tablet to SSH into my machine and monitor some /var/log files while playing, hoping to see some relevant messages appear. Very unusually, I must have played for 2 hours and nothing happened.

Tried again today without the tablet, happened after half an hour. I got these in /var/log/messages:

```
Nov 10 13:40:01 overkill cron[22043]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons)
Nov 10 13:46:14 overkill kernel: [10973.436884] perf samples too long (5105 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
Nov 10 13:50:01 overkill cron[22065]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons)
Nov 10 13:51:13 overkill kernel: [11272.575627] [sched_delayed] sched: RT throttling activated
Nov 10 13:51:19 overkill kernel: [11278.583775] ehci-pci 0000:00:1d.0: iso underrun ffff8800c07f9a00 (7316+56 < 14617)
Nov 10 13:51:19 overkill kernel: [11278.583782] delay: estimated 384, actual 0
Nov 10 13:51:19 overkill kernel: [11278.583789] ehci-pci 0000:00:1d.0: iso underrun ffff8800c07fa000 (7380+56 < 14618)
Nov 10 13:51:19 overkill kernel: [11278.583793] delay: estimated 384, actual 96
(...)
Nov 10 13:51:19 overkill kernel: [11278.584055] ehci-pci 0000:00:1d.0: iso underrun ffff8800c07fa000 (724+56 < 6428)
Nov 10 13:51:19 overkill kernel: [11278.584059] delay: estimated 480, actual 96
Nov 10 13:51:19 overkill kernel: [11278.584066] ehci-pci 0000:00:1d.0: iso underrun ffff8800c07f9400 (788+56 < 6428)
Nov 10 13:51:19 overkill kernel: [11278.584071] delay: estimated 480, actual 96
```

I found this in Xorg.log.old, I'm not sure whether these appeared during the freeze because there are no timestamps, but I'll provide them just in case:

```
[ 11265.627] (WW) NVIDIA(0): WAIT (2, 4, 0x8000, 0x0000d6ac, 0x00001580)
[ 11273.663] (WW) NVIDIA(0): WAIT (1, 4, 0x8000, 0x0000d6ac, 0x00001580)
[ 11276.664] (WW) NVIDIA(0): WAIT (2, 4, 0x8000, 0x0000d6ac, 0x000026ac)
[ 11281.667] (WW) NVIDIA(0): WAIT (0, 4, 0x0000, 0x000026ac, 0x000026ac)
```
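One way to check whether the Xorg warnings belong to the freeze is to compare the bracketed numbers in both logs. This is only a sketch, and it rests on an assumption that is not confirmed anywhere in this thread: that the bracketed Xorg values and the kernel's bracketed values both count seconds on the same monotonic clock (roughly, seconds since boot). The sample lines are copied from the two log excerpts above; the function names are made up for illustration.

```python
import re

# Kernel lines in /var/log/messages carry "kernel: [seconds.micros]".
KERNEL_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")
# Xorg log lines start with "[ seconds.millis]".
XORG_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")

MESSAGES = [
    "Nov 10 13:51:13 overkill kernel: [11272.575627] [sched_delayed] sched: RT throttling activated",
    "Nov 10 13:51:19 overkill kernel: [11278.583775] ehci-pci 0000:00:1d.0: iso underrun ffff8800c07f9a00 (7316+56 < 14617)",
    "Nov 10 13:51:19 overkill kernel: [11278.584071] delay: estimated 480, actual 96",
]

XORG = [
    "[ 11265.627] (WW) NVIDIA(0): WAIT (2, 4, 0x8000, 0x0000d6ac, 0x00001580)",
    "[ 11273.663] (WW) NVIDIA(0): WAIT (1, 4, 0x8000, 0x0000d6ac, 0x00001580)",
    "[ 11276.664] (WW) NVIDIA(0): WAIT (2, 4, 0x8000, 0x0000d6ac, 0x000026ac)",
    "[ 11281.667] (WW) NVIDIA(0): WAIT (0, 4, 0x0000, 0x000026ac, 0x000026ac)",
]


def kernel_stamps(lines):
    """Extract the bracketed kernel timestamps (seconds) from syslog lines."""
    return [float(m.group(1)) for line in lines if (m := KERNEL_RE.search(line))]


def xorg_stamps(lines):
    """Extract the leading bracketed timestamps (seconds) from Xorg log lines."""
    return [float(m.group(1)) for line in lines if (m := XORG_RE.match(line))]


def overlap(a, b):
    """Return the (start, end) window covered by both stamp lists, or None."""
    start = max(min(a), min(b))
    end = min(max(a), max(b))
    return (start, end) if start <= end else None


window = overlap(kernel_stamps(MESSAGES), xorg_stamps(XORG))
print(window)
```

If the same-clock assumption holds, the NVIDIA WAIT warnings (11265 to 11281) bracket the RT throttling and iso underrun messages (11272 to 11278), with about six seconds of overlap, which would suggest they were logged during the freeze rather than earlier in the session.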

I keep my Gentoo machine up to date, running mostly amd64 stable, just some ~amd64 packages for a few userland programs like Gimp. This has been a constant problem for at least two years, back in the gentoo-sources-2.* days.

lspci -v

lsusb -v

emerge --info

Please help.

Last edited by Chopstix on Sun Nov 10, 2013 5:58 pm; edited 1 time in total

----------

## Hu

Please post the output of emerge --info.

The flashing LEDs usually indicate a kernel panic.

I see you are using an nVidia card.  Are you using the open nVidia driver or the proprietary nVidia driver?  If the latter, please try to reproduce the problem with an untainted kernel.

----------

## Chopstix

emerge --info link added to post, along with GPU info.

I use nvidia-drivers, currently 331.20.

What does "untainted kernel" mean? If you mean nouveau, I can't test that as I get 2fps in Xonotic and had some compatibility issues in Autopano Giga too.

----------

## Hu

If you load a proprietary kernel module, your kernel is tainted.  Proprietary modules have a history of causing weird problems and are difficult to support, so upstream generally does not support a kernel if the problem manifests only when the kernel is tainted.  If you cannot reproduce the problem with an untainted kernel, you will probably need to seek support from the provider of the proprietary module, nVidia.
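To see whether a running kernel counts as tainted, the kernel exposes a bitmask in `/proc/sys/kernel/tainted`. The following is a minimal sketch that decodes a few of the bits, not a complete table: only four flags are listed here, the helper names are invented for illustration, and which bits exist depends on the kernel version (the kernel's own documentation is the authority).

```python
# Partial table of taint bits: bit number -> (letter, meaning).
# Only a subset is listed; other bits are reported as unknown.
TAINT_FLAGS = {
    0: ("P", "proprietary module was loaded"),
    1: ("F", "module was force-loaded"),
    12: ("O", "out-of-tree module was loaded"),
    13: ("E", "unsigned module was loaded"),
}


def decode_taint(value):
    """Return a list of (letter, meaning) for every bit set in the taint mask."""
    flags = []
    bit = 0
    while value >> bit:
        if (value >> bit) & 1:
            flags.append(TAINT_FLAGS.get(bit, ("?", f"bit {bit}, not listed in this sketch")))
        bit += 1
    return flags


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the current taint bitmask from procfs (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())
```

On a Linux machine, `decode_taint(read_taint())` returns an empty list for an untainted kernel. With a proprietary module such as nvidia-drivers loaded, bit 0 (`P`) would be expected to be set, typically together with the out-of-tree bit.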

----------

## shazeal

Random errors generally indicate a hardware problem, not a software one. The GPU could be overheating, or the memory on the card could be bad. It could be some other component of the card, though. I have had several cards which exhibited similar behaviour, and exchanging them for a different card fixed it every time.

The easiest way to test this is to boot into a different OS like Windows; if you can crash it there too, it's definitely hardware. Otherwise monitor the temperatures: if the heatsink was never seated correctly, or the paste has dried out, it could cause something like this.

This was a popular card and is mature, so if Google does not come up with anything, it's generally your card at fault.
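For the temperature monitoring, here is a rough sketch that reads the kernel's thermal zones from sysfs. It is an assumption that these zones are present and meaningful on this laptop, and the proprietary nVidia driver may not report the GPU temperature there at all (nVidia's own tools would be needed for that). The function names and the 90 °C limit are arbitrary choices for illustration.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            # sysfs reports the temperature in millidegrees Celsius.
            milli = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone unreadable or not numeric, skip it
        readings[zone.name] = milli / 1000.0
    return readings


def over_limit(readings, limit_c=90.0):
    """Return the names of zones at or above limit_c (an arbitrary threshold)."""
    return [name for name, temp in readings.items() if temp >= limit_c]
```

Calling `read_thermal_zones()` in a loop over SSH while the game runs, and logging the result every few seconds, would show whether temperatures climb just before a freeze.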

----------

## Chopstix

I was worried that could be the case.

What about this error message, what does it mean?

ehci-pci 0000:00:1d.0: iso underrun

----------

## eccerr0r

It almost seems that video cards only last a few years and are guaranteed to die after that...

All of my discrete GPUs so far have developed problems after about 2 years of use.  The onboard ones tend to survive, and for the on-CPU ones I don't have enough data yet...

The USB iso underrun likely means that you have an isochronous stream on a USB device that didn't get enough data in time for whatever reason; it might be a timing problem caused by another device...

----------

