# nvidia-driver crashes, how to debug?

## Iesos

Hi,

 I'm here to look for some advice and help.

 I have a Dell L502X (XPS 15) with an Optimus nvidia card, and when running some 3D programs, the nvidia driver can crash.

 I have been trying to debug this for some time now, and I'm _very_ sure it is not because of the optirun/bumblebee implementation, wine, or kernel configuration (since the same crash also exists in Windows). What I need to determine is whether this crash is due to a hardware failure or to a "feature" in both the Linux and Windows drivers.

 What I want help with is determining which it is, and how to convince Dell or nVidia that this is the case. So, here follows a description of my debugging efforts.

 There are a bunch of things that seem to be happening during the crash. The first thing is that the nvidia driver spits out an error:

 *Quote:*   

> Apr 21 16:51:39 localhost kernel: NVRM: Xid (0000:01:00): 13, 0003 00000000 00009197 00002480 0054a001 00000000
> 
> Apr 21 16:51:39 localhost kernel: NVRM: Xid (0000:01:00): 39, CCMDs 00000004 000090b5
> 
> Apr 21 16:51:41 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
> ...

 

The Xid errors are internal debugging messages for nvidia, and nvidia does not seem to answer, when asked, what these numbers mean.
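Even without knowing what the numbers mean, it can help to pull the Xid lines out of a long log so they can be quoted in a report. The following is only a rough sketch in Python: the regular expression is an assumption based on the two lines quoted above, the function name `parse_xid_lines` is made up for illustration, and the hex words after the Xid number are kept as opaque text.

```python
import re

# Pattern guessed from the log lines quoted in this post, e.g.
#   NVRM: Xid (0000:01:00): 13, 0003 00000000 ...
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<detail>.*)"
)


def parse_xid_lines(lines):
    """Return (pci_address, xid_number, raw_detail) for every Xid line."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("pci"), int(match.group("xid")), match.group("detail").strip())
            )
    return events


# Sample input copied from the excerpt above.
SAMPLE = [
    "Apr 21 16:51:39 localhost kernel: NVRM: Xid (0000:01:00): 13, 0003 00000000 00009197 00002480 0054a001 00000000",
    "Apr 21 16:51:39 localhost kernel: NVRM: Xid (0000:01:00): 39, CCMDs 00000004 000090b5",
    "Apr 21 16:51:41 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context",
]

if __name__ == "__main__":
    for pci, xid, detail in parse_xid_lines(SAMPLE):
        print(pci, xid, detail)
```

Feeding it the whole system log instead of `SAMPLE` would give the list of Xid numbers seen before each crash, which is probably the first thing a vendor would ask for.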

 The next thing that happens is different on different kernels, but that is because of a bug in the i915 driver, since this driver seems to crash:

 *Quote:*   

> Apr 21 16:52:03 localhost kernel: ------------[ cut here ]------------
> 
> Apr 21 16:52:03 localhost kernel: WARNING: at drivers/gpu/drm/i915/i915_irq.c:652 ironlake_irq_handler+0x4f2/0x500()
> 
> Apr 21 16:52:03 localhost kernel: Hardware name: Dell System XPS L502X
> ...

 

This does not seem to be a problem, and I can get rid of this message using kernel versions >=3.3.

 Then I get

 *Quote:*   

> Apr 21 16:52:07 localhost kernel: ACPI Exception: AE_TIME, Returned by Handler for [EmbeddedControl] (20110623/evregion-478)
> 
> Apr 21 16:52:07 localhost kernel: ACPI Error:

 

and another i915 warning. Then

 *Quote:*   

> Apr 21 16:52:15 localhost kernel: Clocksource tsc unstable (delta = -1955188294 ns)
> 
> Apr 21 16:52:15 localhost kernel: Switching to clocksource hpet

 

I googled around for the "hpet" line, and found that I should switch to hpet at boot to get rid of this, but that only makes the message go away; it does not solve the nvidia crash.

Then there are some more i915 warnings and

 *Quote:*   

> Apr 21 16:52:45 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
> 
> Apr 21 16:52:49 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
> 
> Apr 21 16:52:51 localhost kernel: NVRM: GPU at 0000:01:00.0 has fallen off the bus.

 

This is the last thing I hear from the nvidia-driver. And then I have:

 *Quote:*   

> Apr 21 16:53:51 localhost kernel: INFO: rcu_sched detected stall on CPU 3 (t=6000 jiffies)
> 
> Apr 21 16:53:51 localhost kernel: Pid: 23864, comm: SC2.exe Tainted: P        W  O 3.2.12-gentoo-jesus19 #4
> 
> Apr 21 16:53:51 localhost kernel: Call Trace:
> ...

 

These rcu messages seem to say that SC2.exe has occupied CPU #3 and, according to the trace, it is nvidia's fault.

The full crash log can be found here: http://pastebin.com/wYcFtGCY

In Windows, the same crash exists. However, the only error I can get is "The nvidia driver stopped responding and has now been reloaded".

So far I have tried wine versions from 1.2.something to 1.5.something and several versions of the nvidia-driver (in Linux and Windows). I have also upgraded the BIOS and reinstalled bumblebee.

So, more direct questions I have:

 - What does the hpet message tell me? Does it matter?

 - What does "GPU has fallen off the bus" mean?

 - What do the rcu messages mean? Can they be circumvented?

 - What else can I do to get more information about this crash?

 - Is the crash due to the nvidia-driver or because of a hardware problem?

----------

## eccerr0r

If it also crashes in Windows, it's very likely it's a hardware issue.

HPET is the High Precision Event Timer; it's generally something good unless your hardware doesn't work.

The Time Stamp Counter (TSC) is a counter internal to the CPU, and when it's "unstable" it means software has detected it's not monotonically increasing and thus cannot be used to measure time.
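If you want to see which clocksource the kernel ended up on after such a message, Linux exposes it through sysfs. Here is a small Python sketch, assuming the usual `/sys/devices/system/clocksource/clocksource0` directory is present; the helper name `read_clocksources` is just something made up for this example.

```python
from pathlib import Path

# Standard sysfs location on Linux; assumed to exist on the machine in question.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources, or (None, []) if unreadable."""
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        # Not Linux, or sysfs not mounted: report nothing rather than crash.
        return None, []
    return current, available


if __name__ == "__main__":
    current, available = read_clocksources()
    print("current:", current)
    print("available:", ", ".join(available))
```

If `current` reads `hpet` after the crash while `tsc` is still listed as available at boot, that matches the "Switching to clocksource hpet" line in the log. It only tells you about the timer though, not about the GPU.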

The "falling off the bus" error is a serious but humorously worded message meaning that the GPU seems to have "disappeared" from the bus and will no longer communicate.  It could be due to it "disconnecting" for normal reasons but likely it's due to hardware issues.

Likely the stall was due to some bad software-hardware interaction.  When the GPU "falls off the bus", bad software sometimes does not handle this "condition" properly and hangs the CPU, and the kernel detected this situation.

I'd probably first check the fan of the GPU and see if it's clean.  Make sure it's not overheating.

If it's clean and not overheating I'd look into an RMA.  And there's a reason why I like desktops over laptops: I can toss out the video card and get another one to test my assumptions...  The closed source proprietary driver makes it that much harder to debug too. If you can use the OSS Nouveau driver, it could give more hints or even work properly because it uses a different method of accessing the hardware... but I suspect it won't work properly either.

I've found the Nvidia closed source driver very stable, and any issues with it tend to be hardware problems, provided it was not force-built against a kernel it was not meant to be linked with.  I had two nvidia cards (a GeForce4 MX420 and a GeForce 8400GS); both worked perfectly until they were removed from their systems (one died due to fan failure, the other I no longer needed after getting the onboard video working).

----------

## MRJonnyH

Hi, I just registered because I'm also looking for safe config settings for XPS l502x.

I have an AOC LCD upstairs, connecting via HDMI>DVI adaptor. 

 This exhibits problems (the screen goes blank except for a large, pixelated square....)

 Sometimes it causes the OS to freeze (hard reset required); sometimes it handles the exception more gracefully (a system notification message ambiguously reports 'nvidia display driver service stopped/restarted' {sic}).

We also have a HUGE Samsung HDMI TV. This one also has a bug similar to the above, except that sometimes the OS freezes and a hard reset is required, but there are no critical errors/warnings in Event Viewer.

So how to debug?

I have cpu-z (just downloaded gpu-z also). 

Not really a gamer (except backgammon), just really want a bulletproof build.

Any suggestions welcome (I don't know how the above poster even got that level of detail... hence my registering and waffling....)

THANKS !

----------

## Hu

 *MRJonnyH wrote:*   

>  Sometimes it causes the OS to freeze (hard reset required), 
> 
>  sometimes it handles the exception more gracefully (system notification message ambiguously reports 'nvidia display driver service stopped/restarted' {sic}
> 
> I have cpu-z (just downloaded gpu-z also).

 These statements make me wonder whether you are in the right place.  The first hit I found for gpu-z is a Windows-only program, so unless you are dual-booting Windows with Gentoo Linux, I do not see how you could use gpu-z and also be running the right environment for us to help you.  Similarly, the "system notification message" makes me suspect you are using Windows.  Although some Linux desktops offer something like that, I am not aware of any which would use that terminology and resolve a problem in that way.

This forum is dedicated to Gentoo Linux in particular, though we can handle more general Linux questions in some cases.  You might find someone here who can help you with a Windows problem, but that would be by luck.  I do not mean to run you off, but if you need help with a Windows-specific problem, there are other places that are more likely to give you a timely and detailed answer.

If I am wrong and you are using some form of Linux, then I apologize and suggest you start by posting the output of `dmesg | tail -n200` right after a hang recovery occurs.
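If 200 lines of kernel log is too much to read through, something like the Python sketch below can narrow it down first. This is only an illustration: the marker list is an assumption taken from the log excerpts earlier in this thread, the function names are made up, and `dmesg` may need root on systems that restrict access to the kernel log.

```python
import subprocess

# Lowercase substrings chosen from the log excerpts earlier in this thread.
MARKERS = ("nvrm:", "i915", "acpi", "clocksource", "rcu_sched")


def last_lines(text, n=200):
    """Rough equivalent of `tail -n<n>`: keep the last n lines of text."""
    lines = text.splitlines()
    return lines[-n:] if n > 0 else []


def flag_lines(lines, markers=MARKERS):
    """Keep only lines containing one of the markers (case-insensitive)."""
    return [ln for ln in lines if any(m in ln.lower() for m in markers)]


def read_dmesg():
    """Return the kernel log as text, or an empty string if dmesg cannot run."""
    try:
        return subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    except OSError:
        return ""


if __name__ == "__main__":
    for ln in flag_lines(last_lines(read_dmesg())):
        print(ln)
```

Please still post the unfiltered output as well, since the lines around a flagged message are often what explains it.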

----------

