# Kernel 3.7.9 crashes (nouveau?)

## tomtom69

Hi,

Since the last kernel update from 3.6.11 to 3.7.9, I get frequent freezes that seem to be graphics-related.

It happens, for example, when using googleearth, mplayer or other applications that need more graphics bandwidth. glxgears is also a candidate to cause the lockup.

When the kernel crashes I only get stripes on the screen. Sometimes the picture freezes for some seconds until the stripes appear. No blinking LEDs, no chance to change terminal, no ssh, only reset is possible.

Unfortunately also no output in dmesg before the crash.

Chipset is Nvidia GeForce 7025 / nForce 630a.

Reverting to kernel 3.6.11 makes the crashes disappear.

Is this a known issue?

Kernel config is here:

http://pastebin.com/cv53m85H

tom

----------

## Maitreya

Stripes appearing after a crash would suggest a memory issue on the video card. This could be heat. Did any power-saving/fan options change during the kernel upgrade, or could you monitor temperatures?

----------

## Chris W

I've been seeing hard locks on NVidia hardware with 3.7.10 Gentoo sources (a MythTV box).  Have not found the cause yet: they're intermittent and do not seem to have a consistent trigger.  I'm rolling back to 3.6 and I'll see how it goes.

----------

## tomtom69

Hi,

The chipset is onboard and has no specific cooling or dedicated memory. A manual temperature check of the chipset does not show any "hot spots", so I assume temperature should not be an issue. With kernel 3.7 the crashes appear reproducibly at the start of a graphics-intensive application, when I would not expect any temperature rise, but never with kernel 3.6.11. 3D acceleration or similar is not used (no gaming etc.).

At the moment I hope that kernel 3.8 cures this. Until then I will stick with 3.6.11, which has worked without any problems since it went stable.

tom

----------

## TomWij

Nouveau was refactored in the 3.7 branch; you will definitely want to avoid that branch and use either an earlier kernel or the latest one.

----------

## ian.au

Irritating, I keep away from nVidia cards in general for my Gentoo machines.

I have an old Toshiba Satellite P-20 laptop using NV34M [GeForce FX Go5200 64M] which is totally broken under 3.7.10 (it loses graphics altogether as soon as the module comes up).

Another machine here runs GT218 [GeForce 210] graphics and seems to be fine so far under 3.7.10.

I.

----------

## TomWij

 *ian.au wrote:*   

> Another machine here runs GT218 [GeForce 210] graphics and seems to be fine so far under 3.7.10.

 

Fixed that one on my side. What does dmesg say? (Well, you can look at /var/log/messages after a reboot)

----------

## ian.au

 *Quote:*   

> Fixed that one on my side. What does dmesg say? (Well, you can look at /var/log/messages after a reboot)

 

[* Edit - correct broken card]

On the broken [GeForce FX Go5200 64M] machine, I just updated to kernel 3.8.4 and get the following dmesg. It is the same problem as with 3.7.10: the machine boots, but there is no screen.

Seems to just die silently. Machine is running fine, I'm ssh'd into it atm. 

```
ln1 ~ # dmesg |grep nouveau
[    9.134815] nouveau  [  DEVICE][0000:01:00.0] BOOT0  : 0x034400a2
[    9.134821] nouveau  [  DEVICE][0000:01:00.0] Chipset: NV34 (NV34)
[    9.134824] nouveau  [  DEVICE][0000:01:00.0] Family : NV30
[    9.136306] nouveau  [   VBIOS][0000:01:00.0] checking PRAMIN for image...
[    9.201109] nouveau  [   VBIOS][0000:01:00.0] ... checksum invalid
[    9.201116] nouveau  [   VBIOS][0000:01:00.0] checking PROM for image...
[    9.201142] nouveau  [   VBIOS][0000:01:00.0] ... signature not found
[    9.201145] nouveau  [   VBIOS][0000:01:00.0] checking ACPI for image...
[    9.201149] nouveau  [   VBIOS][0000:01:00.0] ... signature not found
[    9.201152] nouveau  [   VBIOS][0000:01:00.0] checking PCIROM for image...
[    9.202204] nouveau  [   VBIOS][0000:01:00.0] ... checksum invalid
[    9.202208] nouveau  [   VBIOS][0000:01:00.0] using image from PRAMIN
[    9.202212] nouveau  [   VBIOS][0000:01:00.0] BMP version 5.27
[    9.202314] nouveau  [   VBIOS][0000:01:00.0] version 04.34.20.25.00
[    9.202735] nouveau W[  PTIMER][0000:01:00.0] unknown input clock freq
[    9.202747] nouveau  [     PFB][0000:01:00.0] RAM type: DDR1
[    9.202752] nouveau  [     PFB][0000:01:00.0] RAM size: 64 MiB
[    9.202757] nouveau  [     PFB][0000:01:00.0]    ZCOMP: 0 tags
[    9.207606] nouveau  [     DRM] VRAM: 63 MiB
[    9.207616] nouveau  [     DRM] GART: 128 MiB
[    9.207804] nouveau  [     DRM] BMP BIOS found
[    9.207809] nouveau  [     DRM] BMP version 5.39
[    9.207817] nouveau  [     DRM] Bios version 04.34.20.25
[    9.207823] nouveau  [     DRM] DCB version 2.2
[    9.207832] nouveau  [     DRM] DCB outp 00: 030002f3 00000005
[    9.207836] nouveau  [     DRM] DCB outp 01: 01010100 00009c40
[    9.207840] nouveau  [     DRM] DCB outp 02: 02020321 00000003
[    9.208000] nouveau  [     DRM] Loading NV17 power sequencing microcode
[    9.208073] nouveau  [     DRM] BIOS FP mode: 1440x900 (96210kHz pixel clock)
[    9.208700] nouveau  [     DRM] Saving VGA fonts
[    9.293442] nouveau  [     DRM] 0 available performance level(s)
[    9.293447] nouveau  [     DRM] c: core 199MHz memory 405MHz
[    9.294374] nouveau  [     DRM] MM: using M2MF for buffer copies
[    9.294384] nouveau  [     DRM] Calling LVDS script 1:
[    9.294389] nouveau  [     DRM] Calling LVDS script 6:
[    9.294392] nouveau  [     DRM] 0xADAF: Parsing digital output script table
[    9.796236] nouveau  [     DRM] Setting dpms mode 3 on TV encoder (output 2)
[    9.843390] nouveau  [     DRM] allocated 1440x900 fb: 0x9000, bo f54aa200
[    9.843492] fbcon: nouveaufb (fb0) is primary device
[    9.855582] nouveau  [     DRM] Calling LVDS script 2:
[    9.855586] nouveau  [     DRM] 0xAEF7: Parsing digital output script table
[    9.903666] nouveau  [     DRM] Calling LVDS script 5:
[    9.903670] nouveau  [     DRM] 0xAD98: Parsing digital output script table
[    9.909290] nouveau 0000:01:00.0: fb0: nouveaufb frame buffer device
[    9.909294] nouveau 0000:01:00.0: registered panic notifier
[    9.909300] [drm] Initialized nouveau 1.1.0 20120801 for 0000:01:00.0 on minor 0
[  750.560023] nouveau  [     DRM] Calling LVDS script 6:
[  750.560026] nouveau  [     DRM] 0xADAF: Parsing digital output script table
[ 2034.748077] nouveau  [     DRM] Calling LVDS script 2:
[ 2034.748080] nouveau  [     DRM] 0xAEF7: Parsing digital output script table
[ 2034.796154] nouveau  [     DRM] Calling LVDS script 5:
[ 2034.796156] nouveau  [     DRM] 0xAD98: Parsing digital output script table
```

Paste of the entire dmesg here http://pastebin.com/ck0PfRsH

Cheers, 

Ian

Last edited by ian.au on Sun Apr 07, 2013 1:08 am; edited 2 times in total

----------

## TomWij

 *ian.au wrote:*   

> Seems to just die silently.

 

That's unfortunate, did you verify it is not your DE? (/var/log/Xorg.0.log, /var/log/messages, ...)

You may ask on IRC in #nouveau on FreeNode if there are additional debugging techniques for silent problems like these.

The other option is a kernel git bisect (http://wiki.gentoo.org/wiki/Kernel_git-bisect) if you have some time; I think that is the only way to really find the issue.

(See the help for git bisect itself: you can bisect a path, which limits the candidate commits to the nouveau directory and saves some reboots.)
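A minimal sketch of such a path-limited bisect, assuming a git clone of the kernel that carries the mainline tags v3.6 and v3.7. The helper name `bisect_nouveau` and the tree location in the example are made up for illustration; `drivers/gpu/drm/nouveau` is where the driver lives in the kernel tree.

```shell
# Hypothetical helper: start a bisect that only considers commits
# touching the nouveau driver directory.
#   $1 = path to a kernel git tree
#   $2 = known-bad revision (e.g. v3.7)
#   $3 = known-good revision (e.g. v3.6)
bisect_nouveau() {
    git -C "$1" bisect start "$2" "$3" -- drivers/gpu/drm/nouveau
}

# Example (not run here; the tree path is an assumption):
#   bisect_nouveau /usr/src/linux-git v3.7 v3.6
# Then build and boot the kernel git checked out, test it, and report:
#   git bisect good     # this kernel does not hang
#   git bisect bad      # this kernel hangs
# Repeat until git names the first bad commit, then clean up with:
#   git bisect reset
```

Limiting the bisect to one directory saves reboots, but it would miss a regression introduced elsewhere (for example in the DRM core), so an inconclusive result may call for a second run without the path.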

----------

## ian.au

 *TomWij wrote:*   

>  *ian.au wrote:*   Seems to just die silently. 
> 
> That's unfortunate, did you verify it is not your DE? (/var/log/Xorg.0.log, /var/log/messages, ...)
> 
> 

 

Sorry, I put the wrong card descriptor for the broken machine in the above post; I have corrected it now.

Well, the machine boots into the console and the screen dies on module loading, whilst processing uevents during boot, and never comes back for console login, so I can't think it has anything to do with the DE.

I use metalog, so there is no /var/log/messages, but the relevant logs are clean; to all intents and purposes the system thinks it's operating normally.

I'm thinking this problem is rooted at a lower level. I note that on the broken machine I have an uncachable register:

[GeForce FX Go5200 64M] x86

```
ln1 log # cat /proc/mtrr
reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x07ff80000 ( 2047MB), size=  512KB, count=1: uncachable
reg02: base=0x0e0000000 ( 3584MB), size=  128MB, count=1: write-combining
```

Whilst on the machine that runs fine: [GeForce 210] amd64

```
lw1 ~ # cat /proc/mtrr
reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x0c0000000 ( 3072MB), size=  256MB, count=1: write-combining
```

Maybe that's tripping the later kernel up on the x86 machine. Anyway, it is running fine on kernel 3.5.7, so I've reverted to that on the x86 arch for the time being.
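To compare MTRR layouts between machines quickly, something like the following could pull out the entries of one type. This is only a sketch: the function name `mtrr_of_type` is invented for illustration, and it assumes the `/proc/mtrr` line format shown above, where the memory type is the last field on each line.

```shell
# Hypothetical helper: print the /proc/mtrr-style lines (read from stdin)
# whose last field equals the requested type, e.g. uncachable,
# write-back or write-combining.
mtrr_of_type() {
    awk -v t="$1" '$NF == t { print }'
}

# Example (not run here):
#   mtrr_of_type uncachable < /proc/mtrr
```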

I may wait for the next stable release on x86 and see how that goes on the old laptop; I just don't have time to dig through this at the moment.

Thanks for taking an interest,

Ian

----------

## tomtom69

Update:

The problem persists with kernel 3.8.13.

Looks like I need to drop nouveau and use the nvidia blob again.

----------

## wcg

Does enabling the MTRR sanitizer by default help with the nforce630a chipset? I have an nforce430 machine with that enabled, and it works on that machine (no dmesg complaints about chunk size, etc). The gpu is Ge6150SE.

In your kernel .config, I noticed that you have

```
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0
```

(IIRC that disables it by default at boot.)
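A quick way to see how a kernel was configured in this respect, sketched under the assumption that the config is available either as a plain `.config` or as `/proc/config.gz` (the latter only exists when the kernel was built with CONFIG_IKCONFIG_PROC). The helper name `show_mtrr_config` is hypothetical.

```shell
# Hypothetical helper: print the MTRR-related options from a kernel config.
#   $1 = path to a plain .config or to a gzip-compressed config (*.gz)
show_mtrr_config() {
    case "$1" in
        *.gz) gzip -dc -- "$1" ;;
        *)    cat -- "$1" ;;
    esac | grep -E '^(# )?CONFIG_MTRR'
}

# Examples (not run here; paths depend on the system):
#   show_mtrr_config /usr/src/linux/.config
#   show_mtrr_config /proc/config.gz
```

For a one-off test without rebuilding, the x86 boot parameters `enable_mtrr_cleanup` and `disable_mtrr_cleanup` should override the compiled-in default for that boot.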

On a 990FX chipset machine with a gt218, the mtrr sanitizer code in the kernel simply does not work, but the gpu works fine with 3.7.10 nouveau anyway, meaning either the BIOS mtrr settings are usable or the kernel is using a newer alternative to mtrr available with newer cpu models (I saw a vague note about that while reading the help for various options in make menuconfig.)

So, nouveau in 3.7.10 is working with the Ge6150SE in the nforce430 chipset with the mtrr sanitizer enabled. I have not stressed it particularly, and it did hang once or twice, but the first hang went away when I re-emerged the xorg nouveau driver, and the second one seemed like something not related to video (flaky motherboard that complains about various weird power-related things from time to time; not reproducible).

Anyway, try enabling the MTRR sanitizer with the nforce630a and see if that helps nouveau on newer kernels.

edit:

Caveat: I compile xorg with "USE=-udev" and use the xorg mouse and keyboard drivers rather than evdev for input drivers. So I am possibly not getting uevents. (I do not know if that matters.)

----------

## wcg

Actually I just started seeing this on the nforce430-Ge6150SE board with 4gb of ram and scads of swap when firefox-17.0.{5,6} starts, on kernel 3.7.10 (I know I tested this after the kernel was installed, and it worked after re-emerging the xorg nouveau driver; apparently that was only luck) and kernel 3.8.13. No problem with kernel 3.5.7, and no problem with 3.7.10 on another box that has a 990fx chipset and a gt218 gpu in a PCIe slot.

I ran memtest86, guessing it might be a memory problem (that's what firefox does differently than most other processes that run in xfce4 in X, allocate lots of memory when it starts), but the dimms tested with 0 errors (actually I suspected a dimm socket rather than the dimms themselves, but neither produced any errors). Does not happen when I start evince and load a .pdf, gimp, emacs, etc.

Hangs the kernel (cannot ssh in and kill the process, because the kernel is no longer running, so the network is not responding). No messages in /var/log/*, no xorg messages, etc.

I have a spare nvidia PCIe x16 card I can try in the nforce430 box and see if it happens with a 3.7.10 or 3.8.x kernel. (The Ge6150SE onboard video will reserve some of main memory for a video memory buffer; a PCIe card will not, yet they will use the same video driver.)

But it could be other things than the video driver (something to do with chipset setup in the BIOS). The system hangs before firefox ever displays the page.

----------

## _______0

Upgrade the card's BIOS; many seem to have shipped with questionable stability.

Nouveau fixed an important bug in 3.9. I can't recall the details, but it might be worth trying. Bear in mind that nouveau lags behind in sophistication due to nvidia's attitude. On the other hand, radeons are rock solid.

I experienced similar issues, search my posts. What I did is to purposely stress the card with many things until triggering the darn bug. It's fun because it's predictable. Without stressing it's all aight though.

let me know about 3.9.

----------

## tomtom69

Hi,

I tried the following:

(1) CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=1

(2) Disabling MTRR support completely

(3) increase the shared memory size from 64MB to 256MB

but unfortunately none of them helped. The time to crash varies, but it was always less than a few minutes. Reminds me of Win9x, which caused me to move to Linux a long time ago.

A hardware defect seems unlikely because I see the fault on 2 systems with identical chipsets. I also have a system with a different chipset (GeForce 6150SE) but a nearly identical kernel .config, which works without problems using nouveau.

I'll give kernel 3.9 a try as soon as it hits stable and report the results.

tom

----------

## wcg

Yes, I don't think it is really hardware, or at least not the Ge 6150SE video hardware. That hardware works fine with nouveau on kernel 3.5.7. firefox-17.0.5 and firefox-17.0.6 load on that kernel and run with no problems. That gpu-driver combination worked fine on 3.3.8, 3.2.12, 3.1.6, and so on.

It is a new bug (new for me with 3.7.x kernels, anyway), and it is intermittent. It seems to depend on one's memory allocation pattern before the offending process runs whether the kernel hangs or not (reference the time firefox worked on 3.7.10 right after I re-emerged the xorg nouveau driver and restarted xorg).

So, I guess one would have to step up kernel versions one version at a time until seeing the hang, then git bisect it to find the actual kernel patch that allows it to happen. (Unless one has more money than time and can just replace the mb and/or gpu with products that don't trigger the hang.)

If it is fixed in kernel 3.9.x, that would be cool (someone else already found it and fixed it).

----------

## tomtom69

GE6150SE works for me on one machine. What does not work is GE7025.

But I do not know how to get and apply all the patches from 3.6.11 to 3.7.10 in order to see when the problem first appeared.

For filing a bug I think more information would be necessary than just "hangs with striped screen". However, the crash is so "quick and heavy" that the sysrq key does nothing and the log files are all empty.

----------

## wcg

I would have said kernel.org to get the source to any kernel version, but connecting to http://www.kernel.org/ seems to be broken for this. One gets a front page with a filtered list instead of a directory listing that one can navigate to a directory with base 3.x kernel source trees and then a long list of 3.x.x patches.

(I am *not* reconfiguring my router to let in remotely initiated connect attempts so that ftp works.)

However, connecting to https://www.kernel.org/pub/linux/kernel/v3.x/ gets one to the traditional directory listing of kernel source trees and version patches.

edit:

These would not be managed by portage, of course, and I don't know if genkernel will work with kernel.org source trees. Old school.
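A rough sketch of stepping through the 3.7.x releases by hand from that directory. It assumes that each `patch-3.7.N` file applies on top of the base `linux-3.7` tree rather than on the previous point release, and that the files are offered with `.xz` compression; both are assumptions about the listing, and the helper names are invented for illustration.

```shell
# Hypothetical helpers; the v3.x directory comes from the URL above,
# the .xz file names are an assumption about what that listing contains.
KERNEL_DIR_URL='https://www.kernel.org/pub/linux/kernel/v3.x'

# $1 = three-part version such as 3.7.4; prints the base tarball URL (linux-3.7).
kernel_base_url() {
    printf '%s/linux-%s.tar.xz\n' "$KERNEL_DIR_URL" "${1%.*}"
}

# $1 = three-part version; prints the URL of the patch taking the base tree to it.
kernel_patch_url() {
    printf '%s/patch-%s.xz\n' "$KERNEL_DIR_URL" "$1"
}

# Example (not run here): fetch and prepare the 3.7.4 tree by hand.
#   wget "$(kernel_base_url 3.7.4)" "$(kernel_patch_url 3.7.4)"
#   tar xf linux-3.7.tar.xz && cd linux-3.7
#   xz -dc ../patch-3.7.4.xz | patch -p1
```

Starting again from a fresh base tree for each point release avoids having to reverse one patch before applying the next.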

----------

## TomWij

Please file this at https://bugs.gentoo.org, as we can't actively track the forums to fix things. Thanks in advance.

----------

## tomtom69

OK. Bug 472200 submitted, hopefully with all the information needed.

----------

