# Nvidia 340.76 + kernel 3.19.8-gentoo + Quadro FX 570

## davidm

Hi, I used the Nouveau driver for over a year, but recently it started freezing up and showing various errors in dmesg and syslog. (This began almost exactly after upgrading from Plasma 5.3.1 to Plasma 5.3.2, which could be a coincidence.) So I switched over to the latest Nvidia proprietary driver for my hardware, which is 340.76. I then downgraded to the latest compatible kernel I could use without special patching, which is kernel 3.19.8-gentoo.

The problem is that, although the Nvidia binary driver seems to crash less often and with fewer errors, I am still seeing occasional graphics crashes and odd errors in dmesg and syslog. The latest error is below. This one caused a freeze for a bit and seemed to occur while using Google Chrome (there seems to be a correlation, as problems are more common when Chrome is running).

```

Jul  9 12:53:24 gentoo kernel: NVRM: GPU at PCI:0000:01:00: GPU-86a1f1c5-cdc2-019c-9551-935a3421a183

Jul  9 12:53:24 gentoo kernel: NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000f, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c

Jul  9 12:53:26 gentoo kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Jul  9 12:53:30 gentoo kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Jul  9 12:53:32 gentoo kernel: ------------[ cut here ]------------

Jul  9 12:53:32 gentoo kernel: WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x22e/0x240()

Jul  9 12:53:32 gentoo kernel: NETDEV WATCHDOG: enp4s0 (tg3): transmit queue 0 timed out

Jul  9 12:53:32 gentoo kernel: Modules linked in: nvidia(PO) sha1_generic

Jul  9 12:53:32 gentoo kernel: CPU: 2 PID: 0 Comm: swapper/2 Tainted: P           O   3.19.8-gentoo #1

Jul  9 12:53:32 gentoo kernel: Hardware name: Dell Inc. Precision WorkStation T3400  /0TP412, BIOS A09 06/04/2009

Jul  9 12:53:32 gentoo kernel:  ffffffff81a0094e ffff88023bc83d78 ffffffff8173ce0f 0000000000000007

Jul  9 12:53:32 gentoo kernel:  ffff88023bc83dc8 ffff88023bc83db8 ffffffff8104e565 ffff88023bc83db8

Jul  9 12:53:32 gentoo kernel:  0000000000000000 ffff8802318f2000 0000000000000002 0000000000000005

Jul  9 12:53:32 gentoo kernel: Call Trace:

Jul  9 12:53:32 gentoo kernel:  <IRQ>  [<ffffffff8173ce0f>] dump_stack+0x45/0x57

Jul  9 12:53:32 gentoo kernel:  [<ffffffff8104e565>] warn_slowpath_common+0x85/0xc0

Jul  9 12:53:32 gentoo kernel:  [<ffffffff8104e5e1>] warn_slowpath_fmt+0x41/0x50

Jul  9 12:53:32 gentoo kernel:  [<ffffffff8168fa3e>] dev_watchdog+0x22e/0x240

Jul  9 12:53:32 gentoo kernel:  [<ffffffff8168f810>] ? dev_graft_qdisc+0x80/0x80

Jul  9 12:53:32 gentoo kernel:  [<ffffffff810a14a9>] call_timer_fn+0x39/0x110

Jul  9 12:53:32 gentoo kernel:  [<ffffffff8168f810>] ? dev_graft_qdisc+0x80/0x80

Jul  9 12:53:32 gentoo kernel:  [<ffffffff810a1773>] run_timer_softirq+0x1f3/0x2d0

Jul  9 12:53:32 gentoo kernel:  [<ffffffff8105247f>] __do_softirq+0x9f/0x270

Jul  9 12:53:32 gentoo kernel:  [<ffffffff81052785>] irq_exit+0x85/0x90

Jul  9 12:53:32 gentoo kernel:  [<ffffffff81033521>] smp_apic_timer_interrupt+0x41/0x50

Jul  9 12:53:32 gentoo kernel:  [<ffffffff8174406a>] apic_timer_interrupt+0x6a/0x70

Jul  9 12:53:32 gentoo kernel:  <EOI>  [<ffffffff8100c2e8>] ? mwait_idle+0x68/0x90

Jul  9 12:53:32 gentoo kernel:  [<ffffffff8100ca4a>] arch_cpu_idle+0xa/0x10

Jul  9 12:53:32 gentoo kernel:  [<ffffffff810854c1>] cpu_startup_entry+0x321/0x360

Jul  9 12:53:32 gentoo kernel:  [<ffffffff8103189a>] start_secondary+0x13a/0x150

Jul  9 12:53:32 gentoo kernel: ---[ end trace 2eeacde4ce2772e9 ]---

Jul  9 12:53:32 gentoo kernel: tg3 0000:04:00.0 enp4s0: transmit timed out, resetting

Jul  9 12:53:32 gentoo kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: 0x00000000: 0x167a14e4, 0x00100406, 0x02000002, 0x00000010

Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: 0x00000010: 0xf9ef0004, 0x00000000, 0x00000000, 0x00000000

Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: 0x00000020: 0x00000000, 0x00000000, 0x00000000, 0x02141028

```

(the last few lines repeat with different data for hundreds of lines)

```

Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: 0x00007030: 0x00000000, 0x00000000, 0x000100c0, 0x00000000

Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: 0x00007400: 0x00000000, 0x000000aa, 0x00000000, 0x00000000

Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: 0: Host status block [00000001:0000005e:(0000:00e4:0000):(00e4:006c)]

Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: 0: NAPI info [00000059:00000059:(006c:006c:01ff):00df:(01a7:0000:0000:0000)]

Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2

Jul  9 12:53:34 gentoo kernel: tg3 0000:04:00.0 enp4s0: Link is down

Jul  9 12:53:34 gentoo NetworkManager[3537]: <info>  (enp4s0): link disconnected (deferring action for 4 seconds)

Jul  9 12:53:34 gentoo kernel: NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000f, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c

Jul  9 12:53:36 gentoo kernel: tg3 0000:04:00.0 enp4s0: Link is up at 100 Mbps, full duplex

Jul  9 12:53:36 gentoo kernel: tg3 0000:04:00.0 enp4s0: Flow control is on for TX and on for RX

Jul  9 12:53:36 gentoo NetworkManager[3537]: <info>  (enp4s0): link connected

```

Also, here is another example of an error from the past few days. This one is a bit different:

```

Jul  8 16:03:57 gentoo kernel: NVRM: GPU at PCI:0000:01:00: GPU-86a1f1c5-cdc2-019c-9551-935a3421a183

Jul  8 16:03:57 gentoo kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0014, Class 00005039, Offset 00000100, Data 00000000

Jul  8 16:07:48 gentoo kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0014, Class 00005039, Offset 00000100, Data 00000000

```

I've been researching, and some have suggested it is possibly hardware related; others were not so sure. I guess I might try the 3.18.x LTS kernel series to test it out and also check my hardware more thoroughly (I have ECC RAM), but does anyone else have any experience or suggestions with this? It seems pretty odd that it disrupted my internet connection as well, according to the logs. Thermally there does not seem to be an issue either: the GPU is at 60 degrees Celsius according to nvidia-settings, and CPU core temps are also great:

```

sensors

coretemp-isa-0000

Adapter: ISA adapter

Core 0:       +38.0°C  (high = +84.0°C, crit = +100.0°C)

Core 1:       +37.0°C  (high = +84.0°C, crit = +100.0°C)

Core 2:       +33.0°C  (high = +84.0°C, crit = +100.0°C)

Core 3:       +35.0°C  (high = +84.0°C, crit = +100.0°C)

```

Edit: Still investigating. https://forums.gentoo.org/viewtopic-t-1008370-view-next.html?sid=4a533fc859cd6f232068f9d7d35a8473 is possibly related.

----------

## davidm

Hmmm. Google Chrome definitely seems to aggravate the error. I use Firefox as my main browser and Chrome only for certain tasks. Once again I got the same main error (Xid 69, "Class Error") while playing a YouTube video in Chrome, switching to Kate, and then attempting to switch back to Chrome. Upon clicking the Chrome entry in the KDE plasma 5.2 panel bar, this time X appears to have fully crashed and I was dumped back to sddm to log in again.

```

NVRM: GPU at PCI:0000:01:00: GPU-86a1f1c5-cdc2-019c-9551-935a3421a183

[26353.720452] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000f, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c 

[26355.720529] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

[26359.720597] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

[26361.711007] ------------[ cut here ]------------

[26361.711018] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x22e/0x240()

[26361.711020] NETDEV WATCHDOG: enp4s0 (tg3): transmit queue 0 timed out

[26361.711022] Modules linked in: nvidia(PO) sha1_generic

[26361.711028] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P           O   3.19.8-gentoo #1

[26361.711030] Hardware name: Dell Inc. Precision WorkStation T3400  /0TP412, BIOS A09 06/04/2009

[26361.711032]  ffffffff81a0094e ffff88023bc83d78 ffffffff8173ce0f 0000000000000007

[26361.711035]  ffff88023bc83dc8 ffff88023bc83db8 ffffffff8104e565 ffff88023bc83db8

[26361.711037]  0000000000000000 ffff8802318f2000 0000000000000002 0000000000000005

[26361.711040] Call Trace:

[26361.711042]  <IRQ>  [<ffffffff8173ce0f>] dump_stack+0x45/0x57

[26361.711051]  [<ffffffff8104e565>] warn_slowpath_common+0x85/0xc0

[26361.711054]  [<ffffffff8104e5e1>] warn_slowpath_fmt+0x41/0x50

[26361.711056]  [<ffffffff8168fa3e>] dev_watchdog+0x22e/0x240

[26361.711059]  [<ffffffff8168f810>] ? dev_graft_qdisc+0x80/0x80

[26361.711062]  [<ffffffff810a14a9>] call_timer_fn+0x39/0x110

[26361.711065]  [<ffffffff8168f810>] ? dev_graft_qdisc+0x80/0x80

[26361.711067]  [<ffffffff810a1773>] run_timer_softirq+0x1f3/0x2d0

[26361.711070]  [<ffffffff8105247f>] __do_softirq+0x9f/0x270

[26361.711073]  [<ffffffff81052785>] irq_exit+0x85/0x90

[26361.711077]  [<ffffffff81033521>] smp_apic_timer_interrupt+0x41/0x50

[26361.711080]  [<ffffffff8174406a>] apic_timer_interrupt+0x6a/0x70

[26361.711081]  <EOI>  [<ffffffff8100c2e8>] ? mwait_idle+0x68/0x90

[26361.711087]  [<ffffffff8100ca4a>] arch_cpu_idle+0xa/0x10

[26361.711091]  [<ffffffff810854c1>] cpu_startup_entry+0x321/0x360

[26361.711094]  [<ffffffff8103189a>] start_secondary+0x13a/0x150

[26361.711096] ---[ end trace 2eeacde4ce2772e9 ]---

[26361.711101] tg3 0000:04:00.0 enp4s0: transmit timed out, resetting

[26361.729454] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

[26363.528714] tg3 0000:04:00.0 enp4s0: 0x00000000: 0x167a14e4, 0x00100406, 0x02000002, 0x00000010

[26363.528718] tg3 0000:04:00.0 enp4s0: 0x00000010: 0xf9ef0004, 0x00000000, 0x00000000, 0x00000000                                     

[26363.528721] tg3 0000:04:00.0 enp4s0: 0x00000020: 0x00000000, 0x00000000, 0x00000000, 0x02141028                                     

[26363.528723] tg3 0000:04:00.0 enp4s0: 0x00000030: 0xce9e0000, 0x00000048, 0x00000000, 0x0000010a                                     

[26363.528726] tg3 0000:04:00.0 enp4s0: 0x00000040: 0x00000000, 0x00000000, 0xc0035001, 0x64002008      

```

...

```

[26363.529276] tg3 0000:04:00.0 enp4s0: 0x00007400: 0x00000000, 0x000000aa, 0x00000000, 0x00000000

[26363.529282] tg3 0000:04:00.0 enp4s0: 0: Host status block [00000001:0000005e:(0000:00e4:0000):(00e4:006c)]

[26363.529285] tg3 0000:04:00.0 enp4s0: 0: NAPI info [00000059:00000059:(006c:006c:01ff):00df:(01a7:0000:0000:0000)]

[26363.635023] tg3 0000:04:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2

[26363.655520] tg3 0000:04:00.0 enp4s0: Link is down

[26363.849455] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000f, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c

[26365.335786] tg3 0000:04:00.0 enp4s0: Link is up at 100 Mbps, full duplex

[26365.335796] tg3 0000:04:00.0 enp4s0: Flow control is on for TX and on for RX

[30352.748998] kactivitymanage[4503]: segfault at 7ff4c094ec50 ip 00007ff4b0082f91 sp 00007ffded128ad8 error 4 in libQt5Sql.so.5.4.2[7ff4b006f000+3f000]

```

Note: I do not believe the kactivity segfault is related as that happened previously as well.
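
To see how closely the tg3 watchdog timeout follows the GPU error in a dmesg excerpt like the one above, a small Python sketch along these lines may help. It assumes dmesg-style `[seconds.micros]` timestamps, and `xid_to_watchdog_delay` is a hypothetical helper name, not part of any existing tool.

```python
import re

# dmesg-style prefix, e.g. "[26353.720452] message"
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

def xid_to_watchdog_delay(lines):
    """Seconds from the first NVRM Xid line to the first NETDEV WATCHDOG
    line that follows it, or None if either one is missing."""
    first_xid = None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # skip lines without a dmesg timestamp
        ts, msg = float(m.group(1)), m.group(2)
        if first_xid is None and "NVRM: Xid" in msg:
            first_xid = ts
        elif first_xid is not None and "NETDEV WATCHDOG" in msg:
            return ts - first_xid
    return None

# Lines copied from the log excerpt in this post.
sample = [
    "[26353.720452] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000f, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c",
    "[26355.720529] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context",
    "[26361.711020] NETDEV WATCHDOG: enp4s0 (tg3): transmit queue 0 timed out",
]
delay = xid_to_watchdog_delay(sample)
print(f"watchdog fired {delay:.2f}s after the first Xid")
```

On the excerpt above this works out to about 8 seconds between the first Xid 69 and the NETDEV WATCHDOG warning. A gap like that is compatible with the GPU fault stalling the system long enough for the NIC's transmit watchdog to fire, but that is only a guess and not something these logs prove.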

I might at this point try 'emerge -eva @world' just to help rule out missing something when transitioning to the nvidia binary.

From investigating the XID errors:

http://docs.nvidia.com/deploy/xid-errors/

XID 13 suggests basically everything but hardware.

XID 69 is not documented at that link, but it is a "Class Error"; others are seeing it too, although it seems to be rather mysterious.
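
As a rough aid for reading logs like the ones in this thread, here is a minimal Python sketch that tallies NVRM Xid events by code. The regular expression is an assumption based only on the log lines quoted here, and `tally_xids` is a hypothetical helper name, not part of any Nvidia tool.

```python
import re
from collections import Counter

# Matches NVRM Xid lines as they appear in the logs quoted in this thread,
# with either a syslog prefix or a dmesg timestamp in front.
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<xid>\d+), (?P<kind>[^:]+):")

def tally_xids(lines):
    """Count Xid events keyed by (numeric code, description)."""
    counts = Counter()
    for line in lines:
        m = XID_RE.search(line)
        if m:
            counts[(int(m.group("xid")), m.group("kind").strip())] += 1
    return counts

# Lines copied from the logs in this thread.
sample = [
    "[11355.938861] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000e, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c",
    "[11356.266446] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 000e, Class 00008297, Offset 00000104, Data 00000000",
    "Jul  8 16:03:57 gentoo kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0014, Class 00005039, Offset 00000100, Data 00000000",
    "[11357.254366] tg3 0000:04:00.0 enp4s0: Link is up at 100 Mbps, full duplex",
]
xid_counts = tally_xids(sample)
for (code, kind), n in sorted(xid_counts.items()):
    print(f"Xid {code} ({kind}): {n}")
```

Feeding it the output of `dmesg` or a syslog file line by line gives a quick view of which Xid codes dominate, which can then be looked up in the Nvidia Xid documentation linked above.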

----------------

edit 1 - 3:10 PM EST:

Attempting '-vdpau' with 'emerge -uavDN --with-bdeps=y @world'.

I'm thinking it could possibly be related to vdpau use with Google Chrome and mesa.

--------------------

----------

## davidm

Wow. I'm going to go ahead and call it "Solved" for the moment. After a couple of hours of testing, setting -vdpau in make.conf and running "emerge -uavDN --with-bdeps=y @world" seems to have solved it. In particular, I suspect it was removing vdpau from mesa that did the trick. The card is so weak that vdpau makes little difference next to the quad-core processor, so I hardly see a performance difference.

I will update this post if it returns, and mark the subject as solved if it does not recur within 24 hours. For anyone else finding these errors on similar hardware, you may want to consider the workaround above.

----------

## davidm

Hmmm.

```

[11355.918372] chrome[24441]: segfault at 0 ip 00007f875b08f336 sp 00007ffc1cd4ba70 error 4 in libnvidia-glcore.so.340.76[7f87598f8000+1e5e000]

[11355.938861] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000e, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c

[11356.266446] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 000e, Class 00008297, Offset 00000104, Data 00000000

[11356.490216] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 000e, Class 00008297, Offset 00000104, Data 00000000

[11356.954835] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000e, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c

[11357.238227] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000e, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c

[11357.254366] tg3 0000:04:00.0 enp4s0: Link is up at 100 Mbps, full duplex

[11357.254372] tg3 0000:04:00.0 enp4s0: Flow control is on for TX and on for RX

[11359.238284] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

[11361.238296] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

[11363.238437] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

[11365.959362] chrome[26527]: segfault at 0 ip 00007f9569f86336 sp 00007ffd6a6f2770 error 4 in libnvidia-glcore.so.340.76[7f95687ef000+1e5e000]

```

I guess it isn't solved after all, although this time it just caused kwin to restart without forcing the whole X server to restart as it did last time. I'm not sure if that is a real improvement or just coincidence. The log now shows Chrome segfaulting inside an nvidia library (libnvidia-glcore), so perhaps I need to look further upstream and maybe report a bug. I will be checking to see if I can get it to happen when not running Chrome.
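
One way to check whether the Chrome segfaults and the Xid errors really travel together is to pair them up by timestamp. The following Python sketch does that for dmesg-style lines. The patterns are assumptions derived from the lines quoted above, `segfaults_near_xid` is a hypothetical helper name, and the one-second window is an arbitrary choice.

```python
import re

TS_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")
SEGV_RE = re.compile(r"^(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at .* in (?P<lib>\S+?)\[")
XID_RE = re.compile(r"NVRM: Xid \([^)]+\): (?P<xid>\d+),")

def segfaults_near_xid(lines, window=1.0):
    """Return (timestamp, process, library, nearby Xid codes) for each
    segfault with at least one Xid event within `window` seconds."""
    segfaults, xids = [], []
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # skip lines without a dmesg timestamp
        ts, msg = float(m.group("ts")), m.group("msg")
        s = SEGV_RE.match(msg)
        if s:
            segfaults.append((ts, s.group("proc"), s.group("lib")))
            continue
        x = XID_RE.search(msg)
        if x:
            xids.append((ts, int(x.group("xid"))))
    hits = []
    for ts, proc, lib in segfaults:
        near = sorted({code for xts, code in xids if abs(xts - ts) <= window})
        if near:
            hits.append((ts, proc, lib, near))
    return hits

# Lines copied from the log excerpt in this post.
sample = [
    "[11355.918372] chrome[24441]: segfault at 0 ip 00007f875b08f336 sp 00007ffc1cd4ba70 error 4 in libnvidia-glcore.so.340.76[7f87598f8000+1e5e000]",
    "[11355.938861] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000e, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c",
    "[11356.266446] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 000e, Class 00008297, Offset 00000104, Data 00000000",
    "[11357.238227] NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000e, Class 0000502d, Offset 00000250, Data ffffffff, ErrorCode 0000000c",
    "[11365.959362] chrome[26527]: segfault at 0 ip 00007f9569f86336 sp 00007ffd6a6f2770 error 4 in libnvidia-glcore.so.340.76[7f95687ef000+1e5e000]",
]
hits = segfaults_near_xid(sample)
for ts, proc, lib, codes in hits:
    print(f"{ts:.6f} {proc} crashed in {lib}, Xid codes nearby: {codes}")
```

Running the same check on a session without Chrome would show whether the Xid errors still appear on their own, which is the experiment described above. A match in time does not say which event caused the other.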

----------

