# UDEV vs NVIDIA == unusable OpenCL

## tuskub

Hi guys,

About 3-4 weeks ago, after performing a regular update, I ran into bug #454740 / #667362 (other bug reports related to the same problem may exist).

The symptoms are well known: udev is eating 100% of one CPU thread (looks like /lib/udev/nvidia-udev.sh is calling nvidia-smi which hangs), X cannot start etc.

All of this and possible workarounds are described here:

https://bugs.gentoo.org/667362

https://bugs.gentoo.org/670340

https://forums.gentoo.org/viewtopic-t-1089100-start-0.html

https://forums.gentoo.org/viewtopic-p-8280144.html

https://forums.gentoo.org/viewtopic-t-1061632-start-0.html

https://forums.gentoo.org/viewtopic-t-1082884-start-0.html

What I have tried so far:

1. rc_parallel="NO" in rc.conf - didn't help

2. adding sleep 5 to nvidia-udev.sh - didn't help

3. commenting out nvidia-smi call in nvidia-udev.sh  - didn't help

4. installing different versions of udev and kernel - didn't make a difference

5. adding modules="nvidia" to /etc/conf.d/modules - does work, but gives me an error during boot: "modprobe: ERROR: could not insert 'nvidia' : Module already in kernel". Otherwise udev doesn't go into an infinite loop, X starts etc.

6. blacklisting all nvidia modules in /etc/modprobe.d/blacklist.conf - does work
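For workaround 6, it can help to double-check which modules a modprobe.d file actually blacklists. The following is only a minimal sketch: the module names in the sample text are an assumption (the post only says "all nvidia modules"), and the helper name `blacklisted_modules` is made up for illustration.

```python
def blacklisted_modules(conf_text):
    """Return the set of module names named by 'blacklist' lines."""
    names = set()
    for line in conf_text.splitlines():
        line = line.strip()
        # Blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) == 2 and parts[0] == "blacklist":
            names.add(parts[1])
    return names


# Illustrative contents only; the exact module list is an assumption.
example = """\
# /etc/modprobe.d/blacklist.conf
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
blacklist nvidia_uvm
"""

print(sorted(blacklisted_modules(example)))
```

To check a real system, the text of `/etc/modprobe.d/blacklist.conf` could be read from disk and passed to the same function.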

So 5 or 6 look like viable workarounds but there is always one big "BUT": I've noticed that OpenCL is dead slow, really dead slow.

It takes clinfo roughly 8 seconds to print the information about the available OpenCL platforms. Blender needs the same 8 seconds to display the Preferences / System dialog, and SideFX Houdini spends the same 8 seconds displaying its About dialog. Both Blender and Houdini query the OpenCL platforms the same way clinfo does, so it is no wonder the times are similar.
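To put a repeatable number on the "roughly 8 seconds", the platform query can be timed from a small script. This is a sketch, assuming `clinfo` is in PATH; the helper name `time_command` is made up for illustration, and any other command can be substituted.

```python
import subprocess
import time


def time_command(argv):
    """Run argv to completion and return the wall-clock seconds it took."""
    start = time.perf_counter()
    # Output is discarded; only the elapsed time is of interest here.
    subprocess.run(argv, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start


if __name__ == "__main__":
    cmd = ["clinfo"]  # the command measured in the post
    try:
        print(f"{' '.join(cmd)}: {time_command(cmd):.2f} s")
    except FileNotFoundError:
        print(f"{cmd[0]} not found in PATH")
```

Running it a few times on both machines would make the comparison less dependent on a stopwatch.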

On top of that, AMD Radeon ProRender is extremely slow in Houdini. It takes about 24-32 seconds between pressing the "render" button and the first pixel being rendered, and during rendering the mouse gets very sluggish: these are not micro freezes but rather "macro" ones.

The hardware I'm talking about is AMD Threadripper 1950X + Nvidia GTX 1070 Ti.

Software: kernels 5.4.38, 5.6.14, OpenRC 0.42.1, UDEV 243-r2 (I've tried 245, no changes), nvidia-drivers-440.82-r3.

nvidia-drivers get compiled with USE="uvm" which is needed for CUDA and OpenCL.

When I build with USE="-uvm" the problem with OpenCL disappears, but so does the OpenCL platform itself. So it is not a solution.

I have discovered that udev is doing this on any OpenCL discovery:

```
$ udevadm monitor
monitor will print the received events for:
UDEV - the event which udev sends out after rule processing
KERNEL - the kernel uevent

KERNEL[4284.510040] add      /kernel/slab/pma_address_batch (slab)
KERNEL[4284.510079] add      /kernel/slab/uvm_gpu_chunk_4 (slab)
KERNEL[4284.510097] add      /kernel/slab/uvm_gpu_chunk_5 (slab)
KERNEL[4284.510112] add      /kernel/slab/uvm_gpu_chunk_t (slab)
UDEV  [4284.511197] add      /kernel/slab/pma_address_batch (slab)
UDEV  [4284.511434] add      /kernel/slab/uvm_gpu_chunk_4 (slab)
UDEV  [4284.511600] add      /kernel/slab/uvm_gpu_chunk_5 (slab)
UDEV  [4284.511870] add      /kernel/slab/uvm_gpu_chunk_t (slab)
KERNEL[4290.252124] remove   /kernel/slab/pma_address_batch (slab)
KERNEL[4290.252145] remove   /kernel/slab/uvm_gpu_chunk_t (slab)
KERNEL[4290.252153] remove   /kernel/slab/uvm_gpu_chunk_4 (slab)
KERNEL[4290.252182] remove   /kernel/slab/uvm_gpu_chunk_5 (slab)
UDEV  [4290.253195] remove   /kernel/slab/pma_address_batch (slab)
UDEV  [4290.253262] remove   /kernel/slab/uvm_gpu_chunk_t (slab)
UDEV  [4290.253560] remove   /kernel/slab/uvm_gpu_chunk_4 (slab)
UDEV  [4290.253820] remove   /kernel/slab/uvm_gpu_chunk_5 (slab)
```
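The timestamps in this output already say how long the slab caches existed. The following is a rough sketch that pairs each `add` with its `remove` and prints the gap. The regular expression is written against the line format shown above, and the function name `lifetimes` is made up for illustration.

```python
import re

# Matches lines such as:
# KERNEL[4284.510040] add      /kernel/slab/uvm_gpu_chunk_4 (slab)
EVENT_RE = re.compile(
    r"^(KERNEL|UDEV)\s*\[(\d+\.\d+)\]\s+(\w+)\s+(\S+)\s+\((\w+)\)"
)


def lifetimes(monitor_output, source="KERNEL"):
    """Return {devpath: seconds between its 'add' and 'remove' events}."""
    added, result = {}, {}
    for line in monitor_output.splitlines():
        m = EVENT_RE.match(line.strip())
        if not m or m.group(1) != source:
            continue
        stamp, action, devpath = float(m.group(2)), m.group(3), m.group(4)
        if action == "add":
            added[devpath] = stamp
        elif action == "remove" and devpath in added:
            result[devpath] = stamp - added.pop(devpath)
    return result


# Four lines copied from the monitor output above.
sample = """\
KERNEL[4284.510040] add      /kernel/slab/pma_address_batch (slab)
KERNEL[4284.510079] add      /kernel/slab/uvm_gpu_chunk_4 (slab)
KERNEL[4290.252124] remove   /kernel/slab/pma_address_batch (slab)
KERNEL[4290.252182] remove   /kernel/slab/uvm_gpu_chunk_4 (slab)
"""

for path, seconds in lifetimes(sample).items():
    print(f"{path}: {seconds:.3f} s")
```

For the lines above this gives a gap of about 5.74 seconds per cache. Feeding it the full output captured on both machines would show whether the gap differs between the slow and the fast system.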

Now compare everything I said above with my other Gentoo machine, an Intel Xeon X5675 + NVidia 1050.

There, clinfo, Blender's Preferences / System dialog and Houdini's About dialog take a fraction of a second.

AMD Radeon ProRender takes 8 sec (versus 24-32 sec on the much faster machine) to show the first pixel, the render is interactive and fast, the mouse is not lagging, I can move windows, nothing is freezing.

And what I see in udevadm monitor while querying OpenCL looks quite similar.

It looks like UDEV and NVIDIA don't play well together on one of the machines, which makes OpenCL totally unusable there.

Any ideas what it could be and where to dig next? I hope there is a solution.

Thank you.

----------

## bluescream

I am currently in a similar situation (although I do not use or need CUDA, so the uvm USE flag is not set).

I am using:

- AMD FX-8350

- NVIDIA GTX 1070

- gentoo-sources-5.4.38

- nvidia-drivers-440.82-r3

After some system updates a few days ago, my system has trouble booting up properly.

Before all this, it used gentoo-sources-5.4.28 and all was working fine.

Now it does not matter which kernel I boot, I have the same issue regardless of the kernel version.

What I could find out so far:

If I leave it as it is, the system boots, but:

- system clock cannot be restored from hardware clock

- no keyboard input possible

- login via ssh works

- xdm does not start

- module nvidia has been loaded

- modprobe nvidia-drm does not work (it simply hangs, but can be cancelled with CTRL+C)

- shutdown / reboot hangs; sometimes at umounting /home ... other times at remounting / readonly

- one cpu core has 100% system load

If I compress the nvidia-modules in /lib/modules/5.4.38/video (e.g. with bzip2), the system boots up and:

- system clock can be set

- keyboard input works, local login is possible

- login via ssh works

- no nvidia driver is loaded

- xdm does not start

- after uncompressing the previously compressed nvidia modules:

- modprobe nvidia works, but nothing else happens

- modprobe nvidia-drm works, and xdm starts immediately

- no 100% system load on a single cpu core

- shutdown works properly

The reason behind this is currently still unknown to me.

Fixes and workarounds from previous posts and bug reports have not worked so far.

PS: I just tried your fifth point (edit /etc/conf.d/modules and add nvidia) and it works here as well (also with an error message, but at least the system starts correctly, with a working keyboard and xdm starting).

----------

## JohnBlbec

my workstation has the same symptoms and i have to work in terminal :-( i haven't found a solution since 02/2019 when i spent much money for a new pc (intel core i9 skylake x, 64 GB ram, microsemi adaptec smartraid 3154-16i single, zotac geforce gtx 1060 3 GB amp, ...).

it makes me crazy :-(

----------

## JohnBlbec

i've noticed that all bugs according to nvidia are assigned to David Seifert. Correct me if i'm wrong but I have never seen any comment from David there. There is something wrong with that process in gentoo. could anybody from gentoo team please have a look at it because the problem still persists, unfortunately. thank you.

----------

## Ionen

 *JohnBlbec wrote:*   

> i've noticed that all bugs according to nvidia are assigned to David Seifert. Correct me if i'm wrong but I have never seen any comment from David there. There is something wrong with that process in gentoo. could anybody from gentoo team please have a look at it because the problem still persists, unfortunately. thank you.

He's the new maintainer, and most of these bugs are from before he became maintainer. He's been cleaning things up slowly (notably removed most of the old cruft from nvidia drivers and did the latest bumps, which I helped test a bit for 390.xx given I still have an old box to test that). He also handles many of the sci-related cuda stuff, and that's likely why he picked up nvidia-drivers when the previous maintainer retired. He does comment now and then if he needs to know something, but he's not particularly chatty on bugzilla like many.

Personally haven't been able to reproduce any issues with udev and nvidia/uvm, but not that I did much research into this one (not saying the issue doesn't exist).

----------

## JohnBlbec

hi @ionen,

that's great to know you're alive and active working on those issues.

i can provide whatever you need for that issue, as i'm able to reproduce it in 99% of cases.

 *Ionen wrote:*   

> *JohnBlbec wrote:*
>
> > i've noticed that all bugs according to nvidia are assigned to David Seifert. Correct me if i'm wrong but I have never seen any comment from David there. There is something wrong with that process in gentoo. could anybody from gentoo team please have a look at it because the problem still persists, unfortunately. thank you.
>
> New maintainer and most of these bugs are from before became maintainer, been cleaning up things slowly (notably removed most of the old cruft from nvidia drivers and did the latest bumps which I helped test a bit for 390.xx given I still have a old box to test that). Also handles many of the sci-related cuda stuff, and that's likely why picked up nvidia-drivers when previous maintainer retired. Does comment now and then if need to know something, but not particularly chatty on bugzilla like many.
>
> Personally haven't been able to reproduce any issues with udev and nvidia/uvm, but not that I did much research into this one (not saying the issue doesn't exist).

 

----------

## Napalm Llama

I think I have this same issue.

- One udev process eating 100% of one CPU

- Can't kill that udev process

- Can't unload nvidia module (apparently because the above udev process is using it)

- Can't load nvidia-drm module (rmmod hangs, but can Ctrl-C)

- Can't remount filesystems readonly on shutdown (presumably because the udev process can't be killed)

I don't have the keyboard or RTC issues that bluescream encountered.  Blacklisting the nvidia module at the kernel command line seems to stop the problem.  Occasionally I boot and everything Just Works, for no apparent reason - but on next boot it'll be back to the above situation.

There's a recently closed bug that seems to match these problems.  Do we need to reopen it?  I can also help out if needs be :)

I started my own thread before finding this one.  Should I mark it as a dupe of this?

----------

## Ionen

The nvidia-drivers ebuild doesn't use udev rules anymore (and nvidia-udev.sh doesn't exist), so it should be a different issue. Either way, it's likely something the nvidia-drivers ebuild can't do anything about: the drivers don't interact with udev and instead use nvidia-modprobe to create devices and load modules, so they would also work on a system without udev at all.

...unless you're using ancient no-longer-in-tree nvidia-drivers or an ebuild from some random overlay

It's possible something else is conflicting though, not that I have any idea outright.

Edit: if all else fails and you're using eudev, I suggest trying sys-fs/udev or systemd. I don't really trust sys-fs/eudev to keep up with fixes that may already exist.

----------

## Napalm Llama

Thanks for the tip Ionen, I'll give regular udev a go and see if it helps.  This seems like a good opportunity to help squash a bug in eudev though.  Who's a good person to talk to about that?

----------

## hotstoast

Hey everyone,

Had this problem too for ~3 months. Only just found this thread from Napalm Llama's other thread. Unfortunately due to some family issues and work I have just had to abandon my machine where it's at. Don't know if anyone has found the fix as yet but I am happy to help resolve this where I can.

Swapping from sys-fs/eudev to sys-fs/udev seemed to solve the slow udev thing, but not the Xorg problem or the hang when remounting filesystems at shutdown.

Really appreciate everyone working on this! Let me know if anyone finds anything that resolves this!

----------

## Napalm Llama

Sorry for the delay - this problem is making my Gentoo install unusable, so I've been using that other OS... because despite its shortcomings it works.

I had another go yesterday, and everything just worked.  Happy days.  But reboot, and back to the same problem.  X doesn't start, one unkillable udev thread pegging a CPU core at 100%.  Can't shut down cleanly because the thread is that unkillable.  So as suggested, emerge -C eudev && emerge -1 sys-fs/udev.  Reboot.  Exactly the same problem - except the zombie thread is now a binary belonging to udev rather than eudev.  I've since swapped back to eudev.

So the problem is clearly related to udev - but either external to it, or shared between both variants.  One more thing.  I mentioned previously that the problem is intermittent, and only started happening after I moved my root partition from a hard disk drive to a fast nvme.  To me that screams race condition: something (nvidia-related?) is loading faster than the devs expected it to, and that leads to the behaviour observed.

Next step: can someone cleverer/more knowledgeable than me suggest a way to slow individual things down to try and identify the race condition?

----------

## Hu

Is udev in user mode or kernel mode?  Can you attach a debugger to it and collect a backtrace?  I expect not, if it is hung so hard that it cannot be killed.  Unkillable processes usually indicate a kernel mode bug, either the core kernel or nVidia.  Does your system work if you do not use the proprietary nVidia drivers?
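One way to approach the user mode versus kernel mode question without a debugger is to read the CPU accounting in `/proc/<pid>/stat`. This is a sketch only: the sample line is made up (not taken from anyone's machine), and the function names are hypothetical. A process spinning inside the kernel would be expected to show its kernel-mode time growing while the user-mode time stays flat.

```python
import os


def cpu_times(stat_text, ticks_per_second=100):
    """Parse one /proc/<pid>/stat line; return (state, user_s, kernel_s)."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split after the last ')' rather than on whitespace.
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    state = rest[0]           # field 3: R, S, D, Z, ...
    utime = int(rest[11])     # field 14: clock ticks spent in user mode
    stime = int(rest[12])     # field 15: clock ticks spent in kernel mode
    return state, utime / ticks_per_second, stime / ticks_per_second


def inspect(pid):
    """Read the live values for a PID (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return cpu_times(f.read(), os.sysconf("SC_CLK_TCK"))


# Made-up sample line: state R, 12 user ticks, 34567 kernel ticks.
sample = "862 (udevd) R 1 862 862 0 -1 4194560 500 0 0 0 12 34567 0 0 20 0 1 0 100 0 0"
print(cpu_times(sample))
```

Calling `inspect(pid)` twice a few seconds apart on the stuck udev process would show which of the two counters is increasing.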

----------

## hotstoast

I decided to remove the proprietary drivers to see what happens. After cleaning up the kernel and removing them completely, the problem goes away. So in the meantime I have been using Nouveau, which works fine without the issues, with a slight performance hit. So my system works fine when the proprietary drivers are gone.

So far that is as close to a solution as I have found.

----------

## Napalm Llama

From my original thread:

 *Quote:*   

> blacklisting the nvidia module from the kernel cmdline in grub seems to fix the problem (except still no X, of course)

 

It's an interaction between nvidia and udev.  I should add that when the problem happens, only the main nvidia module gets loaded - no nvidia-drm, etc.  Don't know if that detail is useful...

----------

## Napalm Llama

Sorry to double-post.

It seems to me that if blacklisting the nvidia module on the kernel command line stops udev from misbehaving (which it does) then if we can only load the nvidia module after udev has finished setting up, it might fix the issue - or would at least shed some more light on the situation.

The only way I've found to stop the nvidia module from getting loaded is via the kernel command line, which suggests that it's being pulled in by the kernel early on, rather than by userspace.  The trouble is if I blacklist it that way I can't seem to unblacklist it again to load it later.  Is there a way of doing that?
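The post does not say which kernel command line parameter was used. Two that exist are `modprobe.blacklist=` (interpreted by kmod in userspace) and `module_blacklist=` (interpreted by the kernel itself). The sketch below only reports which of them appears on a command line. The sample string, including its `root=` value, is made up for illustration, and the function name is hypothetical.

```python
def cmdline_blacklists(cmdline):
    """Return {parameter: [modules]} for blacklist parameters found."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in ("modprobe.blacklist", "module_blacklist"):
            # Both parameters take a comma-separated list of module names.
            found.setdefault(key, []).extend(m for m in value.split(",") if m)
    return found


# Hypothetical command line; on a live system read /proc/cmdline instead.
sample = "root=/dev/nvme0n1p3 modprobe.blacklist=nvidia,nvidia_drm quiet"
print(cmdline_blacklists(sample))
```

On a running system the same function can be applied to the contents of `/proc/cmdline` to confirm what the bootloader actually passed.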

I'm trying all I can think of here, but I could really use some guidance from someone who knows more about udev...

[edit]

Thought this might provide some insight.  lsof -p of the offending udev process.

```
COMMAND PID USER   FD      TYPE             DEVICE SIZE/OFF    NODE NAME
udevd   862 root  cwd       DIR              259,3     4096       2 /
udevd   862 root  rtd       DIR              259,3     4096       2 /
udevd   862 root  txt       REG              259,3   335664   11564 /sbin/udevd
udevd   862 root  mem       REG              259,3 41679776   15857 /lib/modules/5.12.12-gentoo-splig-2-ipmi/video/nvidia.ko
udevd   862 root  mem       REG              259,3   550960 3020476 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/gpu/drm/drm_kms_helper.ko
udevd   862 root  mem       REG              259,3   996984 3020475 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/gpu/drm/drm.ko
udevd   862 root  mem       REG              259,3    43208 3049082 /lib64/libnss_files-2.33.so
udevd   862 root  mem       REG              259,3     7424 3020621 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/video/fbdev/core/syscopyarea.ko
udevd   862 root  mem       REG              259,3    12552 3020622 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/video/fbdev/core/sysfillrect.ko
udevd   862 root  mem       REG              259,3     6952 3020623 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/video/fbdev/core/sysimgblt.ko
udevd   862 root  mem       REG              259,3     7328 3020620 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/video/fbdev/core/fb_sys_fops.ko
udevd   862 root  mem       REG              259,3    97416 3020540 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/media/cec/core/cec.ko
udevd   862 root  mem       REG              259,3    14440 3020477 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/gpu/drm/drm_panel_orientation_quirks.ko
udevd   862 root  mem       REG              259,3    81373 3020723 /lib/modules/5.12.12-gentoo-splig-2-ipmi/modules.symbols.bin
udevd   862 root  mem       REG              259,3  9135404 2228280 /etc/udev/hwdb.bin
udevd   862 root  mem       REG              259,3   148488 3049975 /lib64/libpthread-2.33.so
udevd   862 root  mem       REG              259,3   100592 4110002 /lib64/libz.so.1.2.11
udevd   862 root  mem       REG              259,3   158192 4109238 /lib64/liblzma.so.5.2.5
udevd   862 root  mem       REG              259,3  1814608 3052293 /lib64/libc-2.33.so
udevd   862 root  mem       REG              259,3    97000 1310830 /lib64/libkmod.so.2.3.6
udevd   862 root  mem       REG              259,3   332464    2140 /lib64/libblkid.so.1.1.0
udevd   862 root  mem       REG              259,3    33664 3020617 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/video/backlight/backlight.ko
udevd   862 root  mem       REG              259,3    30128 3020707 /lib/modules/5.12.12-gentoo-splig-2-ipmi/modules.builtin.bin
udevd   862 root  mem       REG              259,3    69409 3020786 /lib/modules/5.12.12-gentoo-splig-2-ipmi/modules.alias.bin
udevd   862 root  mem       REG              259,3    24717 3020780 /lib/modules/5.12.12-gentoo-splig-2-ipmi/modules.dep.bin
udevd   862 root  mem       REG              259,3   202144 3052443 /lib64/ld-2.33.so
udevd   862 root    0u      CHR                1,3      0t0       6 /dev/null
udevd   862 root    1u      CHR                1,3      0t0       6 /dev/null
udevd   862 root    2u      CHR                1,3      0t0       6 /dev/null
udevd   862 root    3w      CHR               1,11      0t0      12 /dev/kmsg
udevd   862 root    4u  a_inode               0,14        0   16144 [signalfd]
udevd   862 root    5u  a_inode               0,14        0   16144 [eventpoll:4,12]
udevd   862 root    6r      REG              259,3  9135404 2228280 /etc/udev/hwdb.bin
udevd   862 root    7r  a_inode               0,14        0   16144 inotify
udevd   862 root    8r      REG              259,3    33664 3020617 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/video/backlight/backlight.ko
udevd   862 root    9r      REG              259,3    14440 3020477 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/gpu/drm/drm_panel_orientation_quirks.ko
udevd   862 root   10u     unix 0x00000000fd40e3a0      0t0   25618 type=DGRAM
udevd   862 root   11r      REG              259,3   996984 3020475 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/gpu/drm/drm.ko
udevd   862 root   12u  netlink                         0t0   25638 KOBJECT_UEVENT
udevd   862 root   13r      REG              259,3    97416 3020540 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/media/cec/core/cec.ko
udevd   862 root   14r      REG              259,3     7328 3020620 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/video/fbdev/core/fb_sys_fops.ko
udevd   862 root   15r      REG              259,3     6952 3020623 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/video/fbdev/core/sysimgblt.ko
udevd   862 root   16r      REG              259,3    12552 3020622 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/video/fbdev/core/sysfillrect.ko
udevd   862 root   17r      REG              259,3     7424 3020621 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/video/fbdev/core/syscopyarea.ko
udevd   862 root   18r      REG              259,3   550960 3020476 /lib/modules/5.12.12-gentoo-splig-2-ipmi/kernel/drivers/gpu/drm/drm_kms_helper.ko
udevd   862 root   19r      REG              259,3 41679776   15857 /lib/modules/5.12.12-gentoo-splig-2-ipmi/video/nvidia.ko
```

----------

