# Method to test crashing system?

## RayDude

Final Update:

Core 0 has issues. I have found a way (via a user on Reddit) to disable CPU1 and CPU2 (both threads of core 0) for all tasks except hardware-related ones. This seems to have brought back stability.
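The post does not say which mechanism was used. As a rough illustration of the idea only, here is a minimal Python sketch that keeps one process (and anything it spawns afterwards) off the suspect threads using CPU affinity. It assumes the two threads are logical CPUs 0 and 1 in kernel numbering (htop shows them as 1 and 2); `allowed_cpus` and `pin_current_process` are made-up helper names. System-wide, the kernel's `isolcpus=` boot parameter does something similar.

```python
import os


def allowed_cpus(all_cpus, suspect):
    """Return the CPUs a task may still use once the suspect ones are excluded."""
    remaining = set(all_cpus) - set(suspect)
    if not remaining:
        raise ValueError("cannot exclude every CPU")
    return remaining


def pin_current_process(suspect):
    """Restrict the calling process (Linux only) to the non-suspect CPUs."""
    allowed = allowed_cpus(os.sched_getaffinity(0), suspect)
    os.sched_setaffinity(0, allowed)
    return allowed


# Example (assumption: logical CPUs 0 and 1 are the two threads of the bad core):
#   pin_current_process({0, 1})
```

Interrupt handlers and kernel threads are not affected by this, which matches the "except hardware-related ones" caveat above.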

*****************************************************************************************

I have a three year old server that is hanging every few days overnight.

It's done it three times in the last week or so.

The strange thing is: it's not totally dead. The screen freezes and I can't switch to console, but I can log in remotely. When I do, I can't kill any application that's hung which includes X, kde, plasma. I can't even get it to reboot.

I have to hard reset or power cycle it.

This feels like a hardware issue to me. Does anyone have any tricks to figuring out if it's hardware or if I somehow botched the software so badly that X is hanging beyond kill -9?

I was hoping to put off upgrading this system until next year... I'm going to start looking for motherboard and CPU deals...

Thanks in advance.

----------

## NeddySeagoon

RayDude,

Can you read logs or get logs off it?

dmesg would be good.

What does smartctl -x say about the HDD?

Boot into a few cycles of memtest86

A fail does not always mean a RAM fail.

Take out half the RAM. Does it still hang?

Now try with only the half of the RAM that was out.

Put all the RAM back ... what happens now?
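On the dmesg suggestion: since remote login still works during the hang, the log can be saved to a file before the hard reset and sifted afterwards. A minimal Python sketch of the sifting step; the pattern list is only an illustrative guess at what is worth looking for, and `suspicious_lines` is a made-up helper name.

```python
import re

# Strings that usually mark kernel trouble. Illustrative, not exhaustive.
TROUBLE = re.compile(
    r"Oops|BUG:|page allocation failure|segfault|Call Trace|Hardware Error"
)


def suspicious_lines(log_text):
    """Return the lines of a saved dmesg dump that match a trouble pattern."""
    return [line for line in log_text.splitlines() if TROUBLE.search(line)]


# Example, assuming the log was saved first with `dmesg > dmesg.txt`:
#   with open("dmesg.txt") as f:
#       for line in suspicious_lines(f.read()):
#           print(line)
```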

----------

## RayDude

Thanks Neddy!

 *NeddySeagoon wrote:*   

> RayDude,
> 
> Can you read logs or get logs off it?
> 
> dmesg would be good.
> ...

 

I did not think of this. So obvious. Next time it happens I will definitely check both.

 *NeddySeagoon wrote:*   

> What does smartctl -x say about the HDD?
> 
> 

 

I know the hard drives / ssd are okay. I keep tabs on them.

 *NeddySeagoon wrote:*   

> 
> 
> Boot into a few cycles of memtest86
> 
> A fail does not always mean a RAM fail.
> ...

 

I will consider this.

The first failures happened after I installed a new BIOS and attempted to run the memory faster. It worked great, until it didn't.

But then I slowed it all back down to slower than stock (this is a Ryzen 5 1600). I put the memory at 2133, and I've never left stock CPU frequency or voltage, and it is still happening.

It makes me wonder if the BIOS upgrade broke something. I'll check whether Gigabyte has released another BIOS to fix this one...

Thanks again. I really appreciate you taking the time to respond.

----------

## RayDude

Update: there was a new BIOS released last month. It contains the AGESA 1.0.0.6 update. I'm hoping that helps.

I'm leaving everything stock to see if it fails again.

I'm crossing my fingers...

----------

## RayDude

I updated the BIOS, which set everything back to BIOS defaults, and I left it there.

I did an emerge -DNuq @world yesterday and things went south again. Again, X windows died. Black screen, no activity.

But I was able to log in remotely and check the system. Here is the end of dmesg. Keep in mind that some of the nvidia driver messages come from me trying, and failing, to get X to restart.

```
[29781.707378] elogind-daemon[2003]: New session c17 of user man.

[29782.656892] elogind-daemon[2003]: Removed session c17.

[54302.920055] elogind-daemon[2003]: New session 6 of user XXXX.

[54311.100668] elogind-daemon[2003]: Removed session 6.

[54319.280085] elogind-daemon[2003]: New session 7 of user XXXX.

[54441.427526] TCP: request_sock_TCP: Possible SYN flooding on port 56190. Sending cookies.  Check SNMP counters.

[76583.096115] fuse: init (API version 7.31)

[84602.043020] elogind-daemon[2003]: New session 8 of user XXXX.

[86195.821445] udevd[805]: invalid key/value pair in file /lib/udev/rules.d/60-steam-input.rules on line 42, starting at character 82 ('u')

[86723.913499] elogind-daemon[2003]: Removed session 7.

[87027.491302] traps: ThreadPoolSingl[4325] trap int3 ip:563acaf0f594 sp:7fd05b513f50 error:0 in chrome (deleted)[563ac804b000+7bf1000]

[87028.245102] elogind-daemon[2003]: Removed session 3.

[87092.876276] elogind-daemon[2003]: New session 9 of user root.

[87094.733875] elogind-daemon[2003]: Removed session 9.

[87103.959788] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]

[87103.959886] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs

[87104.611948] elogind-daemon[2003]: New session 10 of user mythtv.

[87158.604771] elogind-daemon[2003]: Removed session 10.

[87171.388304] elogind-daemon[2003]: New session 11 of user root.

[87780.826265] [drm] [nvidia-drm] [GPU ID 0x00000800] Unloading driver

[87780.836313] nvidia-modeset: Unloading

[87780.845269] nvidia-nvlink: Unregistered the Nvlink Core, major device number 246

[87823.682060] nvidia-nvlink: Nvlink Core is being initialized, major device number 246

[87823.682484] nvidia 0000:08:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem

[87823.882322] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.66  Wed Aug 12 19:42:48 UTC 2020

[87824.142123] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]

[87824.142224] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs

[93754.509148] nvidia-nvlink: Unregistered the Nvlink Core, major device number 246
```

I ran /etc/init.d/xdm stop and it says sddm stopped, but X didn't.

I tried to remove the nvidia drivers and the system wouldn't do it, even with modprobe -r -f because "module in use".

ps -ef | grep plasma showed that plasma was still running. I killed it.

ps -ef | grep kde (I think) showed that something of kde was still running and I killed it.

Then I could unload the nvidia modules.

Then I tried to start sddm (xdm) again and nothing. The nvidia driver didn't even load. I loaded it by hand and drm didn't load, only the nvidia module loaded and some of the output in dmesg is what it had to say...

Nothing I did could get video to recover.

But CTRL-ALT-F1 did get me to a working console.

I'm starting to think the video card or motherboard is going bad. I wonder if there's dust caked on the video card. I didn't study it closely the last time I had the case open. I should probably check. I wonder if things are overheating. Ever since the thermal monitor for KDE stopped working I haven't been paying attention to the temps. I should set up a script and watch the next time I emerge @world...
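A minimal sketch of what such a temperature-watching script could look like in Python, assuming the sensors are exposed through the kernel's hwmon sysfs tree (`/sys/class/hwmon/hwmon*/temp*_input`, values in millidegrees Celsius). The helper names, log file name and 10 second interval are arbitrary choices for illustration.

```python
import glob
import os
import time


def millideg_to_c(raw):
    """Convert a hwmon reading such as '45500' (millidegrees) to degrees C."""
    return int(raw.strip()) / 1000.0


def read_temps(root="/sys/class/hwmon"):
    """Return {'hwmonN:chip/tempX_input': degrees_C} for every readable sensor."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "temp*_input"))):
        chip_dir = os.path.dirname(path)
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
            with open(path) as f:
                value = millideg_to_c(f.read())
        except (OSError, ValueError):
            continue  # sensor vanished or returned garbage; skip it
        key = f"{os.path.basename(chip_dir)}:{chip}/{os.path.basename(path)}"
        temps[key] = value
    return temps


def log_forever(interval=10, logfile="temps.log"):
    """Append a timestamped reading every `interval` seconds until interrupted."""
    while True:
        # Reopen on every write so each line is closed out before a possible hang.
        with open(logfile, "a") as out:
            out.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {read_temps()}\n")
        time.sleep(interval)
```

Running `log_forever()` in a remote shell during the emerge would leave a record of the last temperatures seen before any hang.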

I did something wacky in BIOS for the next experiment. I turned the PCIe ports down to PCIe Gen 1 to see if it makes a difference. I'll do an emerge @world next Friday and see what happens.

It's funny, I had a very similar problem in the system this system replaced a couple years ago and I'm pretty sure the video card died in very similar ways. I still have it, can't throw out a Geforce, I might need it in a pinch, but it's just cooking in the garage summer heat for the last several years.

Man I want this thing to survive until Zen 3 comes out and doesn't bust a wallet.

Thanks for listening. I'll keep posting status because it helps me organize my thoughts.

PS. I wonder if the syn flooding is a symptom of the crash...

Edit: I have a huge SHM for zoneminder on this machine. This feels like it might be a memory management issue that affects X and plasma. I've been thinking about doubling my RAM to 32 GB. DRAM prices are in freefall at the moment, should bottom out by the end of the year, maybe first quarter as manufacturers scale production back. But for now DRAM and SSDs are getting cheaper by the week. But I hesitate to buy new RAM when a new system might want DDR5. I'll have to check to see if Zen 3 supports DDR5... I suspect not...

----------

## RayDude

I don't know if anyone can help me, but I got an oops.

```

[408991.171083] Xorg: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0

[408991.171092] CPU: 0 PID: 3377 Comm: Xorg Tainted: P           O    T 5.8.8-gentoo #1

[408991.171094] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020

[408991.171095] Call Trace:

[408991.171104]  dump_stack+0x6d/0x90

[408991.171109]  warn_alloc.cold+0x74/0xdb

[408991.171113]  ? __alloc_pages_direct_compact+0x11d/0x140

[408991.171117]  __alloc_pages_slowpath.constprop.0+0xb53/0xb90

[408991.171121]  ? wake_up_q+0x90/0x90

[408991.171124]  ? prep_new_page+0xbd/0xc0

[408991.171127]  __alloc_pages_nodemask+0x210/0x240

[408991.171131]  kmalloc_order+0x1b/0x60

[408991.171148]  nvkms_alloc+0x1b/0xd0 [nvidia_modeset]

[408991.171168]  _nv002653kms+0x16/0x30 [nvidia_modeset]

[408991.171185]  ? _nv002759kms+0x66/0x1470 [nvidia_modeset]

[408991.171200]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]

[408991.171202]  ? __alloc_pages_nodemask+0x11b/0x240

[408991.171216]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]

[408991.171219]  ? kmalloc_order+0x57/0x60

[408991.171232]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]

[408991.171245]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]

[408991.171259]  ? nvkms_ioctl_common+0x36/0x160 [nvidia_modeset]

[408991.171273]  ? nvkms_ioctl_common+0x124/0x160 [nvidia_modeset]

[408991.171449]  ? nvidia_frontend_unlocked_ioctl+0x2f/0x40 [nvidia]

[408991.171452]  ? ksys_ioctl+0x82/0xc0

[408991.171454]  ? __x64_sys_ioctl+0x11/0x20

[408991.171457]  ? do_syscall_64+0x3e/0xb0

[408991.171460]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

[408991.171475] Mem-Info:

[408991.171482] active_anon:2641871 inactive_anon:301721 isolated_anon:0

                 active_file:492415 inactive_file:210014 isolated_file:0

                 unevictable:24 dirty:34 writeback:0

                 slab_reclaimable:142143 slab_unreclaimable:30998

                 mapped:1689062 shmem:1608888 pagetables:19906 bounce:0

                 free:176615 free_pcp:0 free_cma:0

[408991.171486] Node 0 active_anon:10567484kB inactive_anon:1206884kB active_file:1969660kB inactive_file:840056kB unevictable:96kB isolated(anon):0kB isolated(file):0kB mapped:6756248kB dirty:136kB writeback:0kB shmem:6435552kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no

[408991.171490] DMA free:15888kB min:64kB low:80kB high:96kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15972kB managed:15888kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB

[408991.171491] lowmem_reserve[]: 0 3468 15940 15940

[408991.171498] DMA32 free:611724kB min:14688kB low:18360kB high:22032kB reserved_highatomic:0KB active_anon:1428252kB inactive_anon:427880kB active_file:202688kB inactive_file:452276kB unevictable:0kB writepending:16kB present:3616964kB managed:3616964kB mlocked:0kB kernel_stack:4492kB pagetables:12700kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB

[408991.171498] lowmem_reserve[]: 0 0 12472 12472

[408991.171505] Normal free:78848kB min:52828kB low:66032kB high:79236kB reserved_highatomic:2048KB active_anon:9139232kB inactive_anon:779004kB active_file:1766972kB inactive_file:387780kB unevictable:96kB writepending:120kB present:13094400kB managed:12776556kB mlocked:96kB kernel_stack:13828kB pagetables:66924kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB

[408991.171505] lowmem_reserve[]: 0 0 0 0

[408991.171507] DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15888kB

[408991.171517] DMA32: 26229*4kB (UME) 18227*8kB (UME) 17805*16kB (UME) 2039*32kB (UME) 188*64kB (UME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 612892kB

[408991.171525] Normal: 5695*4kB (UMEH) 2019*8kB (UMEH) 2080*16kB (UMEH) 225*32kB (UMEH) 33*64kB (MEH) 3*128kB (H) 1*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 82164kB

[408991.171536] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB

[408991.171537] 2376843 total pagecache pages

[408991.171540] 65526 pages in swap cache

[408991.171541] Swap cache stats: add 2813577, delete 2748166, find 1460048/1950362

[408991.171542] Free swap  = 5640700kB

[408991.171542] Total swap = 8388604kB

[408991.171543] 4181834 pages RAM

[408991.171544] 0 pages HighMem/MovableOnly

[408991.171544] 79482 pages reserved

[408991.171553] BUG: unable to handle page fault for address: 0000000000007980

[408991.171557] #PF: supervisor read access in kernel mode

[408991.171559] #PF: error_code(0x0000) - not-present page

[408991.171561] PGD 0 P4D 0 

[408991.171564] Oops: 0000 [#1] PREEMPT SMP NOPTI

[408991.171568] CPU: 0 PID: 3377 Comm: Xorg Tainted: P           O    T 5.8.8-gentoo #1

[408991.171569] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020

[408991.171593] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]

[408991.171601] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f

[408991.171603] RSP: 0018:ffffb099811c3ce8 EFLAGS: 00010202

[408991.171606] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004

[408991.171608] RDX: ffff98aceb7e9348 RSI: 0000000000007980 RDI: ffff98ace71d1008

[408991.171610] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000

[408991.171611] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980

[408991.171613] R13: 0000000000007980 R14: ffff98ace71d1008 R15: 0000000000000001

[408991.171616] FS:  00007f9eaf52d8c0(0000) GS:ffff98ad0e800000(0000) knlGS:0000000000000000

[408991.171618] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[408991.171620] CR2: 0000000000007980 CR3: 00000003ef2be000 CR4: 00000000003406f0

[408991.171622] Call Trace:

[408991.171641]  ? _nv002759kms+0x3ca/0x1470 [nvidia_modeset]

[408991.171655]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]

[408991.171660]  ? __alloc_pages_nodemask+0x11b/0x240

[408991.171674]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]

[408991.171678]  ? kmalloc_order+0x57/0x60

[408991.171693]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]

[408991.171708]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]

[408991.171723]  ? nvkms_ioctl_common+0x36/0x160 [nvidia_modeset]

[408991.171738]  ? nvkms_ioctl_common+0x124/0x160 [nvidia_modeset]

[408991.171911]  ? nvidia_frontend_unlocked_ioctl+0x2f/0x40 [nvidia]

[408991.171915]  ? ksys_ioctl+0x82/0xc0

[408991.171918]  ? __x64_sys_ioctl+0x11/0x20

[408991.171921]  ? do_syscall_64+0x3e/0xb0

[408991.171925]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

[408991.171929] Modules linked in: ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter fuse nvidia_drm(PO) nvidia_modeset(PO) hid_logitech_hidpp nvidia(PO) input_leds hid_logitech_dj r8169 realtek libphy

[408991.171941] CR2: 0000000000007980

[408991.171944] ---[ end trace 816cbc84fb70ef20 ]---

[408991.171966] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]

[408991.171970] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f

[408991.171972] RSP: 0018:ffffb099811c3ce8 EFLAGS: 00010202

[408991.171974] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004

[408991.171975] RDX: ffff98aceb7e9348 RSI: 0000000000007980 RDI: ffff98ace71d1008

[408991.171977] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000

[408991.171978] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980

[408991.171980] R13: 0000000000007980 R14: ffff98ace71d1008 R15: 0000000000000001

[408991.171982] FS:  00007f9eaf52d8c0(0000) GS:ffff98ad0e800000(0000) knlGS:0000000000000000

[408991.171984] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[408991.171986] CR2: 0000000000007980 CR3: 00000003ef2be000 CR4: 00000000003406f0

[409016.172216] GpuWatchdog[4352]: segfault at 0 ip 000055e936015a02 sp 00007f004d0e2850 error 6 in chrome[55e93177b000+7bf3000]

[409016.172225] Code: 89 de e8 c1 8e 6f ff 80 7d c7 00 79 09 48 8b 7d b0 e8 42 e9 6b fe 41 8b 84 24 e0 00 00 00 89 45 b0 48 8d 7d b0 e8 ce df 9c fb <c7> 04 25 00 00 00 00 37 13 00 00 48 83 c4 48 5b 41 5c 41 5d 41 5e

```

Can someone understand this? I'll pore through it after I get the PC rebooted.

----------

## RayDude

I found a thread on the internet that implies that the first error message:

```
Xorg: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP),
```

Is caused by the nvidia driver crashing. If that's the case, then maybe an old driver would fix it, or perhaps the video card really is dying.
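For reference when reading that message: `order:5` means the kernel was asked for 2^5 physically contiguous pages, which is 128 kB with 4 kB pages. The per-zone free lists in the dump (the `N*SIZEkB` entries) show how many free blocks of each size existed at that moment. The Python sketch below parses such a line and counts blocks big enough for a given order. It treats blocks flagged only `(H)` as reserved and unavailable to an ordinary `GFP_KERNEL` request, which is my reading of the dump format rather than anything stated in the thread; the function names are made up.

```python
import re

PAGE_KB = 4  # base page size on x86-64


def order_to_kb(order):
    """Size in kB of one contiguous block of the given allocation order."""
    return PAGE_KB * (2 ** order)


def free_blocks(buddy_line):
    """Parse 'COUNT*SIZEkB (FLAGS)' entries into (size_kb, count, flags) tuples."""
    pattern = r"(\d+)\*(\d+)kB(?: \(([A-Z]+)\))?"
    return [
        (int(size), int(count), flags)
        for count, size, flags in re.findall(pattern, buddy_line)
    ]


def usable_blocks(buddy_line, order):
    """Count free blocks at least as large as `order`, ignoring (H)-only blocks."""
    need = order_to_kb(order)
    return sum(
        count
        for size, count, flags in free_blocks(buddy_line)
        if size >= need and flags != "H"
    )
```

Fed the `DMA32:` and `Normal:` lines from the dump above, `usable_blocks(line, 5)` comes out as 0 for both, even though roughly 700 MB was free in total (`free:176615` pages). That pattern suggests the allocation failed because free memory was fragmented, not because the machine was out of memory. The `DMA` zone does hold larger blocks, but it is tiny and normally held back by `lowmem_reserve`.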

This crash didn't even happen under stress. I had just woken up the display from blanking when this crash happened.

The fact that reloading the driver didn't fix the problem before makes me think this is a hardware failure...

Dangit.

----------

## RayDude

Update: While emerging world today I had multiple oops.

But in this case, the video did not crash.

Which makes me think it's not the video card after all...

Since it seems to be related to memory and I'm using a lot of SHM for zoneminder, I decided to turn the shm size down since I was only using about 52% of it. I adjusted it from 12 GB to 8GB.
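As a quick check on those numbers: 52% of 12 GB is about 6.2 GB, so the same contents will fill roughly 78% of an 8 GB limit. A small Python sketch of that arithmetic, plus a way to read the live figure, assuming the zoneminder SHM lives on the tmpfs mounted at `/dev/shm` (an assumption; the post does not give the mount point). The function names are made up.

```python
import shutil


def usage_percent(used, total):
    """Percentage of `total` taken up by `used` (same units for both)."""
    return 100.0 * used / total


def shm_usage(path="/dev/shm"):
    """Current fill level of the tmpfs at `path`, in percent."""
    usage = shutil.disk_usage(path)
    return usage_percent(usage.used, usage.total)


# 52% of the old 12 GB limit, expressed against the new 8 GB limit:
#   usage_percent(0.52 * 12, 8)
```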

I'll let this run for a while and see if things continue to crash.

This may be a memory issue. It could be the motherboard...

----------

## NeddySeagoon

RayDude,

Memtest86 may be your friend. You must boot into it.

----------

## molletts

The video hang may be completely unrelated to the oops - it sounds very much like the occasional hangs I've been having for some months on my #2 system (also AMD-based, but much older than yours - a Phenom II X6 1090T), which I described here. It has an Nvidia GTX260 with the 340.x drivers.

I've yet to experience one on my #1 system (AMD FX9590), which has a GTX460 with the 390.x drivers.

----------

## RayDude

Thanks guys. 

Neddy, I'll try memtest86 soon.

----------

## RayDude

Hi Neddy,

I ran memtest86+

It hangs at 35%. Or at least I think it does. The keyboard can't do anything. I tried pressing F1 at the beginning but it doesn't seem to make a difference.

The memory bandwidth is accurately reported. I tested 2133 MHz and 3200 MHz and the bandwidth went from 14 GB/s to 18 GB/s.

I'm not sure what it's supposed to look like when it runs so I'm going to install it on my laptop and observe it...

Edit: I tried it on two older intel laptops and it goes to a blank screen and doesn't run...

Am I doing something wrong?

----------

## s|mon

Hi RayDude,

memtest usually continues along with different patterns so it would not really end iirc. But if you would manage to pass a day or two chances would be good that it is not broken.

But it should not hang (it may take longer to progress with later cycles, but it should not hang), so that would be one more point hinting at something memory (or board/CPU) related.

You can find screenshots and a description on its website or Wikipedia to see what it should look like.

Did you try as Neddy suggested with one module at a time? Maybe only one has an issue. I'd start with a single one on default (safe settings no OC) and let it run for a day or at least over night.

----------

## NeddySeagoon

RayDude,

It should either report errors and attempt to continue or there should be visible signs of progress at the top of the screen.

It works by dividing RAM into sections and testing a section at a time.

It moves itself around in RAM too, so it can test the region it was once running from, rather like playing 'core wars' with itself.

Try the enhanced Fail Safe Mode option (press F1 at startup).

Blank screen at boot may be a BIOS/UEFI issue.

That it was running and stalled at 35% sounds like a problem.

Some tests take much longer than others but even on my Phenom II there are only a few seconds between screen updates.

----------

## RayDude

Thanks guys.

The machine under test is always in use. It's our home server. It serves files to the family, runs security cameras, records from live TV. It's quite busy.

I hesitate to start monkeying with it.

While I was trying to get memtest86+ to run, I changed a few BIOS settings. Then after I booted back into Gentoo, Chrome would crash if I scrolled down to the bottom of this thread.

Performing a BIOS set optimized defaults fixed that issue.

It had been stable for so many years. Sigh.

I'll try pulling a memory module out and try memtest86+ again.

I tried memtest86+ on my new work laptop and it crashes at a black screen as well.

I can't imagine what is wrong with that...

----------

## RayDude

My motherboard sucks.

I can't use a USB keyboard to control the test. It doesn't work. CTRL-ALT-DEL works, but nothing else.

I used a PS2 keyboard and realized a few things.

1. With one ram (tried in three slots) it hangs at 65%.

2. With two rams (in either set of slots) it hangs at 35%.

3. I ran the test on all cores and the test gets to pass two and crashes at 5% on CORE0. It did this twice.

This makes me think that CORE 0 is bad. Is there a way to disable core0? I'm googling that next. 10 threads would be plenty until Zen3 comes out.

I'm also going to find out what the warranty is on this processor. I doubt it's five years, but hey, maybe.

----------

## RayDude

Argh. Can't disable core0 in linux...

----------

## Tony0945

 *RayDude wrote:*   

> 1. With one ram (tried in three slots) it hangs at 65%.
> 
> 2. With two rams (in either set of slots) it hangs at 35%.

 

Usually one stick can only go in one slot and two sticks must go in a particular pair of slots. Check your motherboard manual. If you can't find it, they are usually available for download on the OEM's site.

----------

## RayDude

I tried all sorts of BIOS options.

I tried 4 cores with SMT off, two cores with SMT off.

I disabled advanced power control, cache, etc.

I turned the clock frequency down to 2 GHz, then 1.6 GHz.

No matter what I did, memtest86+ failed on core 0.

Core 0 is always active no matter how many cores you disable.

Assuming memtest86+ is good, the problem is core0.

What a bummer.

----------

## Tony0945

What socket? Good chance of getting a better CPU cheap.

----------

## NeddySeagoon

RayDude,

I think I've seen disabling Core 0 in very new kernels. 

```
 

  ┌──────────────────────── Debug CPU0 hotplug ────────────────────────┐

  │ CONFIG_DEBUG_HOTPLUG_CPU0:                                         │  

  │                                                                    │  

  │ Enabling this option offlines CPU0 (if CPU0 can be offlined) as    │  

  │ soon as possible and boots up userspace with CPU0 offlined. User   │  

  │ can online CPU0 back after boot time.                              │  

  │                                                                    │  

  │ To debug CPU0 hotplug, you need to enable CPU0 offline/online      │  

  │ feature by either turning on CONFIG_BOOTPARAM_HOTPLUG_CPU0 during  │  

  │ compilation or giving cpu0_hotplug kernel parameter at boot.       │  

```

From that I understand that you can bring up the box normally with CPU0 offline if the CPU supports it.

CPU0 is always used to start the other CPUs, so it's not useful to turn it off in the firmware.

-- edit --

Can you test the RAM in another system?

----------

## RayDude

 *Tony0945 wrote:*   

> What socket? Good chance of getting a better CPU cheap.

 

Socket AM4. I was hoping to hold off until Zen3 was out and a bit mature...

I received this Ryzen 5 1600 from AMD directly as the one I purchased was affected by the "linux" bug. I've contacted AMD tech support to see what kind of warranty it has...

*crosses fingers*

----------

## RayDude

 *NeddySeagoon wrote:*   

> RayDude,
> 
> I think I've seen disabling Core 0 in very new kernels. 

```
 

  ┌──────────────────────── Debug CPU0 hotplug ────────────────────────┐

  │ CONFIG_DEBUG_HOTPLUG_CPU0:                                         │  

  │                                                                    │  

  │ Enabling this option offlines CPU0 (if CPU0 can be offlined) as    │  

  │ soon as possible and boots up userspace with CPU0 offlined. User   │  

  │ can online CPU0 back after boot time.                              │  

  │                                                                    │  

  │ To debug CPU0 hotplug, you need to enable CPU0 offline/online      │  

  │ feature by either turning on CONFIG_BOOTPARAM_HOTPLUG_CPU0 during  │  

  │ compilation or giving cpu0_hotplug kernel parameter at boot.       │  

```

This is really cool. Let me see if I can figure out how to enable it. If nothing else, perhaps it will make the system stable for a while.

I enabled it in my kernel and will reboot momentarily. I'll post an update as soon as it's up to let you know if it was able to offline cpu0.

 *NeddySeagoon wrote:*   

> From that I understand that you can bring up the box normally with CPU0 offline if the CPU supports it.
> 
> CPU0 is always used to start the other CPUs, so it's not useful to turn it off in the firmware.
> 
> -- edit --
> ...

 

My son's gaming PC is a Ryzen 5 3600+ we got him last year. He's using it for school, but perhaps tonight I'll be able to swap the RAMs and see how they do.

They are both 3600 MHz memory, although mine is Samsung B-Die, so it should perform better in his system than his own does.

If memtest86+ works on my system with his ram then we have at least narrowed it down.

Thanks again for the great ideas!

----------

## RayDude

Ah well. It looks like CPU0 can't be off-lined.

It is interesting to note that when I offline CPU1 (second thread of CPU0, I think), htop still shows activity on that CPU. So I wonder if offlining really works at all.

[update: htop is accurate, one core has completely disappeared. I saw the top CPU was 11 and forgot that htop numbers them from 1 to 12]
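For anyone trying the same thing, the kernel exposes both the thread pairing and the on/off switch under `/sys/devices/system/cpu`, where CPUs are numbered from 0 (unlike htop's 1-based display). A minimal Python sketch, with made-up helper names, that reads which logical CPUs share a core and flips one offline; writing the `online` file needs root, and `cpu0` typically has no `online` file at all unless the kernel was built or booted with CPU0 hotplug support.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-1' or '0,6' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def siblings(cpu, root=CPU_ROOT):
    """Logical CPUs that share a physical core with `cpu` (including itself)."""
    path = root / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    return parse_cpu_list(path.read_text())


def set_online(cpu, online, root=CPU_ROOT):
    """Bring a logical CPU online (True) or take it offline (False). Needs root."""
    (root / f"cpu{cpu}" / "online").write_text("1" if online else "0")


# Example: find out which logical CPU is really the second thread of core 0
#   print(siblings(0))
```

Checking `siblings(0)` first removes the guesswork about whether CPU1 really is the second thread of the first core; the numbering varies between systems.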

I tried the (c)onfiguration options of memtest86+ to see if there was a way to ignore cpu0, but there wasn't. I ran another test and this time a few seconds into the test the machine rebooted.

I'll try swapping my son's memory with mine this afternoon. Once he's out of school.

Oh, and if the issue is third level CCX cache, then all the CPUs on the first CCX will have problems. And it seems like the system is using all 4 of the CPUs on that CCX, so I'd need to disable them all.

If however the issue is with primary or secondary cache, then disabling a single CPU (and its second thread) might solve the problem...

----------

## NeddySeagoon

RayDude,

You will lose your mind :)

At least you have a testbed, that makes life easier.

Here's a nasty thought, or a series of related nasty thoughts ...

The low voltages (below 3.3v) required to operate the RAM and CPU are derived from the Auxiliary 12v connector to the motherboard.

It's 4, 6 or 8 pins, whatever, often not enough pins. As a result, it gets hot and goes high resistance. It's always the sockets on the cable that suffer, as the pins on the motherboard are soldered to 2oz copper power planes.

All is well until there is a big gulp of power and one or more low voltage supplies goes out of tolerance.

A lot of the 12v is dropped across the connector ...

A worked example may help. To keep the arithmetic simple, let's say that the CPU needs 120W flat out.

That's 10A through that connector. That's OK when it's nice and shiny.

Now suppose that the contact resistance increases from very little to 0.1 Ohm. That's still not a lot but it costs 1v loss at the connector.

10A and 1v is 10W dissipated in the connector ...  OK, so the connector fails well before the contact resistance gets to 0.1 Ohm.  :)

Long story short. I've had a few instances like this, it's worth 'wiping' the contacts by unplugging and replugging the connector two or three times.

Make sure the gold is still there while it's apart.

'Wiping' the pins will fix it for 6 to 9 months.
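The connector arithmetic above is just Ohm's law, and it is easy to rerun for other loads or resistances. A small Python sketch; the function name is made up, and it keeps the same simplification as the worked example, namely that the current stays fixed at load power divided by 12 V.

```python
def connector_loss(load_watts, rail_volts, contact_ohms):
    """Return (current in A, voltage drop in V, heat in W) at a resistive contact."""
    current = load_watts / rail_volts      # I = P / V
    drop = current * contact_ohms          # V = I * R
    heat = current ** 2 * contact_ohms     # P = I^2 * R, dissipated in the contact
    return current, drop, heat


# The worked example: 120 W load on the 12 V rail through a 0.1 Ohm contact
#   current, drop, heat = connector_loss(120, 12, 0.1)
```

For the 120 W, 0.1 Ohm case this gives 10 A, a 1 V drop, and 10 W of heat in the connector itself, which is a lot for a few small pins.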

Further down in the converter, the input and output capacitors get a very hard life. They used to be aluminium electrolytics and failures were easy to spot.

The tops would dome, the rubber bungs would push out of the bottom and allow the electrolyte to leak out.

I don't know what you have on your motherboard, but a failure of one or more of these has the same effect.

Here's a test. The idea is to reduce the load on the 12v Aux.

In the BIOS, turn off as many real CPU cores as you dare, then run memtest86. 

If this works, it points the finger at the motherboard dynamic voltage regulation, but the main PSU providing the 12v is not in the clear yet either.

----------

## Tony0945

 *RayDude wrote:*   

> 
> 
> Socket AM4. I was hoping to hold off until Zen3 was out and a bit mature...
> 
> I received this Ryzen 5 1600 from AMD directly as the one I purchased was affected by the "linux" bug. I've contacted AMD tech support to see what kind of warranty it has...
> ...

 

I used an Athlon X4 950 for a year to bypass those early Ryzen problems. I see them cheap on internet. They are Bulldozer based so you would have to reinstall gcc, binutils, libtool and glibc from a stage3 and rebuild the system. A quick search shows even dual core Ryzen 3's going for ridiculous prices.  I do see used 2700X's from $100 to $200 and new ones at $200.  I have a 2700X and I like it a lot.

What mobo?   Also, these boards are notoriously fussy about memory.  Good luck with the warranty but my guess is that first generation Ryzen problems are not confined to a narrow range of production dates.

----------

## RayDude

 *Tony0945 wrote:*   

>  *RayDude wrote:*   
> 
> Socket AM4. I was hoping to hold off until Zen3 was out and a bit mature...
> 
> I received this Ryzen 5 1600 from AMD directly as the one I purchased was affected by the "linux" bug. I've contacted AMD tech support to see what kind of warranty it has...
> ...

 

I have a Gigabyte AB350M-D3H. I got really lucky during that first year. Gigabyte released a BIOS that overvolted the parts and actually damaged people's CPUs. I'm so glad I missed that update.

The reason I don't think it's bad DRAM is simply that it fails at 3200 MHz DDR4 in exactly the same way it fails at 2133 MHz. You would expect if it were flaky ram, the failure would happen less at slower speeds.

I've been watching the prices on the 2700 and 2700X, but I'd rather not dump more than $100.00 on an older processor when I could apply it to a newer one.

Thanks again.

----------

## RayDude

 *NeddySeagoon wrote:*   

> RayDude,
> 
> You will lose your mind 
> 
> At least you have a testbed, that makes life easier.
> ...

 

Cool! And thanks so much!

I'm an EE and I totally get what you are saying. I had thought about power supply issues, but I hadn't thought about the 12V connector oxidizing. I'll check it out.

I have already tested memory with only two cores active / no hyperthreading. It hangs just as fast as it did with all 6 cores (12 threads) active.

I've already informed my son that I'm borrowing his ram for a quick test. He's not happy, but hopefully I'll do the test today.

I predict his memory will fail just as fast as mine did and show that it's not memory.

The only way to prove it's the motherboard is to buy a new CPU... I hope to hold that off.

----------

## RayDude

Update: My son's DDR4 fails memtest86+ in exactly the same way as my DDR4.

I left the memory swapped to see if it changes system stability.

I'm still disabling core1 for the heck of it.

I'll do emerge -DNuq @world later this week to see if it crashes again.

I'm sure it will.

----------

## Tony0945

I've heard Gigabyte mobo's are not the best for Ryzen. I have an MSI Tomahawk Arctic and everyone praises ASUS.

So. Motherboard (especially with BIOS update), CPU or memory. If the memory works in another machine, it should be good. Another Zen machine that is. I have read that memory that works in Intel machines may not work in Ryzen machines. I bought my memory direct from Crucial, guaranteed by them to work in my particular motherboard.  The machine I'm writing this on has a Gigabyte motherboard and a k10, Phenom II CPU. Different generation.

----------

## RayDude

 *Tony0945 wrote:*   

> I've heard Gigabyte mobo's are not the best for Ryzen. I have an MSI Tomahawk Arctic and everyone praises ASUS.
> 
> So. Motherboard (especially with BIOS update), CPU or memory. If the memory works in another machine, it should be good. Another Zen machine that is. I have read that memory that works in Intel machines may not work in Ryzen machines. I bought my memory direct from Crucial, guaranteed by them to work in my particular motherboard.  The machine I'm writing this on has a Gigabyte motherboard and a k10, Phenom II CPU. Different generation.

 

Thanks.

I won't buy gigabyte again.

We got an MSI for my son. It's been great. He's running 3600 MHz memory, no problems.

I'll have to do research again for my new board. I'm still trying to understand how the B550 boards are more expensive than the X570s...

----------

## Tony0945

 *RayDude wrote:*   

> I'll have to do research again for my new board. I'm still trying to understand how the B550 boards are more expensive than the X570s...

 

I bought an Asus (mainly on availability) and a 3900X for a new build but haven't assembled it yet. It's an X570. NeddySeagoon convinced me that I might need the extra lanes someday. It's a new expensive build. I'm hoping it will last a dozen years like this one. And this one may live on as a Gentoo router yet.

----------

## NeddySeagoon

RayDude,

Can you try your CPU in your son's system?

or even his CPU in your system?

Don't even think of a temporary swap without doing the CPU repaste job properly.

If turning off cores hasn't helped, it's unlikely to be PSU.

It may be the DRAM controller, which is a corner of the CPU these days.

Your son must still be at an age where you can tell him these things.

You will soon need to negotiate :)

----------

## Tony0945

 *NeddySeagoon wrote:*   

> You will soon need to negotiate 

 

OH YES!!!

----------

## RayDude

I tested my son's memory. In fact, I'm running his memory at the moment. Mine is labeled CAS17 but I think it runs at CAS16; his is labeled CAS18 but runs at CAS17 according to the XMP profile.

His computer is running with my memory, no issues (although it's Windows so, yeah, not so tough).

I can't see running my CPU in his PC as being an option, that's too much of a tear down of both machines. I don't want to make things worse. His is working, I'm going to leave it that way.

From what I can determine, X570 motherboards don't support Zen 1. They all specifically mention 2XXX and 3XXX, but not 1XXX Ryzen processors. That sucks.

That means that if I buy a new X570 MOBO to replace the Gigabyte, then I have to get a new processor to boot. That's what I'm trying to avoid.

So that leaves B450 as the only option and that's not good because it won't support 4XXX or 5XXX new features, although they will -- theoretically -- work.

I have a case opened with AMD, once I hear from them I'll have to figure out what to do.

----------

## RayDude

Oh. Have you guys done emerge -j n (where n > 3) while MAKEOPTS has "-j m" where m is number of threads?

I did that on my server and the build crashed. I seem to remember things going downhill quickly after that. I wonder if that isn't what toasted it... I know it's impossible for software to kill hardware (except in the case of the 6502 'halt and catch fire' instruction), but still it makes me wonder.
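To put a number on that combination: emerge's job count and the `-j` in MAKEOPTS multiply, since each package being built in parallel can start its own set of make jobs. A minimal sketch of the worst case, assuming every package actually uses all of its make jobs (the helper name `worst_case_jobs` is made up for illustration):

```shell
#!/bin/bash
# Worst-case number of simultaneous build processes when
# emerge --jobs=N is combined with MAKEOPTS="-jM".
# worst_case_jobs is a hypothetical helper name, not part of Portage.
worst_case_jobs() {
    local emerge_jobs=$1 make_jobs=$2
    echo $(( emerge_jobs * make_jobs ))
}

# Example: 4 packages in parallel, each allowed 12 make jobs.
worst_case_jobs 4 12
```

Both tools also accept a load-average cap (`emerge --load-average=` and `make -l`), which is one way to keep the product from running away on a small machine.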

----------

## NeddySeagoon

RayDude,

I run MAKEOPTS="-j100" emerge --jobs=6 ... on a 96 core arm64 system.

It does get a bit sluggish when firefox, thunderbird, libreoffice and chromium decide to build concurrently but it does not crash or lock up.

It's just unlucky when that happens. :)

----------

## RayDude

 *NeddySeagoon wrote:*   

> RayDude,
> 
> I run MAKEOPTS="-j100" emerge --jobs=6 ... on a 96 core arm64 system.
> 
> It does get a bit sluggish when firefox, thunderbird, libreoffice and chromium decide to build concurrently but it does not crash or lock up.  
> ...

 

Thanks. I think it failed on my new work laptop. Gosh I hope not. I'll try that next.

By the way I heard back from AMD and the Warranty is 3 years. They sent me that replacement CPU a couple months + three years ago.

But the tech support guy confirmed that CPU core 0 dying on the Address test of memtest86+ is a CPU problem and suggested I submit it for replacement.

I hope they'll throw me a bone.

I requested the RMA last night, I'll let you guys know what they say.

I'll update the thread

----------

## NeddySeagoon

RayDude,

If it's a known systematic failure caused by AMD, like the packaging issue on very early Ryzen, they will probably give you a new CPU.

If it's a random hardware failure, they probably won't.

Good luck.

----------

## RayDude

AMD says it's out of warranty.

I'm out of luck.

Now what do I do? Limp along until Zen 3 ships and hope for a cheap Zen 2? Or get a used Zen+?

Ugh.

----------

## NeddySeagoon

RayDude,

Don't pin your hopes on Zen3 ... Remember Zen1 ... don't buy version 1 of anything.

As you have not done a motherboard or CPU swap, you don't know which it is.

That's really the next step.

Very long shot. A long time ago the kernel had a badram command line option. Maybe it was a patch.

The idea was to prevent the kernel allocating the badram, so everything worked as expected.

It won't matter to the kernel if the RAM is bad, or the RAM controller in the CPU has problems with some addresses.

You can also play with maxram= kernel command line option, to see if you can find a memory size that always works.

The idea remains to avoid triggering the problem.

One more thing. 

Remove your CPU from the motherboard and reseat it. It just might be a CPU pin to socket contact gone high resistance.

The address is applied to DRAMs in two pieces, called the column and row addresses. The DRAM and memory controller therefore have a property called 'geometry' which limits the size of DRAM that can be addressed.

The high part of the address is the column and the low part the row. The exact split is determined by the DRAM geometry.

If the row part was in a mess, almost nothing would work.

If the column address has a stuck bit (just one) then effects vary, from mapping two addresses to the same physical address, or mapping a real physical address to empty space.

That would generate a bus error, when the non existent RAM failed to respond.
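As a worked illustration of the stuck-bit case, here is a minimal sketch that models one address line stuck at 0. The helper name `seen_by_dram` and the example addresses are invented for the example; real controllers also interleave and scramble address bits, which this ignores.

```shell
#!/bin/bash
# Model an address line stuck at 0: whatever address is requested,
# the DRAM sees that address with the stuck bit cleared.
# seen_by_dram is a hypothetical helper for illustration only.
seen_by_dram() {
    local addr=$(( $1 )) stuck_bit=$2
    printf '0x%x\n' $(( addr & ~(1 << stuck_bit) ))
}

# With bit 4 stuck at 0, these two requests land on the same cell.
seen_by_dram 0x1234 4
seen_by_dram 0x1224 4
```

Both calls print the same address, which is the "two addresses map to one physical location" effect: a write to one silently overwrites the other.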

The more I write, the more I think it's CPU related as the row and column addresses travel over the same PCB tracks on the motherboard.

The upshot of this is that you can fit all your RAM and do a binary search with maxram=

It's not maxram, it's mem=
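A rough sketch of how the binary search could be stepped, assuming sizes in MiB and that the fault sits somewhere above the limit being tested. The helper name `next_mem_limit` is made up; it only prints the value to put on the kernel command line for the next boot.

```shell
#!/bin/bash
# One step of a binary search over the mem= limit, in MiB.
# lo = largest size known to be stable, hi = smallest size known to crash.
# next_mem_limit is a hypothetical helper, not an existing tool.
next_mem_limit() {
    local lo=$1 hi=$2
    echo "mem=$(( (lo + hi) / 2 ))M"
}

# Example for a 16 GiB machine that crashes with all RAM in use.
next_mem_limit 0 16384      # first boot to try
next_mem_limit 8192 16384   # next step if 8192 MiB was stable
next_mem_limit 0 8192       # next step if 8192 MiB crashed
```

Since the hang only shows up every few days, each step needs a long soak before it can be called stable, so the search is slow in practice.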

----------

## RayDude

Thanks Neddy!

I posted a lament to reddit.com/r/pcmasterrace and received a suggestion from a Linux guy...

I added this to the grub boot cmdline:

isolcpus=0,1

This keeps logical CPUs 0 and 1 (the two threads of core 0) from receiving any programs.

It is strange though. Core0 still has kernel functions running on it periodically, but it only gets to 0.7% occupancy in htop, even when I did an emerge libreoffice.
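For anyone trying the same thing, a small sketch for checking which logical CPUs belong to core 0 and what the kernel currently treats as isolated. The two sysfs paths are standard on Linux; `show_file` is a made-up helper that just tolerates a missing file. Sibling numbering is not always 0,1 (some systems pair 0 with a much higher number), so it is worth checking before picking the isolcpus list.

```shell
#!/bin/bash
# Report the SMT siblings of cpu0 and the kernel's isolated CPU list.
# show_file is a hypothetical helper that tolerates a missing file.
show_file() {
    local label=$1 path=$2
    if [ -r "$path" ]; then
        echo "$label: $(cat "$path")"
    else
        echo "$label: (not available)"
    fi
}

show_file "cpu0 siblings" /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
show_file "isolated CPUs" /sys/devices/system/cpu/isolated
```

The small residual load matches what isolcpus is documented to do: it takes the CPUs out of general scheduling, but per-CPU kernel threads and interrupts can still run there.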

This does appear to have cut down the amount of activity on core0. I'm hoping it will keep the system stable enough to wait until Zen3 drops, and make Zen+ or Zen2 cheaper so I can get 8 cores for $150.00 or less.

If I still get a crash, then I'll see if I can implement your suggestion. Figuring out where the bad ram is will take a bit of work. Hopefully dmesg will provide the addresses that fail and I'll be able to build a list of what is bad. If enough time passes I might be able to figure out which address bit / data bit is failing.

I'm really hoping it is a core0 issue though. That would be the best since this work around might mitigate that.

----------

## RayDude

I'm running emerge -DNuvq @world and this is what htop looks like:

https://i.imgur.com/wIQrpCU.png

It seems to be stable. I'll know as soon as it's done building the 110 packages.

----------

## NeddySeagoon

RayDude,

That will discriminate between a core 0 and RAM problem.

----------

## RayDude

I have come to the conclusion that core0 is used to talk to the hardware and that's why core0 needs to be active all the time. If that's true, then it explains why video was affected during the core0 crash.

I built llvm, qtwebengine, gimp and I'm building wine at the moment. It seems much more stable. Typically it would only get about half way through before crashing.

Although I have killed some hardware related background tasks, I am also building in an shm partition which has got to hit ram really hard.

I think this is a core0 problem... And that means, I'm safe for a while.

I wonder if the "disease" will travel from core0 to other cores.

Thanks again for your help!

----------

## RayDude

Update: it finished building without a problem. Looks like I'm set.

----------

## RayDude

It died. Video hang again...

It looks like a different CPU is on my purchase list... Bummer...

```
[115504.102790] usb 3-3: new full-speed USB device number 5 using xhci_hcd

[115504.237097] usb 3-3: New USB device found, idVendor=0a5c, idProduct=21e8, bcdDevice= 1.12

[115504.237101] usb 3-3: New USB device strings: Mfr=1, Product=2, SerialNumber=3

[115504.237103] usb 3-3: Product: BCM20702A0

[115504.237105] usb 3-3: Manufacturer: Broadcom Corp

[115504.237107] usb 3-3: SerialNumber: 001986002CBC

[115504.362033] Bluetooth: hci0: BCM: chip id 63

[115504.363032] Bluetooth: hci0: BCM: features 0x07

[115504.379042] Bluetooth: hci0: BCM20702A

[115504.379048] Bluetooth: hci0: BCM20702A1 (001.002.014) build 0000

[115504.380992] Bluetooth: hci0: BCM: firmware Patch file not found, tried:

[115504.380994] Bluetooth: hci0: BCM: 'brcm/BCM20702A1-0a5c-21e8.hcd'

[115504.380995] Bluetooth: hci0: BCM: 'brcm/BCM-0a5c-21e8.hcd'

[130116.593009] elogind-daemon[1990]: New session c18 of user man.

[130129.705431] elogind-daemon[1990]: Removed session c18.

[148094.899392] usb 3-1: USB disconnect, device number 2

[182577.681757] Xorg: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0

[182577.681764] CPU: 3 PID: 3483 Comm: Xorg Tainted: P           O    T 5.8.12-gentoo #1

[182577.681765] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020

[182577.681766] Call Trace:

[182577.681773]  dump_stack+0x6d/0x90

[182577.681776]  warn_alloc.cold+0x74/0xd8

[182577.681779]  ? __alloc_pages_direct_compact+0x10f/0x130

[182577.681781]  __alloc_pages_slowpath.constprop.0+0xb69/0xba0

[182577.681783]  ? prep_new_page+0xbb/0xc0

[182577.681927]  ? _nv037032rm+0x26e/0x370 [nvidia]

[182577.681929]  __alloc_pages_nodemask+0x214/0x240

[182577.681932]  kmalloc_order+0x18/0x60

[182577.681943]  nvkms_alloc+0x1b/0xd0 [nvidia_modeset]

[182577.681957]  _nv002653kms+0x16/0x30 [nvidia_modeset]

[182577.681969]  ? _nv002759kms+0x66/0x1470 [nvidia_modeset]

[182577.682159]  ? _nv033594rm+0x40/0x40 [nvidia]

[182577.682351]  ? _nv000586rm+0xa08/0xde0 [nvidia]

[182577.682360]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]

[182577.682362]  ? __alloc_pages_nodemask+0x11f/0x240

[182577.682372]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]

[182577.682373]  ? kmalloc_order+0x54/0x60

[182577.682382]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]

[182577.682392]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]

[182577.682401]  ? nvkms_ioctl_common+0x36/0x160 [nvidia_modeset]

[182577.682410]  ? nvkms_ioctl_common+0x124/0x160 [nvidia_modeset]

[182577.682529]  ? nvidia_frontend_unlocked_ioctl+0x31/0x40 [nvidia]

[182577.682531]  ? ksys_ioctl+0x82/0xc0

[182577.682532]  ? __x64_sys_ioctl+0x11/0x20

[182577.682534]  ? do_syscall_64+0x3e/0xb0

[182577.682537]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

[182577.682547] Mem-Info:

[182577.682552] active_anon:2760986 inactive_anon:294392 isolated_anon:0

                 active_file:424716 inactive_file:237483 isolated_file:0

                 unevictable:24 dirty:222 writeback:0

                 slab_reclaimable:141336 slab_unreclaimable:32381

                 mapped:1741344 shmem:1626059 pagetables:20730 bounce:0

                 free:109346 free_pcp:344 free_cma:0

[182577.682555] Node 0 active_anon:11043944kB inactive_anon:1177568kB active_file:1698864kB inactive_file:949932kB unevictable:96kB isolated(anon):0kB isolated(file):0kB mapped:6965376kB dirty:888kB writeback:0kB shmem:6504236kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no

[182577.682558] DMA free:15888kB min:64kB low:80kB high:96kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15972kB managed:15888kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB

[182577.682559] lowmem_reserve[]: 0 3468 15940 15940

[182577.682563] DMA32 free:353412kB min:14688kB low:18360kB high:22032kB reserved_highatomic:0KB active_anon:1308636kB inactive_anon:575740kB active_file:297760kB inactive_file:545704kB unevictable:0kB writepending:360kB present:3616964kB managed:3616964kB mlocked:0kB kernel_stack:5440kB pagetables:15372kB bounce:0kB free_pcp:52kB local_pcp:52kB free_cma:0kB

[182577.682564] lowmem_reserve[]: 0 0 12472 12472

[182577.682568] Normal free:68084kB min:52828kB low:66032kB high:79236kB reserved_highatomic:2048KB active_anon:9735308kB inactive_anon:601828kB active_file:1401372kB inactive_file:404140kB unevictable:96kB writepending:528kB present:13094400kB managed:12776572kB mlocked:96kB kernel_stack:14768kB pagetables:67548kB bounce:0kB free_pcp:1324kB local_pcp:204kB free_cma:0kB

[182577.682568] lowmem_reserve[]: 0 0 0 0

[182577.682570] DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15888kB

[182577.682577] DMA32: 12856*4kB (UME) 9283*8kB (UME) 10180*16kB (UME) 1797*32kB (UME) 132*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 354520kB

[182577.682585] Normal: 1003*4kB (UMEH) 1445*8kB (UMEH) 2458*16kB (UMEH) 337*32kB (UMEH) 32*64kB (UMEH) 3*128kB (H) 1*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 68372kB

[182577.682592] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB

[182577.682592] 2324105 total pagecache pages

[182577.682595] 35776 pages in swap cache

[182577.682596] Swap cache stats: add 1432955, delete 1397288, find 349174/573369

[182577.682597] Free swap  = 6231548kB

[182577.682597] Total swap = 8388604kB

[182577.682598] 4181834 pages RAM

[182577.682598] 0 pages HighMem/MovableOnly

[182577.682599] 79478 pages reserved

[182577.682605] BUG: unable to handle page fault for address: 0000000000007980

[182577.682608] #PF: supervisor read access in kernel mode

[182577.682609] #PF: error_code(0x0000) - not-present page

[182577.682610] PGD 0 P4D 0 

[182577.682613] Oops: 0000 [#1] PREEMPT SMP NOPTI

[182577.682615] CPU: 3 PID: 3483 Comm: Xorg Tainted: P           O    T 5.8.12-gentoo #1

[182577.682616] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020

[182577.682632] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]

[182577.682635] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f

[182577.682636] RSP: 0018:ffffaf0f81a63ce8 EFLAGS: 00010202

[182577.682639] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004

[182577.682640] RDX: ffff9adc238fe348 RSI: 0000000000007980 RDI: ffff9adc238f9008

[182577.682641] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000

[182577.682643] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980

[182577.682644] R13: 0000000000007980 R14: ffff9adc238f9008 R15: 0000000000000001

[182577.682646] FS:  00007f7a5de388c0(0000) GS:ffff9adc4e980000(0000) knlGS:0000000000000000

[182577.682648] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[182577.682649] CR2: 0000000000007980 CR3: 00000003e6498000 CR4: 00000000003406e0

[182577.682651] Call Trace:

[182577.682664]  ? _nv002759kms+0x3ca/0x1470 [nvidia_modeset]

[182577.682856]  ? _nv000586rm+0x970/0xde0 [nvidia]

[182577.682867]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]

[182577.682870]  ? __alloc_pages_nodemask+0x11f/0x240

[182577.682880]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]

[182577.682883]  ? kmalloc_order+0x54/0x60

[182577.682893]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]

[182577.682902]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]

[182577.682912]  ? nvkms_ioctl_common+0x36/0x160 [nvidia_modeset]

[182577.682922]  ? nvkms_ioctl_common+0x124/0x160 [nvidia_modeset]

[182577.683041]  ? nvidia_frontend_unlocked_ioctl+0x31/0x40 [nvidia]

[182577.683044]  ? ksys_ioctl+0x82/0xc0

[182577.683046]  ? __x64_sys_ioctl+0x11/0x20

[182577.683048]  ? do_syscall_64+0x3e/0xb0

[182577.683050]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

[182577.683051] Modules linked in: fuse nvidia_drm(PO) hid_logitech_hidpp nvidia_modeset(PO) nvidia(PO) hid_logitech_dj input_leds r8169 realtek libphy

[182577.683060] CR2: 0000000000007980

[182577.683062] ---[ end trace eac861e1a55d63fd ]---

[182577.683077] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]

[182577.683080] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f

[182577.683081] RSP: 0018:ffffaf0f81a63ce8 EFLAGS: 00010202

[182577.683083] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004

[182577.683084] RDX: ffff9adc238fe348 RSI: 0000000000007980 RDI: ffff9adc238f9008

[182577.683085] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000

[182577.683086] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980

[182577.683087] R13: 0000000000007980 R14: ffff9adc238f9008 R15: 0000000000000001

[182577.683090] FS:  00007f7a5de388c0(0000) GS:ffff9adc4e980000(0000) knlGS:0000000000000000

[182577.683091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[182577.683092] CR2: 0000000000007980 CR3: 00000003e6498000 CR4: 00000000003406e0

```

----------

## RayDude

Googled dmesg's crash message and found this:

https://forums.developer.nvidia.com/t/440-48-02-random-x-org-lock-ups-due-to-kernel-module-crash/110995/8

This might be a bug in the nvidia driver.

I put in his workaround and will try it for the next week or so. If it lasts, I'll re-enable my core0. Then if that works, I'll boost my memory clock up from 2133MHz.

*crosses fingers*

----------

## RayDude

Still crashing even with harddpms false in xorg.conf.

```
[361372.655844] Xorg: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0

[361372.655854] CPU: 8 PID: 19219 Comm: Xorg Tainted: P           O    T 5.8.12-gentoo #1

[361372.655856] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020

[361372.655857] Call Trace:

[361372.655866]  dump_stack+0x6d/0x90

[361372.655871]  warn_alloc.cold+0x74/0xd8

[361372.655875]  ? __alloc_pages_direct_compact+0x10f/0x130

[361372.655879]  __alloc_pages_slowpath.constprop.0+0xb69/0xba0

[361372.655882]  ? prep_new_page+0xbb/0xc0

[361372.656141]  ? _nv037032rm+0x26e/0x370 [nvidia]

[361372.656144]  __alloc_pages_nodemask+0x214/0x240

[361372.656148]  kmalloc_order+0x18/0x60

[361372.656166]  nvkms_alloc+0x1b/0xd0 [nvidia_modeset]

[361372.656188]  _nv002653kms+0x16/0x30 [nvidia_modeset]

[361372.656208]  ? _nv002759kms+0x66/0x1470 [nvidia_modeset]

[361372.656519]  ? _nv033594rm+0x40/0x40 [nvidia]

[361372.656829]  ? _nv000586rm+0xa08/0xde0 [nvidia]

[361372.656846]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]

[361372.656849]  ? __alloc_pages_nodemask+0x11f/0x240

[361372.656866]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]

[361372.656868]  ? kmalloc_order+0x54/0x60

[361372.656885]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]

[361372.656901]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]

[361372.656904]  ? __fget_files+0x6c/0xa0

[361372.656922]  ? nvkms_ioctl_common+0x36/0x160 [nvidia_modeset]

[361372.656947]  ? nvkms_ioctl_common+0x124/0x160 [nvidia_modeset]

[361372.657193]  ? nvidia_frontend_unlocked_ioctl+0x31/0x40 [nvidia]

[361372.657197]  ? ksys_ioctl+0x82/0xc0

[361372.657199]  ? __x64_sys_ioctl+0x11/0x20

[361372.657202]  ? do_syscall_64+0x3e/0xb0

[361372.657206]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

[361372.657226] Mem-Info:

[361372.657235] active_anon:2237041 inactive_anon:534370 isolated_anon:0

                 active_file:257084 inactive_file:442193 isolated_file:0

                 unevictable:24 dirty:168 writeback:506

                 slab_reclaimable:156242 slab_unreclaimable:29802

                 mapped:1703731 shmem:1591765 pagetables:19447 bounce:0

                 free:338412 free_pcp:1 free_cma:0

[361372.657239] Node 0 active_anon:8948164kB inactive_anon:2137480kB active_file:1028336kB inactive_file:1768772kB unevictable:96kB isolated(anon):0kB isolated(file):0kB mapped:6814924kB dirty:672kB writeback:2024kB shmem:6367060kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no

[361372.657246] DMA free:15888kB min:64kB low:80kB high:96kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15972kB managed:15888kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB

[361372.657247] lowmem_reserve[]: 0 3468 15940 15940

[361372.657256] DMA32 free:1259540kB min:14688kB low:18360kB high:22032kB reserved_highatomic:2048KB active_anon:1315496kB inactive_anon:275800kB active_file:75188kB inactive_file:404068kB unevictable:0kB writepending:2384kB present:3616964kB managed:3616964kB mlocked:0kB kernel_stack:3428kB pagetables:9328kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB

[361372.657257] lowmem_reserve[]: 0 0 12472 12472

[361372.657265] Normal free:78220kB min:52828kB low:66032kB high:79236kB reserved_highatomic:2048KB active_anon:7632668kB inactive_anon:1861680kB active_file:953148kB inactive_file:1364704kB unevictable:96kB writepending:312kB present:13094400kB managed:12776572kB mlocked:96kB kernel_stack:16540kB pagetables:68460kB bounce:0kB free_pcp:4kB local_pcp:0kB free_cma:0kB

[361372.657265] lowmem_reserve[]: 0 0 0 0

[361372.657268] DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15888kB

[361372.657280] DMA32: 72588*4kB (UMEH) 66966*8kB (UMEH) 25917*16kB (UMEH) 499*32kB (UMH) 20*64kB (UMH) 1*128kB (H) 1*256kB (H) 1*512kB (H) 1*1024kB (H) 0*2048kB 0*4096kB = 1259920kB

[361372.657293] Normal: 1225*4kB (UMEH) 2003*8kB (UMEH) 3506*16kB (UMEH) 11*32kB (UEH) 1*64kB (H) 2*128kB (H) 2*256kB (H) 2*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 79228kB

[361372.657307] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB

[361372.657308] 2331041 total pagecache pages

[361372.657312] 39972 pages in swap cache

[361372.657313] Swap cache stats: add 2999527, delete 2959784, find 1401773/1901494

[361372.657314] Free swap  = 6158844kB

[361372.657316] Total swap = 8388604kB

[361372.657317] 4181834 pages RAM

[361372.657317] 0 pages HighMem/MovableOnly

[361372.657318] 79478 pages reserved

[361372.657329] BUG: unable to handle page fault for address: 0000000000007980

[361372.657334] #PF: supervisor read access in kernel mode

[361372.657337] #PF: error_code(0x0000) - not-present page

[361372.657339] PGD 0 P4D 0 

[361372.657344] Oops: 0000 [#1] PREEMPT SMP NOPTI

[361372.657348] CPU: 8 PID: 19219 Comm: Xorg Tainted: P           O    T 5.8.12-gentoo #1

[361372.657356] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020

[361372.657388] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]

[361372.657394] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f

[361372.657397] RSP: 0018:ffffb08680bbfce8 EFLAGS: 00010202

[361372.657400] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004

[361372.657403] RDX: ffff9e79cab44348 RSI: 0000000000007980 RDI: ffff9e79cab41008

[361372.657405] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000

[361372.657408] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980

[361372.657411] R13: 0000000000007980 R14: ffff9e79cab41008 R15: 0000000000000001

[361372.657414] FS:  00007f032f0158c0(0000) GS:ffff9e79cec00000(0000) knlGS:0000000000000000

[361372.657417] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[361372.657419] CR2: 0000000000007980 CR3: 000000024565e000 CR4: 00000000003406e0

[361372.657422] Call Trace:

[361372.657450]  ? _nv002759kms+0x3ca/0x1470 [nvidia_modeset]

[361372.657770]  ? _nv000586rm+0x970/0xde0 [nvidia]

[361372.657788]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]

[361372.657794]  ? __alloc_pages_nodemask+0x11f/0x240

[361372.657811]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]

[361372.657815]  ? kmalloc_order+0x54/0x60

[361372.657832]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]

[361372.657848]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]

[361372.657852]  ? __fget_files+0x6c/0xa0

[361372.657869]  ? nvkms_ioctl_common+0x36/0x160 [nvidia_modeset]

[361372.657885]  ? nvkms_ioctl_common+0x124/0x160 [nvidia_modeset]

[361372.658097]  ? nvidia_frontend_unlocked_ioctl+0x31/0x40 [nvidia]

[361372.658104]  ? ksys_ioctl+0x82/0xc0

[361372.658106]  ? __x64_sys_ioctl+0x11/0x20

[361372.658110]  ? do_syscall_64+0x3e/0xb0

[361372.658114]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

[361372.658118] Modules linked in: fuse nvidia_drm(PO) nvidia_modeset(PO) hid_logitech_hidpp nvidia(PO) hid_logitech_dj input_leds r8169 realtek libphy

[361372.658132] CR2: 0000000000007980

[361372.658136] ---[ end trace 727de6fe850b9bc6 ]---

[361372.658161] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]

[361372.658166] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f

[361372.658168] RSP: 0018:ffffb08680bbfce8 EFLAGS: 00010202

[361372.658170] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004

[361372.658172] RDX: ffff9e79cab44348 RSI: 0000000000007980 RDI: ffff9e79cab41008

[361372.658175] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000

[361372.658177] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980

[361372.658178] R13: 0000000000007980 R14: ffff9e79cab41008 R15: 0000000000000001

[361372.658181] FS:  00007f032f0158c0(0000) GS:ffff9e79cec00000(0000) knlGS:0000000000000000

[361372.658183] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[361372.658185] CR2: 0000000000007980 CR3: 000000024565e000 CR4: 00000000003406e0
```

Maybe the CPU is really dead...
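To compare crashes without reading the whole dump each time, here is a minimal sketch that pulls out the faulting address, the CPU and the instruction pointer. `crash_signature` is a made-up helper and the file names in the usage comment are assumptions; it simply reads saved dmesg text on stdin.

```shell
#!/bin/bash
# Extract the faulting address, CPU/PID and RIP from saved dmesg text
# so that separate crashes can be compared side by side.
# crash_signature is a hypothetical helper; it reads dmesg text on stdin.
crash_signature() {
    grep -oE 'page fault for address: [0-9a-f]+|CPU: [0-9]+ PID: [0-9]+|RIP: [0-9a-f]+:[^ ]+' |
        LC_ALL=C sort -u
}

# Usage (assumes each dump was saved to its own file first):
#   crash_signature < crash1.txt
#   crash_signature < crash2.txt
```

In the two dumps posted here the address (0x7980) and RIP (`_nv002606kms+0x60/0x100`) are the same while the CPU differs (3 and 8). That looks more like one repeatable code path than one bad core, though that is a reading of the logs and not a diagnosis.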

----------

## Ionen

Are you using nvidia-drivers-455.23.04 now?

I've recently run into page allocation failure issues as well, could easily trigger it on purpose by doing heavy tmpfs usage but other things can randomly trigger it. There's a newer similar report on nvidia's forums.

This also didn't hang the system, could ctrl+alt+F1 to return to my efifb console (don't even need ssh, xorg was still taking inputs) and everything was going as normal except the xorg display being frozen.

Returning to nvidia-drivers-450.80.02 solved the issue.

If it's happening to you with stable 450.66 then I don't know  :Sad: 

Edit: although I did see one other user have similar problems with 450.66, I'm not convinced it's failing hardware either way. 440.100 is probably the most stable if you want to try it.

Edit2: personally still get page allocation failures if I use the new 455.28 (does need some unnatural abuse to make it happen quickly though, would mostly work otherwise), so went back to 450.80.02 again.
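For reference, a small sketch of what "order:5" means in those failure lines, assuming the usual 4 kB page size on x86-64. `order_to_kb` is a made-up helper; `/proc/buddyinfo` is the standard kernel file listing free blocks per order.

```shell
#!/bin/bash
# Size in kB of one contiguous allocation of a given order, assuming
# 4 kB pages unless a different page size is passed as the second argument.
# order_to_kb is a hypothetical helper name.
order_to_kb() {
    local order=$1 page_kb=${2:-4}
    echo $(( page_kb << order ))
}

order_to_kb 5    # the "order:5" request in the Xorg failure

# /proc/buddyinfo shows free blocks per order (columns are order 0 upward).
# Guarded so the script still runs where the file is absent.
if [ -r /proc/buddyinfo ]; then
    cat /proc/buddyinfo
fi
```

So the driver wanted 128 kB of physically contiguous memory. In the dumps above there is plenty of free memory overall but very few free blocks of that size, which points at fragmentation more than at memory running out.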

----------

## RayDude

 *Ionen wrote:*   

> Are you using nvidia-drivers-455.23.04 now?
> 
> I've recently run into page allocation failure issues as well, could easily trigger it on purpose by doing heavy tmpfs usage but other things can randomly trigger it. There's a newer similar report on nvidia's forums.
> 
> This also didn't hang the system, could ctrl+alt+F1 to return to my efifb console (don't even need ssh, xorg was still taking inputs) and everything was going as normal except the xorg display being frozen.
> ...

 

Sorry I missed your post. Yes, I'm using the nvidia blob driver. I set up nouveau but that was a mistake as it doesn't support my 4:2:0 4K monitor, so I went back to the blob. I have confirmed 450.66 failing.

I have isolated it to DPMS. I have a lot of SHM activity as the system is running zoneminder with three cameras. So that is in line with what you are seeing. There are also really bad memory leaks with Linux on all my 4K systems. Man, I wish someone would debug plasma. All my systems slowly increase memory usage until I have to reboot because chrome starts to get twitchy. I've spent a bit of time trying to figure out what is using all the memory, but that is hard to track for a novice like me. I'm going to update my server to 32GB as soon as memory prices stabilize (they are in free fall at the moment).

I'm pretty sure that the hang is happening when the NVIDIA driver attempts to contact the monitor to wake up from sleep and the monitor is off, but the receiver, which is supposed to be transparent to HDMI, is on. It happens whether I use HW or SW DPMS in the X driver. I suspect that because the receiver is on and the monitor is still off, DPMS is getting confused and getting stuck in a loop which is missing a timeout escape.

I've adjusted the settings of my Yamaha receiver and have had a few days of it working, but I'm also aware of how I wake up the PC from monitor sleep. I think it's likely to fail when I try to wake it up as the receiver and TV are turning on.

I need the TV to sleep because my OLED will burn if I don't put it to sleep. I also solved the problem by turning off DPMS and using xscreensaver, but alas, when watching youtube and netflix, the screen saver turns on and I haven't found a way to keep that from happening.
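One possible approach to the screensaver-during-video problem, sketched under assumptions: `xscreensaver-command -deactivate` tells xscreensaver to act as if there had just been user activity, so a loop can poke it while a player is running. The helper name `keep_awake_while`, the process name and the interval are all placeholders to adjust.

```shell
#!/bin/bash
# Poke xscreensaver at a fixed interval while a given program is running,
# so the saver does not kick in during a video.
# keep_awake_while is a hypothetical helper; the process name and the
# interval in seconds are assumptions to adjust.
keep_awake_while() {
    local process_name=$1 interval=${2:-240}
    while pgrep -x "$process_name" > /dev/null; do
        xscreensaver-command -deactivate > /dev/null 2>&1
        sleep "$interval"
    done
}

# Usage: keep_awake_while mpv 240 &
```

This works best with a standalone player. For video in a browser that is always open, the loop would never end, so it would have to be started and stopped by hand.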

I'm pretty sure that even though my CPU fails memtest86+, it is okay for normal loads. I'm running all cores and memory at 3200 CAS 17 without issue.

I was able to connect the PC directly to the TV and then funnel audio to my receiver through SPDIF, but there is an issue with the receiver not waking up to the SPDIF input and audio doesn't start for many seconds after video starts. It's mind bogglingly stupid and annoying. So I put everything back as it was.

Please keep me up to date on your progress and I'll keep updating this thread. This problem is fairly new (started a few months ago) so I'm pretty sure the blob changed and caused this issue. I have no idea how to get nvidia to debug it, though. My setup is pretty unique.

I might try 450.80.02 if I get another hang.

These kinds of bugs are the worst because they are rare and reproducing them in front of the appropriate software engineer is not easy. I'm hoping nvidia will pull a rabbit out of a hat and fix it.

For my next card, I might try AMD and see how their open source driver is doing. I've never run anything but intel and nvidia so that would be a first for me.

Thanks for posting.

----------

## NeddySeagoon

RayDude,

 *RayDude wrote:*   

> I was able to connect the PC directly to the TV and then funnel audio to my receiver through SPDIF, but there is an issue with the receiver not waking up to the SPDIF input and audio doesn't start for many seconds after video starts. It's mind bogglingly stupid and annoying. So I put everything back as it was. 

 

Have you tried continually playing silence? 

I have a /etc/local.d/play_silence.start that contains 

```
#!/bin/bash
# due to a kernel bug, alsa takes a few seconds to open a
# stream to HDMI (all digital?) outputs, so the first
# few seconds of everything are lost.
# the work around is to continuously play silence
# so that the sound stream is never closed.
aplay -c2 -r48000 -fS16_LE < /dev/zero &
```

You may want to adjust the parameters to aplay. I'm only using stereo here.

----------

## molletts

 *NeddySeagoon wrote:*   

> Have you tried continually playing silence? 

 

I noticed while building a 5.9 kernel yesterday that it actually has an option to do this automatically (CONFIG_SND_HDA_INTEL_HDMI_SILENT_STREAM) which can be found in the HD-Audio drivers section of menuconfig.

Stephen

----------

## RayDude

 *NeddySeagoon wrote:*   

> RayDude,
> 
>  *RayDude wrote:*   I was able to connect the PC directly to the TV and then funnel audio to my receiver through SPDIF, but there is an issue with the receiver not waking up to the SPDIF input and audio doesn't start for many seconds after video starts. It's mind bogglingly stupid and annoying. So I put everything back as it was.  
> 
> Have you tried continually playing silence? 
> ...

 

Thanks for this. This is brilliant. I mean it's a horrible work around to a stupid bug, but still it's brilliant.

----------

## RayDude

 *molletts wrote:*   

>  *NeddySeagoon wrote:*   Have you tried continually playing silence?  
> 
> I noticed while building a 5.9 kernel yesterday that it actually has an option to do this automatically (CONFIG_SND_HDA_INTEL_HDMI_SILENT_STREAM) which can be found in the HD-Audio drivers section of menuconfig.
> 
> Stephen

 

I can't believe they built it into the kernel. Wow. A bug so universal that the kernel developers put a workaround in the kernel. Amazing.

I'm going to turn this on so I never have to deal with that again...

Edit: Ah. 5.9 kernel... I haven't started using that one yet...
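
For anyone who wants to check whether their kernel config already has the option molletts mentioned, here is a minimal sketch. It assumes the Gentoo convention of a `/usr/src/linux` symlink pointing at the sources of the running kernel; `kernel_has_option` is just a helper name made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if OPTION is set to y or m in the given
# kernel config file.
kernel_has_option() {
    local option="$1" config="$2"
    grep -Eq "^${option}=(y|m)$" "$config"
}

# Assumption: /usr/src/linux/.config is the config of the kernel in use.
cfg=/usr/src/linux/.config
if [ -r "$cfg" ] && kernel_has_option CONFIG_SND_HDA_INTEL_HDMI_SILENT_STREAM "$cfg"; then
    echo "silent stream option is enabled"
else
    echo "silent stream option is not enabled (or $cfg is not readable)"
fi
```

On kernels older than 5.9 the option simply won't appear in the config, so the check reports it as not enabled.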

----------

## RayDude

A lot has happened.

I disabled all power state options in the BIOS. There were two of them, but I can't remember their names. I'm not sure whether it helped, and I'm not going to re-enable them to find out at this point.

The biggest thing I found was that I had stopped running zenstates.py because I thought it was broken. It isn't, and because I was no longer disabling the C6 state, many crashes were happening.

I stopped using it because of these messages in dmesg:

```
[   19.993243] msr: Write to unrecognized MSR 0xc0010292 by python
               Please report to x86@kernel.org
[   19.993255] msr: Write to unrecognized MSR 0xc0010292 by python
               Please report to x86@kernel.org
[   19.993264] msr: Write to unrecognized MSR 0xc0010292 by python
               Please report to x86@kernel.org
[   19.993272] msr: Write to unrecognized MSR 0xc0010292 by python
               Please report to x86@kernel.org
[   19.993280] msr: Write to unrecognized MSR 0xc0010292 by python
               Please report to x86@kernel.org
[   19.993287] msr: Write to unrecognized MSR 0xc0010292 by python
               Please report to x86@kernel.org
[   19.993296] msr: Write to unrecognized MSR 0xc0010292 by python
               Please report to x86@kernel.org
[   19.993302] msr: Write to unrecognized MSR 0xc0010292 by python
               Please report to x86@kernel.org
[   19.993308] msr: Write to unrecognized MSR 0xc0010292 by python
               Please report to x86@kernel.org
[   19.993316] msr: Write to unrecognized MSR 0xc0010292 by python
               Please report to x86@kernel.org
```

I think this is the zenstates.py code attempting to change states for non-existent CPUs.

I've gotten better stability. It's been up for two days now and I think it will last a lot longer. Hopefully I can use the computer until Zen 4 comes out next year or the year after.

I want to document what is needed to get a Zen 1 CPU stable on Gentoo and 5.9.X kernel:

First, when you boot linux you need to disable various features which may be buggy on your Zen 1.

linux boot command line:

```
server ~ # cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.9.6-gentoo root=/dev/nvme0n1p4 ro idle=nomwait rcu_nocbs=0-11 amd_iommu=on video=efifb:off kvm.ignore_msrs=1 pci=msi
```

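Note: the rcu_nocbs range has to cover every logical CPU. 0-11 covers the 12 threads of my 6 core Ryzen 5 1600; if you have more than 6 cores, you need to extend the range so it specifies all of them.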
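
If you don't want to count threads by hand, something like the following should print a matching option for the machine it runs on. This is only a sketch: `rcu_nocbs_range` is a made-up helper name, and it assumes the logical CPUs are numbered contiguously from 0.

```shell
#!/bin/bash
# Hypothetical helper: build an rcu_nocbs= argument covering logical
# CPUs 0..N-1, where N is passed as the first argument.
rcu_nocbs_range() {
    local ncpus="$1"
    echo "rcu_nocbs=0-$(( ncpus - 1 ))"
}

# nproc --all counts installed logical CPUs, including offline ones.
rcu_nocbs_range "$(nproc --all)"
```

On a 12 thread CPU this prints `rcu_nocbs=0-11`, which matches the command line above.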

Second, you have to run zenstates.py every boot to disable power state C6:

```
server ~ # cat /etc/local.d/zenstates.start
# This script turns off C6 states on my Ryzen 5-1600 to improve stability.
/sbin/modprobe msr
/usr/local/bin/zenstates.py --c6-disable
```

Last: you have to ensure that you aren't using the latest nvidia driver, which has an issue with DPMS on HDMI.

dmesg shows this when the nvidia driver crashes:

```
[182577.681757] Xorg: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[182577.681764] CPU: 3 PID: 3483 Comm: Xorg Tainted: P           O    T 5.8.12-gentoo #1
[182577.681765] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020
[182577.681766] Call Trace:
[182577.681773]  dump_stack+0x6d/0x90
[182577.681776]  warn_alloc.cold+0x74/0xd8
[182577.681779]  ? __alloc_pages_direct_compact+0x10f/0x130
[182577.681781]  __alloc_pages_slowpath.constprop.0+0xb69/0xba0
[182577.681783]  ? prep_new_page+0xbb/0xc0
[182577.681927]  ? _nv037032rm+0x26e/0x370 [nvidia]
... There is much more ...
```

Stop this error by going back to an older nvidia-drivers version and masking anything newer:

```
server ~ # equery list nvidia-drivers
 * Searching for nvidia-drivers ...
[IP-] [  ] x11-drivers/nvidia-drivers-450.80.02:0/450

server ~ # cat /etc/portage/package.mask/nvidia-drivers
>x11-drivers/nvidia-drivers-450.80.02
```

Bonus: there are many memory leaks on my server. I know because after only two days I have 3.74 GB of swap used. I'm not sure what is leaking memory. I went looking and found that much of swap was in use by chrome, but much of it was related to kwin and plasma as well.
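
One way to see which programs are sitting in swap is to add up the `VmSwap` field from each process's `/proc/<pid>/status`. The sketch below does that; `swap_by_name` is a made-up helper name, and the optional directory argument only exists so it can be pointed at something other than `/proc`. It is not necessarily how the numbers above were obtained.

```shell
#!/bin/bash
# Hypothetical helper: sum VmSwap (kB) per process name from /proc-style
# status files and print the biggest swap users first.
swap_by_name() {
    local procdir="${1:-/proc}"
    # cat skips processes that exit while we are reading; kernel threads
    # have no VmSwap line and are ignored.
    cat "$procdir"/[0-9]*/status 2>/dev/null | awk '
        /^Name:/   { sub(/^Name:[ \t]*/, ""); name = $0 }
        /^VmSwap:/ { if ($2 > 0) swap[name] += $2 }
        END        { for (n in swap) printf "%d kB\t%s\n", swap[n], n }
    ' | sort -rn
}

# Show the ten largest swap users on this machine.
swap_by_name | head -n 10
```

Run it as root to include processes owned by other users.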

I'm fixing this by increasing my RAM from 16GB to 32GB.

I'll post again to let you know how long it lasts. Removing zenstates.py was a huge mistake. I got a crash a day after C6 was enabled. But the crashes weren't always fatal. Sometimes just the video would crash, and I could get it back by switching to console and back to X three or four times.

----------

