# System hangs randomly but only when using amdgpu [solved]

## Amity88

This is a fresh Linux system. The screen randomly fills with a color (blue, white, yellow, etc.) and is unusable until a restart. I'm not even sure this is a hang, because sometimes when it happens I can still hear the audio from the YouTube video.

It's weird, as it never happens in Windows 8.1 but randomly hits me when I use Gentoo, SysRescueCD, SUSE, Mint or FreeBSD. In short, it happens on any non-Windows OS.

I don't think it's a video issue because it happens even in pure CLI. The kernel log and dmesg don't show any error at the time of the hang. Do you guys have any suggestions on what else I could check to fix this?

Here's the output of lspci, this is an ASUS H81M-CS motherboard:

```
00:00.0 Host bridge: Intel Corporation 4th Gen Core Processor DRAM Controller (rev 06)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06)
00:02.0 Display controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)
00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05)
00:16.0 Communication controller: Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05)
00:1b.0 Audio device: Intel Corporation 8 Series/C220 Series Chipset High Definition Audio Controller (rev 05)
00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 (rev d5)
00:1c.1 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #2 (rev d5)
00:1c.2 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #3 (rev d5)
00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation C220 Series Chipset Family H81 Express LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05)
00:1f.3 SMBus: Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller (rev 05)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Oland XT [Radeon HD 8670 / R7 250/350]
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series]
03:00.0 Network controller: Qualcomm Atheros AR9485 Wireless Network Adapter (rev 01)
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 11)
```

uname -an

```
Linux vivalarev 4.9.76-gentoo-r1 #3 SMP Sun Feb 11 18:08:13 IST 2018 x86_64 Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz GenuineIntel GNU/Linux
```

----------

## Jaglover

Try swapping RAM modules if you have more than one. (Maybe Microsoft's dream is finally fulfilled: a computer that runs Windows only.)

----------

## Amity88

I ran memtest86 over the past 24 hours and didn't get any errors or blank screens.

Currently, I suspect that it's the AMD GPU driver (amdgpu; the card is an R7 250, Southern Islands, GCN 1.0) that is causing the issue. For debugging, I'm going to try the following:

1. Try using the older radeon driver and see if the issue persists.

2. If that doesn't work, I'll try using the onboard Intel GPU.
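
When switching between drivers like this, it helps to confirm which kernel driver is actually bound to each card, since both `radeon` and `amdgpu` can be built for Southern Islands hardware. A minimal sketch that reads the driver symlink in sysfs; `drm_driver` is a hypothetical helper name, and which card index the AMD GPU gets is an assumption that depends on the system:

```shell
#!/bin/sh
# Print the kernel driver bound to a DRM card, e.g. amdgpu, radeon or i915.
# $1 = card directory under /sys/class/drm (hypothetical helper, not a standard tool).
drm_driver() {
    link="$1/device/driver"
    # The driver symlink is missing when no driver is bound to the device.
    [ -L "$link" ] || return 1
    basename "$(readlink "$link")"
}

# List every card the kernel knows about; with the Intel GPU enabled as well,
# the AMD card may show up as card1 rather than card0.
for card in /sys/class/drm/card[0-9]; do
    [ -e "$card" ] || continue   # glob did not match: no DRM cards present
    printf '%s: %s\n' "${card##*/}" "$(drm_driver "$card" || echo unknown)"
done
```

`lspci -k` reports the same information on its "Kernel driver in use" line.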

----------

## Amity88

(changing the subject to better reflect the actual issue)

So, I was able to narrow down the issue to the graphics driver.

1. The screen blanks out randomly when I use the amdgpu driver.

2. It's a lot worse when I use radeon.

3. The only thing that worked in the past was the old fglrx driver, a year ago. I can't use it anymore though, because they've dropped support  :Sad: 

4. The onboard Intel GPU driver is stable. This is what I'm using now.

Not sure how I can fix this. If you guys get the AMD R7 250 (Southern Islands) working without random hangs, please let me know.

----------

## Zucca

Do you get anything in dmesg/logs?

Have you tried other kernel versions?

I have

```
VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde PRO [Radeon HD 7750/8740 / R7 250E]
```

on my server. So I could start poking around. I just need to attach a monitor to it.  :Razz: 

----------

## Amity88

I didn't find anything in the logs/dmesg when booted into SysRescueCD after an incident.
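
Since `dmesg` on the rescue system only covers the current boot, messages from the boot that hung have to come from the installed system's on-disk log. A rough sketch of that search, assuming the Gentoo root is mounted at `/mnt/gentoo` and the logger writes to `var/log/messages` (both are assumptions; the path depends on the system logger in use). `gpu_log_lines` is a hypothetical helper name:

```shell
#!/bin/sh
# Print GPU-related kernel messages from a saved log file.
# $1 = log file to search (hypothetical helper, not a standard tool).
gpu_log_lines() {
    grep -iE 'amdgpu|radeon|drm' "$1"
}

# Assumed location of the installed system's log when booted from a rescue CD.
LOG="${LOG:-/mnt/gentoo/var/log/messages}"
if [ -r "$LOG" ]; then
    # The lines just before the hang are the interesting ones, so show the tail.
    gpu_log_lines "$LOG" | tail -n 50
else
    echo "log file $LOG not readable" >&2
fi
```

If the CPU halts outright, nothing may reach the disk at all, so an empty result does not rule out a driver problem.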

About other kernel versions: this system used to work fine with the fglrx drivers. Things just got messy after AMD moved over to the amdgpu drivers.

Also, it's good to know that you actually have something close in design. I think mine is GCN 1.1 and yours is probably GCN 1.2  :Smile: 

I haven't really started using this build, so I'm willing to experiment if you have anything you want me to try.

----------

## NeddySeagoon

Amity88,

I have a similar issue with an R450.

I've tried different motherboard slots, the old and new amdgpu drivers, and turning off Message Signalled Interrupts (it's a kernel command line option).

Memtest finds nothing and there is nothing in kernel logs.

The incident halts the CPU, as it won't even respond to the reset button, which is probably why there is nothing in the logs.

After a power cycle, the system often restarts with one core missing.

A restart can take a couple of hours too. It's left my raid5 'dirty' a few times, so it does a resync.

So far, I've only tried the two amdgpu drivers, but a few other non-accelerated drivers should work.

That may help determine if it's a hardware or a software problem.

----------

## Amity88

Hey there Neddy  :Smile: 

I've ruled out any hardware issues because I dual boot with Windows 8.1 and it runs pretty stable.

The symptoms on Linux are very similar to what you experience. Doesn't respond to reset key combinations, the actual reset button doesn't work at times. Restarts don't take much time though.

The non-accelerated drivers would do software rendering, right? As crappy as it is, I figured that the Intel GPU is better than software rendering. Maybe I should try pulling in fglrx or amdgpu-pro.

----------

## NeddySeagoon

Amity88,

Yes - there would be no acceleration at all.  I had in mind vesa or fbdev.

The GPU does nothing and the CPU does all the drawing.  Performance will be terrible.

I've gone back to my 9-year-old nVidia card meanwhile, as I need to address Meltdown/Spectre everywhere and having random lockups doesn't help.

----------

## thumper

Have you checked your logs for kernel crash dumps?

I had these:

```
amdgpu 0000:24:00.0: swiotlb buffer is full (sz: 2097152 bytes)

swiotlb: coherent allocation failed for device 0000:24:00.0 size=2097152

CPU: 0 PID: 5149 Comm: Compositor Tainted: G           OE    4.15.3-gentoo #1
```

And it would eventually hard lock the machine.

After some research I added this to my kernel command line:

```
swiotlb=65536
```

Did that last week, have not crashed since, still time will tell.  Could be a coincidence.

George
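
To confirm after a reboot that a parameter such as `swiotlb=65536` actually reached the kernel, the running command line can be read from `/proc/cmdline`. A small sketch; `cmdline_param` is a hypothetical helper name, not an existing tool:

```shell
#!/bin/sh
# Print NAME=value for a kernel parameter if it is on the command line.
# $1 = parameter name, $2 = file holding the command line (default /proc/cmdline).
cmdline_param() {
    # Split the space-separated command line into one parameter per line.
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep "^$1="
}

cmdline_param swiotlb || echo "swiotlb is not set on the kernel command line"
```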

----------

## PrSo

thumper,

Those messages in the log are totally harmless and shouldn't be the reason for hard locking; please see this bug report and this patch on LKML. So this _is_ a coincidence, although it could be a symptom.

Amity88,

is there any special reason that you are on 4.9.76-gentoo-r1 kernel?

----------

## Amity88

*PrSo wrote:*

> Amity88,
>
> is there any special reason that you are on 4.9.76-gentoo-r1 kernel?

I just use this version because it was the latest stable kernel. Do you feel that a newer kernel would fix the problem?

----------

## Mimamau

As in my other thread, there seem to be problems with Southern Islands GPUs.

I only get a slow 2D desktop; everything else gives me a blank screen or crashes the system completely.

Even the amdgpu-pro drivers don't work on supported distributions. AMD support wrote:

"I apologize for the delay. I was waiting for feedback from the subject matter experts.

Unfortunately, it appears the HD7870 series has not been qualified with our latest drivers.

The recommendation is to use the inbox drivers or an open source driver, available here: https://www.x.org/wiki/RadeonFeature/#index10h2

If you experience issues with the open source drivers, please file a report at the link above. I have been informed that our engineers monitor and investigate reports listed there.

In order to update this service request, please respond, leaving the service request reference intact.

Best regards,

AMD Global Customer Care"

----------

## NeddySeagoon

Amity88,

There is a new amdgpu driver in the 4.15 kernel.

It's worth a try.

----------

## Tony0945

*Amity88 wrote:*

> I just use this version because it was the latest stable kernel. Do you feel that a newer kernel would fix the problem?

4.9.82 is in the tree.

I have problems with 4.4.x and 4.9.x, where the motherboard module nct6775 fails to load. No problem with 4.14.x. Running 'meld' on the relevant kernel source, I see that 4.4 and 4.9 are identical but 4.14 has tables with an extra entry. Undoubtedly that entry supports my mobo, which is a new AM4 mobo.

*NeddySeagoon wrote:*

> Amity88,
>
> There is a new amdgpu driver in the 4.15 kernel.
>
> It's worth a try.

Based on Neddy's input, I would try 4.14 or 4.15 (has some Spectre mitigation) or, depending on your comfort level, try backporting the driver to 4.9.

I think I'll try that, just for fun.

EDIT: backporting the driver worked fine. Couldn't find where on kernel.org to file a bug. I may just file a bug against gentoo-sources.

Last edited by Tony0945 on Tue Aug 14, 2018 12:52 am; edited 3 times in total

----------

## PrSo

*Amity88 wrote:*

> Do you feel that a newer kernel would fix the problem?

Just like Neddy said, you should try 4.15.4 (change to ~amd64 or unmask gentoo-sources).

There is a big improvement in the amdgpu driver, and a new AMD DC (but I am not sure if your card family, Oland, is supported; BTW SI = GCN 1.0).

I have one machine with GCN 1.1 (R4 APU, CIK), and 4.15 is the first mainline kernel where things work quite well on the amdgpu driver (it is still _experimental_ for SI and CIK, though).

One more thing, do you have dual gpu enabled, Intel and AMD?

```
00:02.0 Display controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06) 
```

----------

## NeddySeagoon

Having done the Spectre updates, I've gone back to my RX450 card.

As the 4.15 kernel didn't fix my lockups, I'm trying 4.16-rc1

Watch this space.

----------

## Zucca

*NeddySeagoon wrote:*

> Watch this space.

I have held back all kernel and amdgpu updates. Now waiting eagerly.

I really don't want my server to lock up. I have exactly one spare GPU and it is AMD HD 7850. I think it's affected too. And I think the current one on my server is too: Cape Verde PRO R7 250E

My desktop has Fiji Based R9 Nano... I think I'm safe there...

----------

## gcyoung

I am also getting intermittent screen and wireless keyboard freezes. While it works, the amdgpu module seems better than radeonsi. I don't know if it is connected, but my dmesg output at login contains the message `[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by hw`.

I note that the Arch web site also contains references to problems with the combination of amdgpu and Ryzen processors.

I have ssh'd (without X) into the computer from another machine, and find it is still responding normally to commands.

It's a pity, since I like the result before it freezes!

PS: I am using kernel-4.17.6 which is not listed as stable, but I found the same problem with an earlier stable kernel

----------

## Goverp

This is probably no help, but I'm running an hp laptop with /proc/cpuinfo model name: "AMD A9-9420 RADEON R5, 5 COMPUTE CORES 2C+3G".  It's a STONEY graphics thingy.

It also has an rtl8723de wireless adapter, which meant I needed a very late kernel (and an external module), so I ran kernel 4.16 originally and 4.17.1 now.

Never had any problems like described in this thread, nor any issues from using a late kernel.

AFAIK (I read Phoronix summaries) AMDGPU support features regularly in the kernel change logs.

I currently have:

```
/etc/portage/make.conf

VIDEO_CARDS="amdgpu radeonsi"
```

and, to reduce kernel churn:

```
/etc/portage/package.keywords

<=sys-kernel/gentoo-sources-4.17.1 ~amd64

```

I read today that 4.18 has more AMDGPU stuff.

----------

## gcyoung

It may be of interest to others with this problem that, since my last post, I have followed a suggestion given under the heading 'dpm' on https://wiki.gentoo.org/wiki/AMDGPU#Hardware_detection:

`echo performance > /sys/class/drm/card0/device/power_dpm_state`

and:

`echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level`

Since making these settings I have had no further "freezes", except when I made only the first setting. With both in place I have used the computer for about twenty hours, including one five-hour mythtv frontend session. Previously, I regularly failed to get through a fairly standard film viewing, say about two hours, without needing a reboot.

Unfortunately the settings disappear when I log out, although I suppose I can write a small script to apply them on login. If there is any way to pass the settings as options to the module, I'd be glad to hear of it; or possibly there is a kernel setting which would do the trick.

If I don't return with a message that I've had another freeze-up, it can be assumed that these settings have solved at least my difficulty, although they might not work in other cases.
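
Values written to sysfs are not stored anywhere, so they have to be re-applied after every boot, and the writes need root. A minimal sketch of a script that does this, assuming the AMD card is `card0` as in the two settings described in this post; the `apply_dpm_settings` name and the `DRM_DEV` variable are made up for illustration:

```shell
#!/bin/sh
# Re-apply the two DPM settings from this thread. Must run as root.
# DRM_DEV defaults to card0 (an assumption); override it if the AMD GPU is card1.
DRM_DEV="${DRM_DEV:-/sys/class/drm/card0/device}"

# $1 = sysfs device directory of the GPU (hypothetical helper, not a standard tool).
apply_dpm_settings() {
    dev="$1"
    # Bail out if the files are missing, e.g. wrong card index or a different driver.
    [ -w "$dev/power_dpm_state" ] || return 1
    [ -w "$dev/power_dpm_force_performance_level" ] || return 1
    # Both writes are needed; the post reports freezes with only the first one.
    echo performance > "$dev/power_dpm_state"
    echo high > "$dev/power_dpm_force_performance_level"
}

apply_dpm_settings "$DRM_DEV" || echo "DPM sysfs files not writable under $DRM_DEV" >&2
```

On an OpenRC system, one option is to save this as an executable file such as `/etc/local.d/amdgpu-dpm.start` (a hypothetical name); the `local` service runs `*.start` files from that directory at boot.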

----------

## Amity88

@gcyoung,

You have solved it, I think! This is the same solution that worked for me as well; I came back here to update the thread. Basically, we need to disable the dynamic power management (dpm) of this GPU.

----------

## Zucca

Thanks. Gotta poke those settings too.

Sadly, it looks like power consumption will increase. :|

----------

