# Issue with amdgpu card and powerplay since kernel update

## ZenoOfElea

I have noticed a recent issue with my amdgpu-based R9 380 (Volcanic Islands series) graphics card. I am not sure whether it is caused by a mistaken kernel configuration or what exactly triggers it, but during boot, when DRM KMS takes over from the legacy 80x24 framebuffer, the system freezes for 15 seconds or so and the following is printed to the kernel message buffer:

```
[   27.574874] amdgpu: [powerplay] Failed to retrieve minimum clocks.
[   27.574875] amdgpu: [powerplay] Error in phm_get_clock_info
[   27.575091] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[   27.575103] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[   27.575114] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[   27.575442] [drm] Display Core initialized with v3.1.59!
[   27.628472] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[   27.628473] [drm] Driver supports precise vblank timestamp query.
[   27.677800] [drm] UVD initialized successfully.
[   27.888890] [drm] VCE initialized successfully.
[   27.890434] [drm] fb mappable at 0xE0E25000
[   27.890435] [drm] vram apper at 0xE0000000
[   27.890436] [drm] size 8294400
[   27.890436] [drm] fb depth is 24
[   27.890437] [drm]    pitch is 7680
[   27.890574] fbcon: amdgpudrmfb (fb0) is primary device
[   28.031451] Console: switching to colour frame buffer device 240x67
[   28.053547] amdgpu 0000:02:00.0: fb0: amdgpudrmfb frame buffer device
[   28.410220] amdgpu: [powerplay]
                failed to send message 5d ret is 0
[   28.760175] amdgpu: [powerplay]
                last message was failed ret is 0
[   29.110123] amdgpu: [powerplay]
                failed to send message 148 ret is 0
[   29.809993] amdgpu: [powerplay]
                last message was failed ret is 0
[   30.159941] amdgpu: [powerplay]
                failed to send message 145 ret is 0
[   30.859815] amdgpu: [powerplay]
                last message was failed ret is 0
[   31.209777] amdgpu: [powerplay]
                failed to send message 146 ret is 0
[   31.568264] amdgpu: [powerplay]
                last message was failed ret is 0
[   31.914555] amdgpu: [powerplay]
                last message was failed ret is 0
[   31.920737] amdgpu: [powerplay]
                failed to send message 155 ret is 0
[   32.267039] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   32.273256] amdgpu: [powerplay]
                last message was failed ret is 0
[   32.625681] amdgpu: [powerplay]
                failed to send message 15b ret is 0
[   32.969465] amdgpu: [powerplay]
                last message was failed ret is 0
[   33.319269] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   34.018608] amdgpu: [powerplay]
                last message was failed ret is 0
[   34.368287] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   35.067659] amdgpu: [powerplay]
                last message was failed ret is 0
[   35.417341] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   36.116695] amdgpu: [powerplay]
                last message was failed ret is 0
[   36.466369] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   37.165719] amdgpu: [powerplay]
                last message was failed ret is 0
[   37.515390] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   38.214738] amdgpu: [powerplay]
                last message was failed ret is 0
[   38.564429] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   39.263779] amdgpu: [powerplay]
                last message was failed ret is 0
[   39.613451] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   40.312770] amdgpu: [powerplay]
                last message was failed ret is 0
[   40.662444] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   41.361799] amdgpu: [powerplay]
                last message was failed ret is 0
[   41.711474] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   42.410827] amdgpu: [powerplay]
                last message was failed ret is 0
[   42.760505] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   43.459853] amdgpu: [powerplay]
                last message was failed ret is 0
[   43.809519] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   43.809611] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:02:00.0 on minor 0
[   43.809974] [drm] Initialized i915 1.6.0 20180719 for 0000:00:02.0 on minor 1
[   43.825122] [drm] Cannot find any crtc or sizes
[   43.830205] [drm] Cannot find any crtc or sizes
[   43.835234] [drm] Cannot find any crtc or sizes
[   46.272965] amdgpu: [powerplay]
                last message was failed ret is 0
[   46.653008] amdgpu: [powerplay]
                failed to send message 154 ret is 0
[   47.658171] [drm:amdgpu_uvd_ring_test_ib [amdgpu]] *ERROR* amdgpu: (0)IB test timed out.
[   47.658205] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 12 (-110).
[   48.209963] amdgpu: [powerplay]
                last message was failed ret is 0
[   48.561668] amdgpu: [powerplay]
                failed to send message 15a ret is 0
[   48.561885] [drm:process_one_work] *ERROR* ib ring test failed (-110).
[   49.942961] amdgpu: [powerplay]
                last message was failed ret is 0
[   50.292627] amdgpu: [powerplay]
                failed to send message 15b ret is 0
[   50.957807] amdgpu: [powerplay]
                last message was failed ret is 0
[   51.307466] amdgpu: [powerplay]
                failed to send message 155 ret is 0
```

After the system has booted, the problem only reappears when I run the `sensors` program from the lm_sensors package. `sensors` works, but it causes a temporary freeze and spams the message buffer with:

```
[13835.186423] amdgpu: [powerplay]
                last message was failed ret is 0
[13835.537600] amdgpu: [powerplay]
                failed to send message 282 ret is 0
[13835.888783] amdgpu: [powerplay]
                last message was failed ret is 0
[13836.239905] amdgpu: [powerplay]
                failed to send message 170 ret is 0
[13836.591997] amdgpu: [powerplay]
                last message was failed ret is 0
[13836.943289] amdgpu: [powerplay]
                failed to send message 171 ret is 0
[13837.295225] amdgpu: [powerplay]
                last message was failed ret is 0
[13837.646192] amdgpu: [powerplay]
                failed to send message 171 ret is 0
[13837.998352] amdgpu: [powerplay]
                last message was failed ret is 0
[13838.349518] amdgpu: [powerplay]
                failed to send message 171 ret is 0
[13838.701678] amdgpu: [powerplay]
                last message was failed ret is 0
[13839.052848] amdgpu: [powerplay]
                failed to send message 171 ret is 0
[13839.404880] amdgpu: [powerplay]
                last message was failed ret is 0
[13839.755661] amdgpu: [powerplay]
                failed to send message 171 ret is 0
[13840.107747] amdgpu: [powerplay]
                last message was failed ret is 0
[13840.459014] amdgpu: [powerplay]
                failed to send message 171 ret is 0
[13840.811228] amdgpu: [powerplay]
                last message was failed ret is 0
[13841.162384] amdgpu: [powerplay]
                failed to send message 171 ret is 0
[13841.514520] amdgpu: [powerplay]
                last message was failed ret is 0
[13841.865626] amdgpu: [powerplay]
                failed to send message 171 ret is 0
[13842.217726] amdgpu: [powerplay]
                last message was failed ret is 0
[13842.569209] amdgpu: [powerplay]
                failed to send message 171 ret is 0
[13842.921494] amdgpu: [powerplay]
                last message was failed ret is 0
[13843.272727] amdgpu: [powerplay]
                failed to send message 171 ret is 0
```

I am at a loss as to what I should do to tackle this problem; any suggestion or information would be greatly appreciated.

----------

## davee

Hey,

I have had a similar issue with amdgpu and my R9 290 (Sea Islands). While I don't have the same 15-second freeze on boot, I do get intermittent freezes during normal use of my system. I have also narrowed this down to the lm_sensors package; specifically, the issue occurs after a failure to read the fan1 state.

```
# sensors -u
amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:
  in0_input: 1.000
fan1:
ERROR: Can't get value of subfeature fan1_input: Can't read
temp1:
  temp1_input: 65.000
  temp1_crit: 104000.000
  temp1_crit_hyst: -273.150
power1:
  power1_average: 66.165
  power1_cap: 225.000
```

When this happens, I also get a similar powerplay error message in the amdgpu driver:

```
amdgpu: [powerplay]
 failed to send message 282 ret is 254
```

While this error has always been displayed for me, the freezing has only appeared since I updated the kernel from 4.19.27 to 4.19.44. I am looking for a more precise cause, but so far I have not found anything. Did you manage to get any further with your issue?
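
If it really is the fan1 read that sets things off, one thing that might be worth trying is telling lm_sensors to skip that feature. A minimal sketch, assuming the chip shows up as `amdgpu-pci-*` as in the output above and that a drop-in directory `/etc/sensors.d/` is read on your distribution. The helper name `sensors_ignore_stanza` and the file name in the comment are made up for illustration, and I can't promise this stops the freeze:

```shell
#!/bin/sh
# Print an lm_sensors config stanza that hides one feature of matching chips.
# Arguments: chip pattern (e.g. "amdgpu-pci-*") and feature name (e.g. fan1).
# Hypothetical helper; only the "chip" and "ignore" statements are lm_sensors syntax.
sensors_ignore_stanza() {
    chip_pattern=$1
    feature=$2
    printf 'chip "%s"\n' "$chip_pattern"
    printf '    ignore %s\n' "$feature"
}

# Example (as root); the file name is arbitrary:
# sensors_ignore_stanza 'amdgpu-pci-*' fan1 > /etc/sensors.d/amdgpu-ignore-fan.conf
```

After writing the file, running `sensors -u` again should no longer list `fan1` for that chip, which makes it easy to see whether the kernel message still appears.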

----------

## Goverp

FWIW I too get a very annoying 15 sec freeze on booting.  Mine is a Radeon RX570.  Thanks for the hints about lm-sensors - I'll dig a little.  AFAIR I got rid of that package because my old AMD Phenom motherboard tells lies, rendering lm-sensors useless.

----------

## miiichael

Hi,

For the benefit of posters here, and googlers in general, here are my discoveries. R9 290 on Debian (shhh, don't tell anyone!). 4.19.0 AMD64 kernel.

Anyway, I've just noticed that anything touching /sys/class/hwmon/hwmon3/power1_average causes the kernel error messages I get:

```
root@joyola:/home/michael# time cat "/sys/class/hwmon/hwmon3/power1_average";tail /var/log/kern.log|grep $(date +%T)
32140000

real    0m0.498s
user    0m0.000s
sys     0m0.497s
Jun 12 16:26:00 joyola kernel: [399556.316094] amdgpu: [powerplay]
Jun 12 16:26:00 joyola kernel: [399556.316094]  failed to send message 282 ret is 254
```

I found this out by strace'ing /usr/bin/sensors, which on my system is invoked half a dozen times every five minutes via munin-node.

This does confirm suspicions that this is a kernel issue (as opposed to the xorg driver, or other ancillary libraries, etc).

I can't comment on boot delays, as I don't really reboot often enough to be sure (plus I think my boot delay problems relate mostly to both eth0 and my ethernet over power waiting for the other to wake up before waking themselves up...).

Edited to add: BTW I have "radeon.cik_support=0 amdgpu.cik_support=1 radeon.si_support=0 amdgpu.si_support=1 amdgpu.dc_log=1 amdgpu.dc=0" set, if that matters.
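
Side note for anyone copying the command above: the `hwmon3` index is not guaranteed to be the same on another machine. A rough sketch of finding the right directory by its `name` file instead. The function name `find_hwmon_by_name` is just an illustration, and it assumes the driver reports itself as `amdgpu` in that file:

```shell
#!/bin/sh
# Print every hwmon directory under the given root whose "name" file
# matches the given driver name. The root defaults to /sys/class/hwmon.
# Hypothetical helper, not part of any package.
find_hwmon_by_name() {
    driver=$1
    root=${2:-/sys/class/hwmon}
    for dir in "$root"/hwmon*; do
        # Skip entries without a readable name file (also covers "no match" globs).
        [ -r "$dir/name" ] || continue
        if [ "$(cat "$dir/name")" = "$driver" ]; then
            printf '%s\n' "$dir"
        fi
    done
}
```

Usage would be something like `dir=$(find_hwmon_by_name amdgpu)` followed by `cat "$dir/power1_average"`; with more than one AMD card it prints one directory per line.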

----------

## TigerJr

 *miiichael wrote:*   

> Hi,
> 
> For the benefit of posters here, and googlers in general, here are my discoveries. R9 290 on Debian (shhh, don't tell anyone!). 4.19.0 AMD64 kernel.
> 
> Anyway, I've just noticed that anything touching /sys/class/hwmon/hwmon3/power1_average causes the kernel error messages I get:
> ...

 

I don't think `grep $(date +%T)` is the right way to find the cause of kernel messages, but I'm sure the error is in the amdgpu kernel driver.

Just try a modern 5.0.x kernel, and if the error happens again, post a message here.

cat /sys/class/hwmon/hwmon3/in0_input

Do you get the same message when you read the current voltage?
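
A somewhat more careful way to check this than matching on the timestamp is to count the failure lines in the kernel log before and after the read. A minimal sketch, assuming the log is a plain file such as the `/var/log/kern.log` used earlier in the thread. `count_failed_msgs` and `probe_attr` are hypothetical helper names, and the default one second wait for the log to catch up is a guess:

```shell
#!/bin/sh
# Count log lines that report a failed powerplay message.
count_failed_msgs() {
    # grep -c prints 0 but exits non-zero when nothing matches; keep the 0.
    grep -c 'failed to send message' "$1" || true
}

# Read one sysfs attribute and report how many failure lines the log gained.
# Arguments: attribute file, log file, optional wait in seconds (default 1).
probe_attr() {
    attr=$1
    log=$2
    delay=${3:-1}
    before=$(count_failed_msgs "$log")
    cat "$attr" > /dev/null 2>&1 || true
    sleep "$delay"   # give the log daemon a moment to write the new lines
    after=$(count_failed_msgs "$log")
    printf '%s: %s new failure line(s)\n' "$attr" "$((after - before))"
}
```

Run as root, for example `probe_attr /sys/class/hwmon/hwmon3/in0_input /var/log/kern.log` and then the same with `power1_average`, and compare the two counts. Other readers of the same sensors (munin, psensor) can add lines in the meantime, so it is best tried with those stopped.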

----------

## Goverp

You may find my partial solution of interest.

----------

## MasterCATZ

"old cards and kernels dpm=1 enabled the new dpm; then when AMD power play came out, they swapped its definition and dpm=0 would select power play and dpm=1 would still select the old power management"

I wonder if this is part of the reason I randomly lose fan control on my R9 290s.

I cannot manually adjust anything, and neither can the BIOS.

Because everything started roasting when something randomly forced the fans to 20%, I modded the BIOS with higher fan speeds. I still see this situation now, even after I am blocked from accessing the fans manually. Nothing is wrong with the fans themselves: when the GPU hits 96 deg they do go to 100%.
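
For checking whether the driver has taken fan control back, the amdgpu hwmon directory has a `pwm1_enable` file. As far as I know, the kernel documentation lists 0 as no fan speed control, 1 as manual and 2 as automatic. A small sketch that turns the value into something readable; `fan_mode` is a made-up name, and the `hwmon3` path in the usage line is only an example that may differ on your box:

```shell
#!/bin/sh
# Translate the content of an amdgpu pwm1_enable file into a readable mode.
# Hypothetical helper; the 0/1/2 meanings are taken from the amdgpu hwmon docs.
fan_mode() {
    case $(cat "$1" 2>/dev/null) in
        0) echo "no fan speed control" ;;
        1) echo "manual" ;;
        2) echo "automatic" ;;
        *) echo "unknown" ;;   # file missing, unreadable, or unexpected value
    esac
}
```

For example `fan_mode /sys/class/hwmon/hwmon3/pwm1_enable`. If it prints "automatic" after you had set it to manual, something switched the mode back.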

----------

## MasterCATZ

Has anyone found a solution for this yet? It's cluttering up my log files and wasting a huge amount of space, just from this spam that is generated every second.

I'm on kernel 5.3.11-050311-generic.

I need PSensor so I can keep track of when AMDGPU fan control has been taken over and my manual control is disabled.

That way I can reboot before it becomes an inferno, because AMD keeps forcing the fan below 20% when it needs 60%+ to keep under 80 deg.

```
Nov 26 14:58:35 aio psensor.desktop[894]: [2019-11-26T04:58:34] [ERR] lmsensor: Cannot get value of subfeature fan1_input: Can't read.
Nov 26 14:58:35 aio kernel: [106737.063836] amdgpu: [powerplay]
Nov 26 14:58:35 aio kernel: [106737.063836]  failed to send message 282 ret is 254
Nov 26 14:58:36 aio kernel: [106738.062432] amdgpu: [powerplay]
Nov 26 14:58:36 aio kernel: [106738.062432]  failed to send message 282 ret is 254
Nov 26 14:58:37 aio psensor.desktop[894]: [2019-11-26T04:58:36] [ERR] lmsensor: Cannot get value of subfeature fan1_input: Can't read.
Nov 26 14:58:37 aio kernel: [106739.061712] amdgpu: [powerplay]
Nov 26 14:58:37 aio kernel: [106739.061712]  failed to send message 282 ret is 254
Nov 26 14:58:38 aio kernel: [106740.061579] amdgpu: [powerplay]
Nov 26 14:58:38 aio kernel: [106740.061579]  failed to send message 282 ret is 254
Nov 26 14:58:39 aio psensor.desktop[894]: [2019-11-26T04:58:38] [ERR] lmsensor: Cannot get value of subfeature fan1_input: Can't read.
Nov 26 14:58:39 aio kernel: [106741.061242] amdgpu: [powerplay]
Nov 26 14:58:39 aio kernel: [106741.061242]  failed to send message 282 ret is 254
Nov 26 14:58:40 aio kernel: [106742.062397] amdgpu: [powerplay]
Nov 26 14:58:40 aio kernel: [106742.062397]  failed to send message 282 ret is 254
Nov 26 14:58:41 aio psensor.desktop[894]: [2019-11-26T04:58:41] [ERR] lmsensor: Cannot get value of subfeature fan1_input: Can't read.
Nov 26 14:58:41 aio kernel: [106743.061627] amdgpu: [powerplay]
Nov 26 14:58:41 aio kernel: [106743.061627]  failed to send message 282 ret is 254
Nov 26 14:58:42 aio kernel: [106744.061807] amdgpu: [powerplay]
Nov 26 14:58:42 aio kernel: [106744.061807]  failed to send message 282 ret is 254
Nov 26 14:58:43 aio psensor.desktop[894]: [2019-11-26T04:58:43] [ERR] lmsensor: Cannot get value of subfeature fan1_input: Can't read.
Nov 26 14:58:43 aio kernel: [106745.061298] amdgpu: [powerplay]
Nov 26 14:58:43 aio kernel: [106745.061298]  failed to send message 282 ret is 254
Nov 26 14:58:44 aio kernel: [106746.061593] amdgpu: [powerplay]
Nov 26 14:58:44 aio kernel: [106746.061593]  failed to send message 282 ret is 254
Nov 26 14:58:45 aio psensor.desktop[894]: [2019-11-26T04:58:45] [ERR] lmsensor: Cannot get value of subfeature fan1_input: Can't read.
Nov 26 14:58:45 aio kernel: [106747.064343] amdgpu: [powerplay]
Nov 26 14:58:45 aio kernel: [106747.064343]  failed to send message 282 ret is 254
Nov 26 14:58:46 aio kernel: [106748.061517] amdgpu: [powerplay]
Nov 26 14:58:46 aio kernel: [106748.061517]  failed to send message 282 ret is 254
```
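
To get a feel for which message IDs make up the spam, something like the following could be used to tally them. It is only a sketch: `tally_failed_msgs` is a hypothetical helper, and it assumes the lines look like the ones pasted in this thread.

```shell
#!/bin/sh
# Tally "failed to send message <id>" lines by message id, most frequent first.
# Reads the file(s) given as arguments, or stdin when none are given.
# Output format: "<id> <count>", one id per line.
tally_failed_msgs() {
    sed -n 's/.*failed to send message \([0-9a-fA-F]*\) ret is.*/\1/p' "$@" |
        sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}
```

Typical use would be `dmesg | tally_failed_msgs` or `journalctl -k | tally_failed_msgs`. A single ID dominating the output points at one attribute being polled over and over rather than a general driver failure.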

----------

## azp

Have you reported the issue as a bug to the devs?

EDIT: There seems to be a bug report on something similar: https://bugzilla.kernel.org/show_bug.cgi?id=204609

----------

## MasterCATZ

I had to turn off the systemd logs.

Every few days a 2 TB log file was created that needed nuking.

----------

