# Issue with incorrect temperature thresholds for AMD R9 380

## ZenoOfElea

I recently installed an AMD R9 380 (Tonga Pro) graphics card and upgraded to the 4.8.17-gentoo-source kernel. After some tinkering I was able to get lm-sensors / sysfs to expose the hwmon information for the card. However, I think I did not configure something correctly, as the threshold temperatures are populated as 0.0°C.

The only error message logged to dmesg is the following:

```
amdgpu 0000:02:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
```

My concern is twofold. First, my card uses "passive cool" technology with a thermal trip that starts the fans once the GPU reaches a certain temperature, and I am concerned that if the threshold temperatures are invalid the card's fans might not trip. Second, I want to use sysfs to set the minimum fan level higher than 0%, and I want to make sure any script I write has the right hardware information.

I was not sure whether I had built the modules for the i2c chips, so I just compiled all the ones for AMD, but that did not help, and I do not really know where to go from here.

----------

## Roman_Gruber

Please try a more recent kernel, i.e. the current stable version from kernel.org. Thank you.

Those readout values may or may not be trustworthy.

 *Quote:*   

> My concern is twofold. First, my card uses "passive cool" technology with a thermal trip that starts the fans once the GPU reaches a certain temperature, and I am concerned that if the threshold temperatures are invalid the card's fans might not trip. Second, I want to use sysfs to set the minimum fan level higher than 0%, and I want to make sure any script I write has the right hardware information.

 

Sounds like you played with the firmware of the card, which you should not have done in the first place.

When the card overheats, it may throttle (again, this is handled by the firmware of the device).

You may mount some additional fans and undervolt them so they are not that noisy.

Proper airflow needs some thinking and decent preparation. A proper case with proper placement of the components helps too.

--

Such things are usually handled by the firmware. 

The shipped GPU BIOS should usually handle everything.

--

If you want a proper temperature reading, you can mount a temperature sensor with a data logger.

----------

## Hu

Zeno: bogus sensor data reported to the user is unfortunately not that uncommon.  This could be an issue with the kernel driver mishandling the data, or that the sensor's firmware is non-standard and requires a kernel quirk to compensate for the vendor's ideas of how to express the data.  Assuming that the pieces are all from the same vendor and that the issue is that the vendor expressed the data in a non-standard way, then the vendor's firmware is probably able to do the right thing regarding starting the fans when needed.  If the fans are configured correctly by the vendor, they ought to start in plenty of time to prevent hardware damage, especially if you terminate the test quickly.  You will need an independent way to validate the actual temperature, since we cannot currently trust any of the data reported in software.  I suggest checking the documented temperature tolerances before designing any tests.  Find the temperature at which the fans are supposed to engage and, if possible, find the temperature at which the hardware could begin to suffer permanent damage.  Once you know how large a margin exists between those temperatures, we can design a test to see if the fans activate when they should.  That test will likely involve generating GPU load in order to generate heat.  We need to know how hot to make the card to activate the fans and we ought to know how long the card can safely withstand operating under load without fans.

----------

## ZenoOfElea

After doing some digging online, I found that the average temperature is around 65°C and that the Windows-based card manipulation program sets the threshold at which the fans kick in to 66°C. After some stress benchmarking, going by the lm_sensors readings and the raw temp1_input value, the fans did indeed kick on at 66°C. I am thinking that ASUS, the card's manufacturer, did not build into their card a way to expose the temp1_crit and temp1_crit_hyst values.

Since the fans kick on at roughly the same time the 66°C threshold is hit and cut off when the temperature drops to 65°C, I will go under the assumption that the temp1_input value is valid. However, when I leave my machine on with the card idling, the temperature still creeps up to 66°C, which seems way too high for an idle card.
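
For scripting against these readings, note that the hwmon sysfs interface reports temp1_input in millidegrees Celsius, so a raw 66000 is 66°C. Below is a minimal sketch of reading it and of checking whether a threshold file such as temp1_crit was actually populated. The function names read_temp_c and threshold_is_set are made up for illustration, and the hwmon directory is passed as an argument because the hwmon3 index is specific to this machine.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical. The hwmon directory is an
# argument because the hwmonN index can differ between boots and machines.

# Print temp1_input from the given hwmon directory in whole degrees Celsius.
# hwmon reports temperatures in millidegrees Celsius (66000 -> 66).
read_temp_c() {
    raw=$(cat "$1/temp1_input") || return 1
    echo $(( raw / 1000 ))
}

# Succeed only if the named threshold file exists and holds a non-zero value,
# i.e. the driver actually populated it.
threshold_is_set() {
    [ -r "$1/$2" ] || return 1
    val=$(cat "$1/$2")
    [ "${val:-0}" -ne 0 ]
}
```

For example, threshold_is_set /sys/class/drm/card0/device/hwmon/hwmon3 temp1_crit would fail on a card like this one, where the file is missing or reads 0.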

Looking around the hwmon directory for my card, I found 4 other files (pwm1, pwm1_enable, pwm1_max, pwm1_min) and a directory of interest. Using https://www.kernel.org/doc/Documentation/hwmon/sysfs-interface as a guide to what the files mean, I have made some observations that I do not fully understand.

pwm1_max and pwm1_min are not listed in the kernel sysfs docs and are read-only like temp1_input, so I am assuming these two are static min/max values for pwm1. The kernel docs seem to agree, as the values match the minimum and maximum that pwm1 can have.

pwm1_enable and pwm1 are both read-write. The default value of pwm1_enable on my system seems to be 1, with 0 meaning no fan control (fan at max) and >1 meaning automatic fan control. However, despite being a r/w file, pwm1_enable will not change from its default value of 1.

pwm1 is where I start to become confused, because the documentation says the file has a minimum value of 0 and a maximum value of 255, and the value of the file is equal to the percentage of the fan speed. The default value is 61 (~24%) despite the fans being off. I can manipulate the value with the echo command, but the value I echo into the file is not always the value the file changes to. I have not played with it enough to be sure, but changing the file's value does affect the fan speed: setting a value >140 gives a noticeable impact on the sensor value, and with a visual inspection I can tell the fans are on.

```
sh -c 'echo 150 >  /sys/class/drm/card0/device/hwmon/hwmon3/pwm1'
```

The command above seems to produce a value of 145 for my card, but

```
sh -c 'echo 255 >  /sys/class/drm/card0/device/hwmon/hwmon3/pwm1'
```

produces a value of 255. I do not understand the logic.
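
To make the 0 to 255 scale easier to reason about, here is a rough sketch that converts between raw pwm1 values and percentages, and that writes a value and prints what the driver reports back. The helper names are hypothetical, the hwmon directory is passed in as an argument, and writing to pwm1 normally requires root.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Convert a raw pwm value (0-255) to a percentage, rounded to nearest.
pwm_to_percent() {
    echo $(( ($1 * 100 + 127) / 255 ))
}

# Convert a percentage (0-100) to a raw pwm value, rounded to nearest.
percent_to_pwm() {
    echo $(( ($1 * 255 + 50) / 100 ))
}

# Write a raw value to pwm1 in the given hwmon directory, then print the
# value the driver reports afterwards. The two are not guaranteed to match.
set_pwm_checked() {
    echo "$2" > "$1/pwm1" || return 1
    cat "$1/pwm1"
}
```

With this rounding, 61 works out to about 24% and 255 to 100%. set_pwm_checked only makes a mismatch visible (for instance 150 being read back as 145); it does not explain why the driver adjusts the value.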

The only other thing of interest in the hwmon directory is a sub-directory labeled power containing 5 files that are not listed in the guide I gave above: autosuspend_delay_ms, control, runtime_active_time, runtime_status, and runtime_suspended_time. When examined, these return cat: /sys/class/drm/card0/device/hwmon/hwmon3/power/runtime_status_time: No such file or directory, auto, 0, unsupported, and 0 respectively. These might be vendor specific, and since I do not know what they do I am not messing with them.

Assuming my assumption is correct and the pwm1 value is the one I need to manipulate to change my fan speed, are there any safety concerns for my card if I use echo to change the value of pwm1 after boot up?
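
As a starting point for such a script, the sketch below raises pwm1 to a minimum percentage and never lowers it. It assumes that pwm1_min and pwm1_max hold the valid range, that pwm1_enable is already 1 (manual control in the kernel hwmon documentation), and that the driver may round the written value, which is why the function prints the value read back. apply_fan_floor and clamp are hypothetical names, and the directory argument is whatever hwmon path applies on the machine.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, and the behaviour of the
# real driver (rounding, ignoring writes) is an assumption to verify.

# Print $1 limited to the inclusive range [$2, $3].
clamp() {
    if [ "$1" -lt "$2" ]; then echo "$2"
    elif [ "$1" -gt "$3" ]; then echo "$3"
    else echo "$1"
    fi
}

# Raise pwm1 in hwmon directory $1 to at least $2 percent of the 0-255
# scale, limited to pwm1_min..pwm1_max. Never lowers the current value.
# Prints the value read back from pwm1 afterwards.
apply_fan_floor() {
    dir=$1
    lo=$(cat "$dir/pwm1_min") || return 1
    hi=$(cat "$dir/pwm1_max") || return 1
    cur=$(cat "$dir/pwm1") || return 1
    want=$(( ($2 * 255 + 50) / 100 ))
    want=$(clamp "$want" "$lo" "$hi")
    if [ "$cur" -lt "$want" ]; then
        echo "$want" > "$dir/pwm1" || return 1
    fi
    cat "$dir/pwm1"
}
```

Calling apply_fan_floor /sys/class/drm/card0/device/hwmon/hwmon3 40 as root would request roughly 40% (raw 102). Because it only ever raises the value, it cannot reduce cooling below whatever pwm1 was already set to.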

----------

## Roman_Gruber

 *Quote:*   

> Since the fans kick on at roughly the same time the 66°C threshold is hit and cut off when the temperature drops to 65°C, I will go under the assumption that the temp1_input value is valid. However, when I leave my machine on with the card idling, the temperature still creeps up to 66°C, which seems way too high for an idle card.

 

Nope.

As long as the "processor" does not throttle, it's fine. I refer to any silicon in this case.

The only issue may be that some parts will degrade faster and need to be replaced earlier. As we are talking about GPUs, the useful lifespan of the card is much shorter than the lifespan of its components, so it should not matter.

--

If you are worried, you can hardwire the fan or improve the cooling system, e.g. replace it with a better thermal component or improve the airflow. As mentioned, a decent case can improve a lot.

--

 *Quote:*   

> Looking around the hwmon directory for my card, I found 4 other files (pwm1, pwm1_enable, pwm1_max, pwm1_min) and a directory of interest. Using https://www.kernel.org/doc/Documentation/hwmon/sysfs-interface as a guide to what the files mean, I have made some observations that I do not fully understand. pwm1_max and pwm1_min are not listed in the kernel sysfs docs and are read-only like temp1_input, so I am assuming these two are static min/max values for pwm1. The kernel docs seem to agree, as the values match the minimum and maximum that pwm1 can have. pwm1_enable and pwm1 are both read-write. The default value of pwm1_enable on my system seems to be 1, with 0 meaning no fan control (fan at max) and >1 meaning automatic fan control. However, despite being a r/w file, pwm1_enable will not change from its default value of 1. pwm1 is where I start to become confused, because the documentation says the file has a minimum value of 0 and a maximum value of 255, and the value of the file is equal to the percentage of the fan speed. The default value is 61 (~24%) despite the fans being off. I can manipulate the value with the echo command, but the value I echo into the file is not always the value the file changes to. I have not played with it enough to be sure, but changing the file's value does affect the fan speed: setting a value >140 gives a noticeable impact on the sensor value, and with a visual inspection I can tell the fans are on.

 

You are looking at hardware registers.

Maybe they are coded wrongly and are actually read-only registers!

It depends on the firmware of every component involved and on the schematics. Schematics refer to any path the electrons flow through!

Those are the intellectual property of the manufacturer and usually not revealed, so you won't be able to see whether these can be changed, or whether it is even possible.

 *Quote:*   

> equal to the percentage of the fan speed

 

Nope, the fan characteristic with respect to PWM is certainly not linear. Now we are in the analog world, not the digital one.

 *Quote:*   

> Assuming my assumption is correct and the pwm1 value is the one I need to manipulate to change my fan speed, are there any safety concerns for my card if I use echo to change the value of pwm1 after boot up?

 

From an electronic perspective, nope.

You may overheat your card because you ruined the cooling; the processor (again, see above) may throttle down to get back into the safe zone. These days marketing speak talks about a power target and a thermal target (budget), just naming things differently to make them fancy and complicated!

If you want to be on the safe side, hardwire the fan to max, or pull out the PWM, which usually makes the fan spin at max.

PWM should only be used when there is enough headroom. PWM is not as efficient as a hardwired fan.

Some guys replace the stock cooling because it's limited.

----------

## ZenoOfElea

 *Quote:*   

> If you want to be on the safe side, hardwire the fan to max, or pull out the PWM, which usually makes the fan spin at max.

 

I am not exactly sure how to hardwire the fan to max. When I installed the card I did not see any jumpers or switches that would indicate a way to hardwire the fan to maximum, and the cooling system is rather exotic, not a reference design.

 *Quote:*   

> Sounds like you played with the firmware of the card, which you should not have done in the first place.

 

Actually I did not flash the firmware of the card at all.

 *Quote:*   

> you may mount some additional fans and undervolt them so they are not that noisy.
> 
> A proper airflow needs some thinking and decent preparation. A proper case with proper placement of the components helps too 

 

As for case airflow, that really is not an issue: my CPU idles around 28°C and maxes out at around 48°C, and if I set the pwm register controlling fan speed to a higher value than the default, my card sits idle at 34°C and goes up to about 55-60°C on full load.

 *Quote:*   

> As long as the "processor" does not throttle, it's fine. I refer to any silicon in this case.

 

If by throttle you mean the fan itself, it does if I do nothing after a fresh reboot, starting when it hits 66°C and stopping at 64°C. However, if by throttle you mean the clock speed, I do not know why it would; I keep it at stock speed.

----------

