# Graphics Card ... or Something Else [Understood]

## NeddySeagoon

Team,

I recently replaced my nVidia 9800 GT graphics card with a Radeon RX 460.  The idea being that I'll move the Radeon to a new system 'real soon'.

Amazon Warehouse deals had a good price on the card and it dropped £30 over the weekend while I watched. 

The system is a Phenom(tm) II X6 1090T on an M4A78T-E motherboard.

I suspect I've disturbed the 9 year old sediment because all is not well.

After the hardware swap and adding x11-drivers/xf86-video-amdgpu all seemed well - briefly.

At boot, the BIOS warned of a CPU fan issue.  All the fans are spinning and all the temperatures are OK.

That continues.

Intermittently there are lockups.

Most of the time, the system will not even respond to the reset button. 

When it comes back after power cycling, it's on 5 cores instead of 6.

There is nothing in any logs and ssh is unresponsive.

So far, I've updated everything graphics software related.  That's the kernel, at 4.14.0-gentoo and x11-drivers/xf86-video-amdgpu, which is now -9999.

It's early days with -9999.

I don't see a graphics card causing hard lockups, and the CPU booting with a core missing points to the CPU.

Next, I'll try a serial console.  I used to use X-Modem to an HP iPaq, so the bits are still in the chassis.

After that, I'll swap the graphics card back.

If you have any other ideas for the investigation, please post.

-- edit --

I'm not pushing the card very hard

```
$ sudo cat /sys/kernel/debug/dri/0/amdgpu_pm_info
Password: 
Clock Gating Flags Mask: 0x37bcf
   Graphics Medium Grain Clock Gating: On
   Graphics Medium Grain memory Light Sleep: On
   Graphics Coarse Grain Clock Gating: On
   Graphics Coarse Grain memory Light Sleep: On
   Graphics Coarse Grain Tree Shader Clock Gating: Off
   Graphics Coarse Grain Tree Shader Light Sleep: Off
   Graphics Command Processor Light Sleep: On
   Graphics Run List Controller Light Sleep: On
   Graphics 3D Coarse Grain Clock Gating: Off
   Graphics 3D Coarse Grain memory Light Sleep: Off
   Memory Controller Light Sleep: On
   Memory Controller Medium Grain Clock Gating: On
   System Direct Memory Access Light Sleep: Off
   System Direct Memory Access Medium Grain Clock Gating: On
   Bus Interface Medium Grain Clock Gating: Off
   Bus Interface Light Sleep: On
   Unified Video Decoder Medium Grain Clock Gating: On
   Video Compression Engine Medium Grain Clock Gating: On
   Host Data Path Light Sleep: Off
   Host Data Path Medium Grain Clock Gating: On
   Digital Right Management Medium Grain Clock Gating: Off
   Digital Right Management Light Sleep: Off
   Rom Medium Grain Clock Gating: On
   Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
   300 MHz (MCLK)
   214 MHz (SCLK)
   0.127 W (VDDC)
   0.18 W (VDDCI)
   5.50 W (max GPU)
   5.145 W (average GPU)

GPU Temperature: 29 C
GPU Load: 0 %

UVD: Disabled
VCE: Disabled
```

----------

## Jaglover

It sounds like a messed up BIOS, have you tried a hard reset? Next thing would be checking all the voltages under normal load, I wouldn't trust the M/B sensors, I'd use a real voltmeter.

----------

## NeddySeagoon

Jaglover

I have not measured voltages directly - I don't trust the motherboard sensor either.

Good idea.  I'll do that.

A power off reset works.  The only time I have seen it power up on 5 cores instead of 6 cores is after a lock up that will not respond to the reset button.

That's hard wired to the CPU reset pin.  

Why do you say it's a messed up BIOS?

After the kernel has got control, the BIOS is no longer used.

I did wonder about the GPU firmware.  As the kernel part is built in, that's a kernel rebuild.  I'm following 4.14 fairly regularly. 

I'll be sure to check for firmware updates before I build the kernel, not after.

----------

## Jaglover

I meant CMOS reset like removing the battery or setting the reset jumper. The OS may get wrong ideas about hardware if it is messed up.

----------

## krinn

I would look at cat /proc/interrupts, making sure the card is using MSI (edge) and is alone on its IRQ. If not, try to isolate the card by checking the PCI slot IRQ sharing in the m/b manual, or adjust it if the BIOS allows it.

(while you look, check if you have Thermal events interrupts > 0)
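For anyone following along, here is a rough sketch of that check in Python. It assumes the `/proc/interrupts` layout used by recent 4.x kernels (counters, then chip name, then trigger type, then handler names); the sample text is made up for illustration and the function names are hypothetical. On a real system you would pass in the contents of `/proc/interrupts` instead.

```python
# Illustrative /proc/interrupts excerpt: made-up numbers, not taken from this system.
SAMPLE = """\
           CPU0       CPU1
 16:        100          5   IO-APIC   16-fasteoi   ehci_hcd:usb1, nouveau
 27:      12345          0   PCI-MSI 524288-edge      amdgpu
THR:          0          0   Thermal event interrupts
"""


def find_device_irqs(interrupts_text, device):
    """Return (irq, uses_msi, shared_with) for every IRQ line that names `device`."""
    found = []
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue  # CPU header row, or a named row such as THR
        fields = rest.split()
        # Skip the per-CPU counters; what follows is chip, trigger type, handlers.
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1
        if len(fields) < i + 3:
            continue
        chip = fields[i]
        handlers = " ".join(fields[i + 2:]).split(", ")
        if device in handlers:
            others = [h for h in handlers if h != device]
            found.append((int(label), "MSI" in chip, others))
    return found


def thermal_events(interrupts_text):
    """Sum the per-CPU counters on the THR (thermal event) row, 0 if absent."""
    for line in interrupts_text.splitlines():
        if line.strip().startswith("THR:"):
            return sum(int(f) for f in line.split()[1:] if f.isdigit())
    return 0


for irq, uses_msi, shared_with in find_device_irqs(SAMPLE, "amdgpu"):
    print(f"IRQ {irq}: MSI={uses_msi}, shared with {shared_with or 'nothing'}")
print("Thermal events:", thermal_events(SAMPLE))
```

An empty `shared_with` list together with `uses_msi` being true is the "MSI and alone on its IRQ" case; a non-empty list shows which other drivers sit on the same line.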

But I'm more scared about the cpu fan issue, the lockup and core lost.

I would get the good old card back and see if the lock/fan and core problems are gone with it.

ps: yeah, in my mind, it doesn't really smell good NeddySeagoon

----------

## NeddySeagoon

Jaglover,

The voltages measured with a DVM are OK.  Sensors says

```
$ sensors
atk0110-acpi-0
Adapter: ACPI interface
Vcore Voltage:        +1.34 V  (min =  +0.85 V, max =  +1.70 V)
 +3.3 Voltage:        +3.29 V  (min =  +2.97 V, max =  +3.63 V)
 +5 Voltage:          +5.03 V  (min =  +4.50 V, max =  +5.50 V)
 +12 Voltage:        +12.44 V  (min = +10.20 V, max = +13.80 V)
CPU FAN Speed:       1002 RPM  (min =  600 RPM, max = 7200 RPM)
CHASSIS FAN Speed:    618 RPM  (min =  600 RPM, max = 7200 RPM)
CHASSIS FAN 2 Speed:  414 RPM  (min =  600 RPM, max = 7200 RPM)
CPU Temperature:      +35.0°C  (high = +60.0°C, crit = +95.0°C)
MB Temperature:       +34.0°C  (high = +45.0°C, crit = +75.0°C)

amdgpu-pci-0100
Adapter: PCI adapter
fan1:             N/A
temp1:        +26.0°C  (crit =  +0.0°C, hyst =  +0.0°C)

k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +22.5°C  (high = +70.0°C)
                       (crit = +90.0°C, hyst = +85.0°C)
```

which is in good agreement with the DVM.  I can't measure Vcore.
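As a cross-check on the readings above, a small Python sketch can compare each voltage and fan line against the min/max that `sensors` itself reports. The readings are copied from the output above; the regular expression and the function name are my own and only cover the `V` and `RPM` lines in this particular layout.

```python
import re

# Voltage and fan lines copied from the atk0110 section of the sensors output above.
READINGS = """\
Vcore Voltage:        +1.34 V  (min =  +0.85 V, max =  +1.70 V)
 +3.3 Voltage:        +3.29 V  (min =  +2.97 V, max =  +3.63 V)
 +5 Voltage:          +5.03 V  (min =  +4.50 V, max =  +5.50 V)
 +12 Voltage:        +12.44 V  (min = +10.20 V, max = +13.80 V)
CPU FAN Speed:       1002 RPM  (min =  600 RPM, max = 7200 RPM)
CHASSIS FAN Speed:    618 RPM  (min =  600 RPM, max = 7200 RPM)
CHASSIS FAN 2 Speed:  414 RPM  (min =  600 RPM, max = 7200 RPM)
"""

# name: value unit (min = lo unit, max = hi unit)
LINE = re.compile(
    r"^\s*(?P<name>[^:]+):\s+\+?(?P<value>[\d.]+)\s*(?P<unit>V|RPM)\s+"
    r"\(min =\s*\+?(?P<lo>[\d.]+)\s*(?:V|RPM),\s*max =\s*\+?(?P<hi>[\d.]+)\s*(?:V|RPM)\)"
)


def out_of_range(sensors_text):
    """Return (name, value, unit, lo, hi) for readings outside their own limits."""
    bad = []
    for line in sensors_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # temperatures, adapter names and chip headers are ignored
        value, lo, hi = float(m["value"]), float(m["lo"]), float(m["hi"])
        if not lo <= value <= hi:
            bad.append((m["name"], value, m["unit"], lo, hi))
    return bad


for name, value, unit, lo, hi in out_of_range(READINGS):
    print(f"{name}: {value:g} {unit} is outside {lo:g}..{hi:g} {unit}")
```

On these numbers all four voltages sit inside their limits; the only line flagged is CHASSIS FAN 2 at 414 RPM against a 600 RPM minimum.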

I have not checked the ripple. That's harder.  Even though the PSU is 9 years old, it's had an easy life.  It's an 850W Corsair unit, so it's been well derated.

Taking some of my own advice, I've done a visual on the Vcore regulator. There's nothing nasty there.

I'll do a CMOS reset before I revert graphics cards.  Maybe even fit a new battery. The CMOS battery has never been replaced but it only ever does anything during power cuts as the 5v STBY is always there. 

krinn,

The graphics card is using MSI and has an interrupt to itself. There is nothing nasty in /proc/interrupts.

This morning, the system started normally.  If it would break and stay broken, it would be easy to fix.

I'm tempted to turn off three or four cores in the BIOS and see if that has any effect, besides slowing down emerge.

That's the first step to identifying a potentially faulty core.

----------

## P.Kosunen

What power supply is it, and how old is it?

----------

## NeddySeagoon

P.Kosunen,

The PSU is an 850W Corsair from 2009.

----------

## 1clue

Take a piece of raw chicken, lay it on top of your old nVidia card. Put it in the oven at 350F for 20 minutes, be careful that the pci pins point to the magnetic north.

Just guessing that this might be a satisfactory substitute for a whole, living chicken. Or it might really tick off the dark nVidia gods.

Seriously, I have no real help. I've never successfully switched video cards like that, across brand. Mostly this post is to subscribe me to the thread, to make it easier to find out how this turned out.

----------

## NeddySeagoon

1clue,

Welcome :)

This motherboard has a built in Radeon of some sort.  I only used it for a few days while I was waiting for the nVidia card to arrive.

It's a real Radeon too - none of this sharing main memory for the pixel buffer.

Even by the standards of 2009, it was a poor card but there's no AGP slot so it was that or nothing for a while.

Now back to ATI/AMD, so I've done the switch both ways.

Last thing last night I emerged linux-firmware, updated the kernel to 4.14.2 and discovered that I've had the AMD microcode updater in the kernel since 2009 but never added the microcode. Oops.

Fixed that too.

So far today it's been OK.  Maybe I shouldn't say that.

----------

## Ant P.

Don't worry over the CPU missing a core - I get that with my own Phenom II (slightly older 720 model) sometimes, restarting makes it clear up. I guess it's a common problem with the microcode/BIOS/whatever.

----------

## NeddySeagoon

Ant P.

It's something that has only just started to happen.  With the hard lockups, not even responding to reset, I was wondering about a failing CPU core.

I'm glad it's a feature and not a fault.

----------

## bunder

did you ever overclock it?  could be an electromigration problem...  although that's usually theorized to be a 30 year problem, not a 9 year problem.   :Laughing: 

----------

## NeddySeagoon

bunder,

Nope.  I'm an Electronics Engineer (retired), so I know better.

Even working hard it only reaches 60 C when the heatsink is clogged.

That's how I know to clean it.

----------

## krinn

 *NeddySeagoon wrote:*   

> I'm glad it's a feature and not a fault.

 

 :Very Happy: 

----------

## NeddySeagoon

Well, no improvement yet.

It ran all day yesterday. After 10 min this morning, it locked up and a reset brought it back on 5 cores.

I'll let it run like that to see if that leads to any improvement.

The new kernel and CPU microcode seem to have made it less crash prone but it's early days.

One day is not statistically significant.

----------

## P.Kosunen

 *NeddySeagoon wrote:*   

> The PSU is an 850W Corsair from 2009.

 

Depending on the model, it might be the cause of the issues. Models with all Japanese caps should still be good, but if it has some inferior caps included, those might be going bad. (IIRC Corsair made both, good ones with all Japanese caps, but also not so good ones.)

----------

## John R. Graham

 *NeddySeagoon wrote:*   

> Well, no improvement yet.

 Have you tried just swapping back to the old card yet? Just to confirm that symptoms go away, that is.

- John

----------

## NeddySeagoon

John R. Graham,

That's the next step.  If normality returns, it will only confirm that it's a system issue. That's not the same as saying it's a graphics card problem.

Ask any early Ryzen adopter.  :)

----------

## krinn

 *NeddySeagoon wrote:*   

> John R. Graham,
> 
> That's the next step.  If normality returns, it will only confirm that it's a system issue. That's not the same as saying it's a graphics card problem.
> ...

 

No NeddySeagoon, it would confirm it is really a system issue and answer:

- does the new card do this (an incompatibility somewhere, or just some params to tweak)

- or has your system turned bad while you manipulated it, or because the new card has borked it.

You had no lockups with the nvidia, all was fine. Going back to the nvidia would tell you whether you still need to fight with parameters or whether your system is now damaged (in which case fighting with parameters would do nothing).

For now you are battling against the card/kernel/whatever on a system that might have an issue that is not due to the card itself.

You really should just go back to the nvidia card and see if all is fine; if yes, it would confirm you're not fighting against the wind.

----------

## Jaglover

Maybe the new card draws too much power from m/b. Does it have a separate power connector?

----------

## NeddySeagoon

Jaglover,

No, it's motherboard powered. It's a fanless XFX RX 460.

For the time being it's in a PCIe 2.0 slot but it's supposed to be backwards compatible.

I've never seen the card draw more than 7W.

The card it replaced had a 6 pin connector for 12v.

----------

## Jaglover

Totally off topic, but I'm curious. Why do the British disrespect Alessandro Volta? Because he is not one of them, foreign? All units named after a person are uppercase, yet the British write v for volts, but W for watts - named after James Watt.   :Razz: 

----------

## NeddySeagoon

Jaglover,

We write A for Ampere and André-Marie Ampère was one of the old enemy ;)

I would write mV for millivolts, not mv, and kV for kilovolts. Remember CRT EHT power supplies?

I wonder if there is some ambiguity about the symbol V alone?

Other than it being a Roman Numeral, none comes to mind.

----------

## 1clue

 *Jaglover wrote:*   

> Totally off topic, but I'm curious. Why do the British disrespect Alessandro Volta? Because he is not one of them, foreign? All units named after a person are uppercase, yet the British write v for volts, but W for watts - named after James Watt.  

 

I'm curious why the lack of capitalization denotes disrespect as opposed to lack of knowledge or maybe as just plain laziness?

Speaking as a lazy American who has been using v for volts and a for amps without ever knowing or wondering about capitalization for the past 40 years.

----------

## krinn

 *NeddySeagoon wrote:*   

> Other than it being a Roman Numeral, none comes to mind.

 

It also means V for victory (well, "victoire", which is victory in French). It is easily made with the fingers and the resistance used it, but I'm unsure if it came from that. De Gaulle did too (see last phrase).

Did I win most off topic answer?  :Smile: 

----------

## NeddySeagoon

krinn,

The V for victory came from the Battle of Agincourt where the French came second.

 *Quote:*   

> ... that the French had boasted that they would cut off two fingers from the right hand of every archer ...

 

The V is supposed to be derived from the English archers showing off their bowstring fingers.

That's English history anyway :) 

Back to topic.

After rebooting on 5 cores, the system ran all day yesterday.

This morning, it's back on 6 cores.

5 cores or 6 cores make a difference to the 12v PSU load.  I'm not sure if the PSU has a split 12v supply or not.

The reduced 12v load may be a pointer.

After the next incident, I'll go back to the old graphics card.

----------

## Tony0945

 *NeddySeagoon wrote:*   

> The V for victory came from the Battle of Agincourt where the French came second.
> 
>  *Quote:*   ... that the French had boasted that they would cut off two fingers from the right hand of every archer ... 
> 
> The V is supposed to be derived from the English archers showing off their bowstring fingers.
> ...

 

I had heard long ago that Churchill used it because the English use the two finger salute while the Americans use the middle finger salute.

Works for the English archers at Agincourt also.

----------

## krinn

 *NeddySeagoon wrote:*   

> The V is supposed to be derived from the English archers showing off their bowstring fingers.
> 
> That's English history anyway  

 

Funny because they were archers of King Henry V and the logical choice that came to my mind is showing their allegiance to King V rather than their bowstring fingers 

We more came last than second  :Very Happy: 

And you wonder why we have cut those stupid idiots King's head?

Tony0945: look at https://en.wikipedia.org/wiki/V_sign#Victory_sign

Sad it derail the subject so bad, because it's interesting point where the V sign came from.

----------

## Tony0945

 *krinn wrote:*   

> And you wonder why we have cut those stupid idiots King's head?

 

I'd like to cut off a stupid idiot President's head. (Sorry I lapsed into politics)

 *Quote:*   

> 
> 
> Tony0945: look at https://en.wikipedia.org/wiki/V_sign#Victory_sign

 

So that's why those few bars from the Fifth keep being played in the movie "The Longest Day"!

 *Quote:*   

> 
> 
> Sad it derail the subject so bad, because it's interesting point where the V sign came from.

 

Yes, such an interesting conversation, but you are right, mon ami.

----------

## NeddySeagoon

krinn,

I was trying to avoid an international incident :)

Back to topic.

The symptom recurred, so I'm back to the 

```
NVIDIA Corporation G92 [GeForce 9800 GT] (rev a2)
```

The BIOS CPU Fan warning has gone away.  I suspect that it's linked with the BIOS cool'n'quiet.

It may not understand a fanless Graphics card and probably uses the same string everywhere. 

I'll boot with the GPU fan stalled and see what happens.

While I was in the case,  I rearranged the dust so I could read the PSU label.  

It's a Corsair CMPSU-850TX

5V 30A

3V3 30A

12V 70A

-12V 0.8A 

5V Stby 3A

Note the use of uppercase V to denote Volts :)

No split 12V supply.

The CPU is about 120W. That's 10A  from the +12V

Four HDD spin motors at about 1.2A each ... 5A

Say another 120W for the old graphics card, 10A. It has a separate +12V connector.

That's about 25A total, when the old graphics card is working hard.
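The same arithmetic as a short Python sketch, using only the figures quoted above (120W CPU, four drives at about 1.2A each, 120W for the old graphics card, and the 70A rating on the PSU label). The variable names are mine.

```python
RAIL_VOLTS = 12.0        # all of these loads hang off the +12V rail
RAIL_LIMIT_AMPS = 70.0   # 12V rating from the PSU label


def amps(watts, volts=RAIL_VOLTS):
    """Current drawn from the rail for a given power, I = P / V."""
    return watts / volts


loads = {
    "CPU, about 120W": amps(120),
    "4 HDD spin motors, about 1.2A each": 4 * 1.2,
    "old graphics card, about 120W": amps(120),
}

total = sum(loads.values())
for name, current in loads.items():
    print(f"{name}: {current:.1f}A")
print(f"total: {total:.1f}A of {RAIL_LIMIT_AMPS:.0f}A "
      f"({100 * total / RAIL_LIMIT_AMPS:.0f}% of the rail)")
```

Without rounding the drives up to 5A the sum comes to 24.8A, a little over a third of what the rail is rated for.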

With the new card, the total 12V load is less but it's distributed differently.

Being lazy, I've been using the dual link DVI cable with both cards.

Maybe moving to DisplayPort with the new card will help?

----------

## Tony0945

 *NeddySeagoon wrote:*   

> While I was in the case,  I rearranged the dust so I could read the PSU label.  

 

That may be your problem. I would blow out the PSU, the CPU cooler, the graphics card and the motherboard generally.

I use a 5 gallon compressor that's in my garage for car tires. I lower the pressure to 20 psi (have no idea what that is in metric), it's usually set at 40 for the tires (or tyres  :Smile: )

I have to hold the nozzle quite close to the CPU cooler to get all the dust out of the fins. Dust in the PSU can cause lots of trouble and so can old capacitors. But as you are a fellow engineer, I needn't tell you that.

----------

## NeddySeagoon

Tony0945,

I clean the dust off when the CPU gets to 60C under load.

I use a stiff natural bristle brush, so I don't run the risk of static damage due to dry air becoming charged.

Air out of a compressor is dried as a side effect of being compressed.

20 psi is about 1.4 bar but I still check my tyres in psi :)

The PSU may be overdue a clean but with the total load down, I don't see it as a dust or thermal issue.

It's also very difficult to remove without taking everything out of the case.

The 12V  connector for the CPU regularly gets charred. I have to keep cleaning that.

Four pins (total) for the Vcore is not enough.

I'll try wiping the contacts on the main motherboard power connector.  It will be carrying more 12v current than it used to.

----------

## P.Kosunen

 *NeddySeagoon wrote:*   

> Its a Corsair CMPSU-850TX

 

http://www.corsair.com/en-eu/tx850w

http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/20793-corsair-tx850-850w-power-supply-review-7.html

 *Quote:*   

> High quality Japanese capacitors provide uncompromised performance and reliability.

 

This model should be still good.

----------

## Tony0945

I have seen warnings about using a compressor, thus have lowered the pressure. I don't like the canned air because it leaves a greasy residue. The brush sounds good but doesn't it raise a static risk also? 

Your PSU is apparently made by Channel Well rather than Seasonic http://whirlpool.net.au/wiki/psu_manufacturers

I'm not familiar with them but apparently they made the Antec Smartpower 2.0 which was not bad but not terribly efficient at low power levels.

Last year I replaced mine with an Antec EDGE made by Seasonic. I now have one Seasonic PSU, two Antec branded Seasonic's and one Antec branded Delta.

I don't like the charred connector. I have never had that even though I run my 125W Phenom II 24 hours a day (Seasonic S12II 430B). The PSU is rated at 17A http://www.hardwaresecrets.com/seasonic-s12ii-bronze-430-w-power-supply-review/6/

----------

## 1clue

When you're cleaning out your machine you should leave it plugged in but powered off. That leaves the ground in place, which protects your electronics.

Any time you have air moving across anything it generates a bit of static electricity. The vacuum, the brush, whatever all have some risk of that.

AFAIK the biggest problem with a compressor is that the air coming out of it is not just air. It also has water droplets, dirt, oil and whatever else got sucked into the inlet. Including small bits of metal.

----------

## NeddySeagoon

Tony0945,

Natural bristle is not a very good insulator, unlike say, nylon bristles.

While the brushing action will raise static, its conducted away again too.

The charred connector is a feature.  This motherboard was probably not designed for a six core Phenom II.

The BIOS it was shipped with certainly wasn't.  I had to do a BIOS update before it would boot the six core CPU.

It looks like the CPU power connector does not have enough pins for a six core Phenom II working hard, so the connector overheats.

It's a feature I'm aware of and manage.

----------

## Tony0945

Which mobo? Mine is Gigabyte GA-880GA-UD3H. CPU is AMD Phenom II X6 1090T

It has onboard Radeon graphics but the Windows Catalyst drivers were so bad that I disabled it and installed an EVGA GeForce 8400GS.

I don't think it has a fan. Right now it's at 121F (50 C).

----------

## NeddySeagoon

Tony0945,

```
Base Board Information
   Manufacturer: ASUSTeK Computer INC.
   Product Name: M4A78T-E
   Version: Rev 1.xx
```

This board also has an onboard Radeon.  It wasn't a good GPU when it was new and it's been disabled since then.

The old GPU, now fitted, says

```
nouveau-pci-0100
Adapter: PCI adapter
GPU core:     +1.05 V  (min =  +0.95 V, max =  +1.10 V)
fan1:        1494 RPM
temp1:        +54.0°C  (high = +95.0°C, hyst =  +3.0°C)
                       (crit = +95.0°C, hyst =  +5.0°C)
                       (emerg = +135.0°C, hyst =  +5.0°C)
```

It's not working hard either.

----------

## NeddySeagoon

After a wee while back on the old nVidia card, I've refitted the RX 460.

The CPU Fan Error went away when the nVidia card was reinstalled.

It's stayed away with the RX 460 refitted ... well, it's only one boot so far.

I'm still using the Dual Link DVI port, I'll need to rake about for DisplayPort cables.

-- edit --

Rebooted to give the motherboard power connectors a good 'graunch' in an attempt to reduce the contact resistance.

----------

## NeddySeagoon

It might be motherboard RAM ..

It's failed 3 times today.  On 5 cores, on 6 cores and while doing a world update.

Always on the DisplayPort connector.

I was about to try the new AMD code in kernel-4.15 but I get segfaults trying to build it.

However, each invocation of make builds a bit more.

I've kicked off memtest86+; it's half way through the first pass.

-- edit --

memtest86+ passed the first pass, so I've stopped it.

I still get 

```
/bin/sh: line 1: 29635 Segmentation fault      ./tools/objtool/objtool orc generate --no-fp "arch/x86/kernel/x86_init.o"
```

with varying file names, even after rebuilding objtool.

However, it's building on my mediaserver ...

----------

## Tony0945

Neddy, time for a new mobo https://www.ebay.com/itm/Genuine-Gigabyte-Technology-GA-880GA-UD3H-Rev-2-2-AM3-AMD-Motherboard-1/112660553884?epid=129638825&hash=item1a3b17a09c:g:f3YAAOSwbF1aHNqy

Or perhaps a full upgrade. I think these are new enough to be free of the segfault bug. https://www.newegg.com/Product/Product.aspx?Item=N82E16819113434&cm_re=ryzen_1600-_-19-113-434-_-Product plus mobo plus new memory.

----------

## Ant P.

I'm in a position to volunteer as test subject for that: I have a GA-MA770-UD3, and I've got a RX550 coming early next week. These Gigabyte boards (the -UD3 ones) are advertised as having thicker traces for power so maybe that'll make some difference.

The card specs say it pulls max 50W through the motherboard - my current one is a 6450 which is something weak like 25W. Either way I'm expecting a decent light show, whether it's on screen or on the mobo.

----------

## NeddySeagoon

Tony0945,

I'm planning a full upgrade.  I got the video card early as I thought I might get a 4k display too :)

However my existing 9 year old video card won't drive 4k and I'm not one to keep a Ferrari just to drive to the corner shop.

Amazon had a good deal on this one and I wanted fanless ...

My motherboard is PCIe 2.0, maybe 2.1.  The card is PCIe 3.0, so it can get a lot more data down each lane.

Hmm ... It's in 8 lane mode now.  I can set it in the BIOS.  Maybe the timing balances on each lane are not what they need to be.

Heh a single lane Graphics card  :) 
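To put rough numbers on "a lot more data down each lane", here is a small Python sketch using the published per-lane signalling rates and line encodings (PCIe 2.x: 5 GT/s with 8b/10b; PCIe 3.0: 8 GT/s with 128b/130b). It gives raw link bandwidth per direction and ignores packet and protocol overhead, so real throughput is lower. The function names are mine.

```python
# generation -> (GT/s per lane, payload bits, bits on the wire)
GENERATIONS = {
    2: (5.0, 8, 10),      # PCIe 2.x, 8b/10b encoding
    3: (8.0, 128, 130),   # PCIe 3.0, 128b/130b encoding
}


def lane_mb_per_s(gen):
    """Raw bandwidth of one lane, one direction, in MB/s (10**6 bytes)."""
    gt_per_s, payload_bits, wire_bits = GENERATIONS[gen]
    return gt_per_s * 1e9 * payload_bits / wire_bits / 8 / 1e6


def link_mb_per_s(gen, lanes):
    """Raw bandwidth of a link with the given lane count, one direction."""
    return lane_mb_per_s(gen) * lanes


for gen in (2, 3):
    print(f"PCIe {gen}: {lane_mb_per_s(gen):.0f} MB/s per lane")
for lanes in (1, 8, 16):
    print(f"PCIe 2 x{lanes}: {link_mb_per_s(2, lanes):.0f} MB/s")
```

A PCIe 3.0 card in a 2.x slot negotiates down to the slot's rate, so the PCIe 2 rows are the ones that apply here: roughly 500 MB/s on a single lane against 4000 MB/s in 8 lane mode.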

Look at the newegg prices and change the $ to £.  Buying from a US site costs me import duty and VAT.  That adds about 40% to the cost.

In some cases, it's lower cost to fly to the USA, do your shopping, some sight seeing and put the stuff in your hold baggage.

Ant P.

I look forward to hearing about your light show.

-- edit --

Choosing the stackframe unwinder avoids the segfaults on the Phenom.

It's running on 4.15.0-rc1, with the new AMD kernel code and a single PCIe lane just now.
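If you want to see which unwinder a given kernel `.config` selects before kicking off a build, a sketch like the following will do. `CONFIG_ORC_UNWINDER` is the option name reported in this thread for 4.14; the other names in the table are what I believe the 4.14 and 4.15 Kconfig entries are called, so treat them as assumptions and check your own tree. The sample config text is made up.

```python
# Option name -> description. Only CONFIG_ORC_UNWINDER is confirmed in this
# thread; the rest are assumed names for the 4.14 and 4.15 unwinder choice.
KNOWN_UNWINDERS = {
    "CONFIG_ORC_UNWINDER": "ORC (4.14 name)",
    "CONFIG_FRAME_POINTER_UNWINDER": "frame pointer (4.14 name)",
    "CONFIG_GUESS_UNWINDER": "guess (4.14 name)",
    "CONFIG_UNWINDER_ORC": "ORC (4.15 name)",
    "CONFIG_UNWINDER_FRAME_POINTER": "frame pointer (4.15 name)",
    "CONFIG_UNWINDER_GUESS": "guess (4.15 name)",
}

# Illustrative fragment in the usual .config syntax, not a real config.
SAMPLE_CONFIG = """\
CONFIG_X86_64=y
# CONFIG_UNWINDER_ORC is not set
CONFIG_UNWINDER_FRAME_POINTER=y
# CONFIG_UNWINDER_GUESS is not set
"""


def selected_unwinders(config_text):
    """Return descriptions of the unwinder options set to 'y', in file order."""
    selected = []
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and '# CONFIG_FOO is not set' comments
        name, _, value = line.partition("=")
        if name in KNOWN_UNWINDERS and value == "y":
            selected.append(KNOWN_UNWINDERS[name])
    return selected


print(selected_unwinders(SAMPLE_CONFIG))
```

To use it on a real tree, read the file with `open("/usr/src/linux/.config").read()` (path assumed) and pass the text in.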

----------

## Tony0945

 *NeddySeagoon wrote:*   

> Look at the newegg prices and change the $ to £.  Buying from a US site costs me import duty and VAT.  That adds about 40% to the cost.
> 
> In some cases, it's lower cost to fly to the USA, do your shopping, some sight seeing and put the stuff in your hold baggage. 

  That was just for an example. I would assume there is a european source similar, amazon.uk?  That sucks about the duty. Taxes are taxes. Some years ago I bought a DVD from amazon.uk (The Tripods: The White Mountains) that was not available in the US. I don't think they were supposed to sell it me because of copyright but they did. No duty, just shipping. Not even Illinois "use tax". Years later I joined a class action lawsuit (the usual postcard offer) about the exchange rate and got the default settlement that cut the price way down. Go figure. I don't mind artists getting compensated for their work. I think that's fine. But copyright has become a way for mega-corporations to restrain trade and that's wrong. There was a bookshop in Toronto that I visited in the mid-80's. For 15 years after, I bought British editions of books there that weren't available in the US. I used my credit card by phone and then internet and they mailed me the books.

The used board from ebay hopefully wouldn't have VAT tax.  I suppose Brexit will just make it worse.

----------

## NeddySeagoon

After two days of kernel-4.15.0-rc1 using the new AMD driver that appears there and operation on a single PCIe lane, I've put the BIOS back to auto to detect the PCIe lane count.

I really know better than to change several parameters at the same time but a series of controlled experiments to do a binary search works too.

If this locks up, I'll put it down to PCIe lane trace length and impedance matching on a 9 year old motherboard and live with it until after Christmas, then get a new system in the January sales.

At least I can put it back to single PCIe lane operation.

There is also another slot I can try on the motherboard but that means gutting the system to rearrange all the plug in cards.

Single PCIe lane operation seems to work for my graphics workload.  

Maybe a collection of 16 lane graphics cards all working on the same frame is overrated :)

----------

## Tony0945

 *NeddySeagoon wrote:*   

> If this locks up, I'll put it down to PCIe lane trace length and impedance matching on a 9 year old motherboard and live with it until after Christmas, then get a new system in the January sales.

 

Neddy, what CPU do you favor for your new system?

----------

## NeddySeagoon

Tony0945,

Ryzen, or dual Ryzen in the same package :)

----------

## ZeuZ_NG

@NeddySeagoon,

have you tried to update BIOS? Sorry if you did, tried to read the whole thread but didn't catch if you did so..

If getting segfaults randomly, I would also try swapping memories.. And wouldn't risk doing a live update on the BIOS..

Rather take out a programmer like ch341a which is cheap and supports most desktop BIOS chips

----------

## NeddySeagoon

ZeuZ_NG,

I have the latest BIOS.  It's old but there is nothing to update it to.

memtest86+ passes.

I think the segfaults are not related to the graphics problem because if I change a kernel option the kernel builds without any segfaults.

Its a new option in 4.15.0 too, so maybe it has a bug. 

I'll test a while longer with 4.15.0-rc1 and in a week or two, try another -rc.  I don't want any more variables than I can help.

That makes the problem space bigger.

----------

## adriend

I got the same error building kernel 4.14.4-gentoo with CONFIG_ORC_UNWINDER=y and this thread is the only reference to it on the internet :

 *NeddySeagoon wrote:*   

>  
> 
> ```
> /bin/sh: line 1: 29635 Segmentation fault      ./tools/objtool/objtool orc generate --no-fp "arch/x86/kernel/x86_init.o"
> ```
> ...

 

I just switched the linker to ld.bfd (instead of ld.gold) and it builds fine.

```
binutils-config --linker ld.bfd
```

The linker can of course be switched back to ld.gold after kernel build.

(as a side note I switched to BFD linker because of x32 VDSO build error but then I tried again CONFIG_ORC_UNWINDER and it worked...)

----------

## NeddySeagoon

I've been poking at this every few days, trying something else.

Different kernels, the new amdgpu in kernel-4.15

Putting the graphics card in the primary and secondary slots.

Turning off various combinations of CPU cores.

Everything changed the frequency of the lockups but nothing eliminated them.  

It's now run almost three days without a lockup.

I had to turn off Message Signalled Interrupts. 

MSI going wrong explains the lockups. The CPU gets the address of the IRQ service routine in the message and begins executing code there ... only sometimes the message is garbled.

I'll put it down to using a PCIe v3 card in a PCIe v2.1 slot and needing to use the probably not well tested PCIe fallback (to the card) signalling. 
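The post does not say how MSI was turned off. Two ways that exist are the `pci=nomsi` kernel parameter (system-wide) and the amdgpu module parameter `msi=0`, written `amdgpu.msi=0` on the kernel command line when the driver is built in. Assuming one of those was used, this Python sketch checks a kernel command line for them. The function name and the sample command lines are hypothetical; on a live system the text would come from `/proc/cmdline`.

```python
def msi_disable_options(cmdline):
    """Return the MSI-disabling options found on a kernel command line."""
    found = []
    for token in cmdline.split():
        key, _, value = token.partition("=")
        # pci= takes a comma-separated list of sub-options.
        if key == "pci" and "nomsi" in value.split(","):
            found.append("pci=nomsi (system-wide)")
        elif key == "amdgpu.msi" and value == "0":
            found.append("amdgpu.msi=0 (amdgpu only)")
    return found


# Hypothetical command lines for illustration only.
for cmdline in (
    "root=/dev/sda3 pci=nomsi",
    "root=/dev/sda3 amdgpu.msi=0",
    "root=/dev/sda3",
):
    print(cmdline, "->", msi_disable_options(cmdline) or "MSI not disabled here")
```

Either way, the amdgpu line in /proc/interrupts should then show an IO-APIC interrupt rather than PCI-MSI.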

The problem is not really solved but it is understood.

-- edit --

Maybe not.  I turned the screensaver on and went away for a few hours.

It was locked up when I came back.

----------

