# Isolating memory failures

## ExecutorElassus

I've had a randomly recurring problem with system freezes, usually when doing Gaming Things (shoutout to wine's excellent progress with supporting modern games). I ran furmark without issue, and ran 'stress' on the CPU, also without issue. However, running memtest86+ caused the computer to switch off and restart, usually about 10% into the test. I pulled out all the RAM sticks, and put them in one by one, getting the same problem (at different points, lasting longer if I told memtest to use SMP) on every stick in any slot. 

This was suggested to me elsewhere to be a fault with the mobo's RAM bus. Is there a way to isolate this better? The problem with system freezes occurs more frequently when it's hot out, which suggests to me some component on the mobo is burned, but I'd like to know a bit better before I plunk down another €500 for a new CPU/mobo combo.

Cheers,

EE

----------

## eccerr0r

If the computer powers down while running a RAM test, I'd look into power problems.

Since you mention that it fails more often during hot weather, you should see if the cooling system is working.  Ensure everything is clean of dust.

Your RAM is probably fine, but the motherboard and the power supplies (including the ones on the motherboard) are suspect.

----------

## ExecutorElassus

CPU and GPU are on their own water loop with an external radiator. The case is vertically oriented (that is, the ports are on the top of the case, not the back, so airflow moves bottom-to-top over the cards). 

I was told that PSUs only last "5-8 years," and this one's over a decade old. Is there any way to test that the PSU is faulty besides swapping it out?

Cheers,

EE

(this would be, honestly, a preferable diagnosis to a bad mobo, because the PSU is cheap to replace and I'd have to replace the entire CPU/RAM/mobo set otherwise)

----------

## Jaglover

You could measure the voltages on the ATX connector. Don't unplug it, because it is important to measure under load. However, this test would not reveal whether some rail provides "dirty" power. But then again, if it is that old, why not replace it? There are PSU testers, but the cheap ones are useless because they do not put any load on the PSU.
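As a rough aid for reading multimeter results, here is a minimal Python sketch that checks measured rail voltages against the ±5% tolerance the ATX specification gives for the +12 V, +5 V and +3.3 V rails. The helper name `check_rails` and the readings in the example are made up for illustration. Passing this check says nothing about ripple or fast load transients.

```python
# Nominal ATX rail voltages and their allowed tolerance (fraction of nominal).
ATX_RAILS = {
    "+12V": (12.0, 0.05),
    "+5V": (5.0, 0.05),
    "+3.3V": (3.3, 0.05),
}


def check_rails(readings):
    """Return {rail: (low_limit, high_limit, in_spec)} for each measured rail."""
    report = {}
    for rail, measured in readings.items():
        nominal, tolerance = ATX_RAILS[rail]
        low = nominal * (1 - tolerance)
        high = nominal * (1 + tolerance)
        report[rail] = (round(low, 3), round(high, 3), low <= measured <= high)
    return report


if __name__ == "__main__":
    # Hypothetical multimeter readings taken under load.
    example = {"+12V": 11.32, "+5V": 5.08, "+3.3V": 3.29}
    for rail, (low, high, ok) in check_rails(example).items():
        status = "OK" if ok else "OUT OF SPEC"
        print(f"{rail}: allowed {low}-{high} V -> {status}")
```

Take the readings while the machine is doing the thing that makes it fail, since an idle reading can look fine on a PSU that sags under load.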

----------

## NeddySeagoon

ExecutorElassus,

If it's the PSU, you need to test the dynamic regulation. That's hard.

The problem is that the voltages need to stay within spec when the CPU goes from almost nothing to full power in one CPU clock. 

With a 3 GHz CPU clock that's not very long: one cycle is 1/(3×10⁹) s, or about 333 ps.

It gets worse. The CPU and memory subsystem have their own PSU on the motherboard. This takes 12 V from the tin-can PSU and turns it into the voltages required by the CPU and memory.

This bit gets a very hard life and, as a result, fails more often than the PSU you are thinking of replacing.

At over 10 years old, if the rest of the system is of an age with the PSU, failures here can often be spotted with the Mk1 eyeball.

Look at the capacitors around the CPU. Be sure that they are not leaking, bulging, or tipped over. Those are all signs of failure.

Replacing these parts (they must all be done together) is a job requiring intermediate soldering skills.

So far, you have identified a systems problem that is probably not RAM.

Test the RAM with memtest86+ in another system.

----------

## ExecutorElassus

Hi Neddy!

the mobo/CPU/RAM were all from 2016; the GPU from 2009, the PSU from around 2008.

The intermittent problem I'm having is that the system will freeze, requiring hard reboot. It happens more often during warm weather, and more often when doing memory-intensive things (or, at least I assume that's the case since the game I play only uses one core really and doesn't do a lot of HDD writes). 

As I said, another contact suggested the RAM bus is failing. But maybe it's the PSU on the mobo?

I'll have to ask around to see if I can find anybody who even has a desktop. 

Is there any other way to test besides finding another machine?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

In 2016, the memory controller was built into the CPU. The bus is the tracking and terminating resistors at each end.

These resistors are on the RAM sticks at the RAM stick end and on the motherboard at the CPU end. 

Assuming you have 4 or 6 memory sockets, then 4 or 6 parts of the RAM bus have failed.

That's unlikely.

Can you post images of the region of the motherboard around the CPU?

Don't remove the CPU heatsink yet, let's see what we can see first.

----------

## ExecutorElassus

Hi Neddy,

the best I could do without cracking the case is this photo. Sorry for all the tubing (and yes, the green coolant suggests to me that the GPU could probably do with a replacement; I'm gonna try that by year's end). 

See anything useful there?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

It's difficult to see in that image.

There are 11 tubular silver things at the top and top right side of the water block.

There are more above the finned black and red heatsink. Those are the bits we need to see.

The finned heatsink carries the switching transistors for the CPU power supply.

The black D on the tops indicates polarity. That's important if/when you come to replace the parts.

The connector in the top right of the image, with the black and yellow wires, is the 12 V input to the CPU PSU.

My connector is charred and it has most of the plastic missing from the PSU cable.  Every now and again it goes high resistance and I have to clean it.

Yours looks OK. It's worth pulling it apart to inspect.

Better images would be useful.

----------

## ExecutorElassus

Hi Neddy,

I finally cracked the case open and got some better photos. I don't see any obvious damage here, and the cable appears to be properly seated (though there's a second cable, lower down and smaller, going into the mobo that I didn't check). Can you see anything here that looks like an obvious culprit?

Thanks for the help,

EE

----------

## NeddySeagoon

ExecutorElassus,

Both photos look good.

-- edit --

All look good.  - Missed one.

----------

## ExecutorElassus

all right. So should I go back to swapping out the PSU, or is there some other check I can make to try to narrow it down?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

Testing by substitution is indeed the next step.

Order doesn't matter but just one thing at a time. If you have a PSU to hand, swap it.

Likewise, try parts from this system elsewhere.

----------

## artbody

I don't know a lot about PSUs, but I've always had sys-apps/lm_sensors, with GKrellM for visualisation, installed and configured on my PC, so I can always see what temperature the GPU and CPU are at.
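For anyone who wants the same readings without a GUI, here is a minimal Python sketch that reads every temperature the kernel exposes through the hwmon sysfs interface, which is what lm_sensors reads as well. Which chips show up depends on the sensor drivers your kernel has loaded, so treat the output as "whatever the kernel knows about", not a complete picture of the case. `read_hwmon_temps` is just an illustrative name.

```python
from pathlib import Path


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {"chip/label": degrees_celsius} for every temp*_input found."""
    temps = {}
    for chip_dir in sorted(Path(base).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for input_file in sorted(chip_dir.glob("temp*_input")):
            label_file = input_file.with_name(input_file.name.replace("_input", "_label"))
            label = label_file.read_text().strip() if label_file.exists() else input_file.name
            try:
                # The kernel reports temperatures in millidegrees Celsius.
                temps[f"{chip}/{label}"] = int(input_file.read_text()) / 1000.0
            except (OSError, ValueError):
                continue  # Some sensors are unreadable; skip them.
    return temps


if __name__ == "__main__":
    for sensor, celsius in read_hwmon_temps().items():
        print(f"{sensor}: {celsius:.1f} °C")
```

Running it in a loop while gaming and logging the result to a file would show whether some sensor other than the CPU or GPU climbs before a freeze.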

the other thing i would suggest is a memtest

----------

## ExecutorElassus

hi artbody,

I have gkrellm running. Neither CPU nor GPU ever get near redline (CPU maxes out around 55°C under 100% load, occasionally spiking to a bit over 60°C; GPU never gets above about 45°C). 

As I said upthread, running memtest causes the machine to switch off and restart. 

Unfortunately, I don't have a spare machine, and I don't know anybody who has a PC. I could maybe ask at the University if the lab has a spare PSU they could lend, and try that out. I'll let you know.

Cheers,

EE

----------

## ExecutorElassus

Update: I installed a new PSU. Same problem with memtest. I don't know yet if the machine freezes (it does it randomly). However, I also dusted off all the fans, and now both CPU and GPU temps are much lower, even when gaming. 

I wonder if the problem might be heat-related. With both the CPU and GPU water-cooled, they don't register very high temperature, and since the case fans are motherboard-controlled, maybe there's not enough airflow in the case and some other component is overheating? It seems to freeze more when it's hot out, and less after I clean the fan filters. That still doesn't explain the switch-off running memtest.

But in any case, with a new PSU I still don't know what the problem is, because it evidently isn't solved completely yet. Neddy, any idea what to try next? I was going to see if I could borrow a vid card from the University lab, and see if maybe that might be the problem. The GPU is now the oldest component (it's almost 9 years old), and I heard that vid card problems can affect memtest. 

Any other suggestions?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

Try turning off Message Signalled Interrupts. Add  

```
pci=nomsi
```

to your kernel line in grub.conf.

You get a small performance penalty for that.

Normal IRQs and MSI work quite differently.

In the old way, the address of the interrupt service routine is stored in a lookup table. If the IRQ is shared, the service routine has to query every device in the list until it finds the device that raised the IRQ.

With MSI, the device is programmed with the address of the IRQ when the interrupt service routine is installed.

When the IRQ is acknowledged, the device puts this address on the bus and the CPU jumps to it.

It's more complex and has tighter timing constraints than the old way.

Sometimes, it fixes hard to track down lockups. When MSIs fail, the CPU can jump anywhere.
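To make the difference concrete, here is a toy Python model of the two dispatch styles described above. It is only an illustration of the idea, not how the kernel or the hardware implements either mechanism; the `Device` class, the handler table and the vector numbers are all invented for the example.

```python
class Device:
    def __init__(self, name):
        self.name = name
        self.pending = False  # True when this device has raised an interrupt


def shared_irq_dispatch(devices):
    """Legacy shared line: poll each device until the one that fired is found."""
    polled = 0
    for dev in devices:
        polled += 1
        if dev.pending:
            dev.pending = False
            return dev.name, polled
    return None, polled  # spurious interrupt: nobody claimed it


def msi_dispatch(handlers, vector):
    """MSI-style: the message itself selects the handler, no polling needed."""
    handler = handlers.get(vector)
    if handler is None:
        raise LookupError(f"no handler for vector {vector}")
    return handler()


if __name__ == "__main__":
    devs = [Device("usb"), Device("sata"), Device("gpu")]
    devs[2].pending = True
    print(shared_irq_dispatch(devs))  # the GPU is last in the list, so all three get polled

    handlers = {40: lambda: "usb", 41: lambda: "sata", 42: lambda: "gpu"}
    print(msi_dispatch(handlers, 42))
```

In the toy model a bad vector raises a clean error. Real hardware has no such safety net, which is the point of the warning about the CPU jumping anywhere.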

Test. When the system locks up, is the CPU halted?

In the halt state, pressing the reset button will not restart the CPU. Only the power button will bring it out of halt.

If the CPU is halted, it's got itself in a big mess, like it would if it jumped to something that was not code in response to an interrupt.

GPUs generate a lot of IRQs ...

Look at /proc/interrupts
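If the raw file is hard to read, a small Python sketch like the following can sum the per-CPU counters and show which interrupt lines are busiest and how many of them use MSI. The function names are illustrative, and the parsing assumes the usual layout (a header row of CPU columns, then one row per interrupt). After booting with `pci=nomsi`, I would expect the PCI-MSI entries for PCI devices to be gone.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into (irq, total_count, description) tuples."""
    lines = text.splitlines()
    if not lines:
        return []
    cpu_count = sum(1 for tok in lines[0].split() if tok.startswith("CPU"))
    rows = []
    for line in lines[1:]:
        irq, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        # Per-CPU counters come first; rows such as ERR have fewer of them.
        while fields and len(counts) < cpu_count and fields[0].isdigit():
            counts.append(int(fields.pop(0)))
        rows.append((irq.strip(), sum(counts), " ".join(fields)))
    return rows


def busiest(rows, top=5):
    """Return the rows with the highest interrupt totals."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]


def msi_rows(rows):
    """Rows whose description mentions MSI (PCI-MSI and similar)."""
    return [row for row in rows if "MSI" in row[2]]


if __name__ == "__main__":
    try:
        with open("/proc/interrupts") as handle:
            rows = parse_interrupts(handle.read())
    except OSError:
        rows = []  # Not on Linux, or /proc is not mounted.
    for irq, total, description in busiest(rows):
        print(f"{irq:>4} {total:>12} {description}")
    print(f"{len(msi_rows(rows))} interrupt lines use MSI")
```

Comparing the output before and after a gaming session gives a feel for how many interrupts the GPU generates.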

----------

## ExecutorElassus

Hi Neddy,

I'll try disabling MSI on next reboot. Would that also affect memtest?

When the system locks up (which so far has only once happened outside of gaming, and then immediately after closing the game), the reset button restarts the machine.

So far, though, after dusting off my fan filters, I haven't had any lockups. But since it's quite random, I don't know if that means anything.

I'm going to try borrowing a vid card and see if that solves the memtest issue.

Stay tuned,

EE

----------

## NeddySeagoon

ExecutorElassus,

Turning off MSI fixes lots of marginal timing things.

----------

## C5ace

Had the same problem during our summer with my 9 year old 24/7 system. Fixed it by replacing the dried out thermal paste with fresh thermal paste between the CPU and heatsink.

----------

## ExecutorElassus

yup, that too. I've since learned that good thermal paste only lasts maybe 6 months, so now I replace it regularly. I also (again) discovered that I need to give the whole case, and especially the fan filters, a thorough vacuuming at least a few times a year. Drops the running temp down a good 30°C.

But since the freezes I was having happened apparently randomly (and only really when gaming) I have no way to know whether I've resolved the problem, except by inference and probability. The longer it goes without freezing, the more likely it looks that my problem, perhaps unrelated to memtest, was simply a problem with something in the case overheating.
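The "inference and probability" part can be put into rough numbers. The Python sketch below assumes freezes are independent random events arriving at a steady average rate (a Poisson process), which is a simplification since they seem to depend on heat. The rate and the number of quiet days are hypothetical, not measurements from this machine.

```python
import math


def prob_quiet_by_chance(freezes_per_day, quiet_days):
    """P(no freeze in quiet_days) if freezes still occur at the old average rate.

    Assumes freezes are independent random events (a Poisson process).
    """
    return math.exp(-freezes_per_day * quiet_days)


if __name__ == "__main__":
    # Hypothetical numbers: one freeze every 5 days before, 14 quiet days since.
    p = prob_quiet_by_chance(1 / 5, 14)
    print(f"Chance of 14 quiet days with the fault still present: {p:.1%}")
```

With those made-up numbers the chance is about 6%, so two quiet weeks would be suggestive but far from proof. The probability never reaches zero, however long the quiet spell lasts.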

I'll keep y'all posted.

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

It really helps if you can generate a simple test case.

Keep in mind too that absence of evidence is not evidence of absence.

So you can't prove the problem is not there any more.

----------

## ExecutorElassus

Neddy, you are exactly right. I may have worded it poorly, but that's what I was getting at: all I really know so far is that it hasn't frozen again yet. And my main problem has been, from the beginning, that I can't isolate what's causing the system to freeze in the first place. I just know that it happens when gaming, and seems to happen less when the case/fans/radiator have been dusted. But the proximate cause remains unknown. 

So, I guess, I'll just keep trying to get it to freeze, and keep trying different tests (should be able to borrow a vid card soon), and keep you posted.

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

I think that phrase was attributed to Carl Sagan. 

I'm sure I heard him use it with regards to SETI and SETI@home.

----------

## eccerr0r

I kind of doubt a video card could cause memtest failures especially in newer machines where busses are mostly separated from each other.  But it could be a first especially if the video card for whatever reason causes overload.

On older machines with a separate northbridge, I have had instances of the northbridge overheating or failing and causing memory failures.  This shouldn't be the case for modern machines however - unless the CPU has gone bad.  BTW, do you get memory failures on cacheable vs uncacheable tests?

When I see overheating problems on my PC, it's usually due to dust clogging heatsink fins -- which shouldn't be a problem with water blocks -- but the heatsink compound is a common denominator...  IMHO if heatsink compound is applied properly (versus slathered all over the place), it should last longer between applications.  Then again, I really try hard not to need to remove the heatsink so I don't need to reapply heatsink compound, and I have machines that have yet to have their heatsink compound replaced since initial assembly when new.

----------

## ExecutorElassus

Well, I've been running now for a week or two with a new PSU, and it hasn't frozen yet. So I don't have any further information, except for this: I saw in dmesg the following error messages:

```
[ 8410.034472] mce: [Hardware Error]: Machine check events logged
[ 8410.034478] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 2: 98254000000c0176
[ 8410.034482] mce: [Hardware Error]: TSC 0 MISC c008000100000000
[ 8410.034488] mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1555132314 SOCKET 0 APIC 0 microcode 6000822
[15258.199168] mce: [Hardware Error]: Machine check events logged
[15258.199173] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 2: dc25407000040136
[15258.199176] mce: [Hardware Error]: TSC 0 ADDR 7b039ad38 MISC c008000300000000
[15258.199178] mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1555139162 SOCKET 0 APIC 1 microcode 6000822
[15569.479162] mce: [Hardware Error]: Machine check events logged
[15569.479164] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 2: dc2540e000040136
[15569.479168] mce: [Hardware Error]: TSC 0 ADDR 705799f78 MISC c008000700000000
[15569.479171] mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1555139474 SOCKET 0 APIC 1 microcode 6000822
```

Any idea what that is? The CPU is an AMD FX-9590, from 2016. 

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

At face value it's a CPU problem.

However, if you have ECC RAM, it can be a RAM problem too.

The CPU has ECC on its internal caches; without ECC, errors go undetected.

The good news is that the error was detected and corrected.

Detected uncorrectable errors get you a panic.
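The "corrected" part can be read straight out of the status words in the log. Here is a minimal Python sketch that decodes only the architectural flag bits at the top of an x86 machine-check status register (bits 63 down to 57) and the low 16-bit error code. The remaining bits are vendor and model specific and need the AMD documentation for this CPU family, so the sketch does not attempt them. `decode_mc_status` is an illustrative name.

```python
# Architectural flag bits in the upper part of an x86 MCi_STATUS register.
STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten (overflow)
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register holds valid extra information
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}


def decode_mc_status(status):
    """Return (set_flag_names, mca_error_code) for a raw status value."""
    flags = [name for bit, name in STATUS_FLAGS.items() if status >> bit & 1]
    error_code = status & 0xFFFF  # low 16 bits: MCA error code
    return flags, error_code


if __name__ == "__main__":
    # The three Bank 2 status values from the dmesg output above.
    for raw in (0x98254000000C0176, 0xDC25407000040136, 0xDC2540E000040136):
        flags, code = decode_mc_status(raw)
        kind = "uncorrected" if "UC" in flags else "corrected"
        print(f"{raw:016x}: {kind}, flags={'|'.join(flags)}, code=0x{code:04x}")
```

None of the three values has UC set, which matches "detected and corrected". The first one lacks ADDRV, which fits the missing ADDR field on its log line, and the later two have OVER set, meaning an earlier error was overwritten before it was read.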

----------

## JustAnother

I had my own little freezing issue lately.

-- A dual core AMD machine, ~2008.

-- The "cpu fan bracket" has two hooks that live under a lot of stress.

-- About three years ago I heard a loud sound like a bolt being thrown against the case. Then the machine started shutting down. I realized that a fan bracket hook had broken with great gusto, and the cpu was overheating. For a few bucks I got it running. Case closed - or so I thought.

-- Recently the same machine started freezing without warning. No messages. No pings. No ssh's. No nothing. Just stone silence.

-- Aha! It must be firefox - it had just upgraded. Turned out not to be the problem.

-- Aha! I had just dockerized the kernel. I must have messed up a working kernel with all those fancy schmancy docker switches. Turned out not to be the case.

-- Aha! Electromigration -- people say the processor is only good for about 10 years, and those damn atoms have jostled around one too many times. Nope.

-- Aha! A cosmic ray nailed the cpu. Nope.

-- Aha! It must be the memory going senile because it dune wore out. I had never seen anything special happen when I ran memtest, but "they" said to do this. To my surprise, the computer seemed to freeze during memtest. After several more runs memtest failed, saying there was a bogus hardware interrupt on cpu 1, shutting it down. Okay, so it's the hardware, not the dockerized kernel.

-- It's only 11 years old, so maybe it's time to upgrade. But I can't stand the thought of rebuilding this thing from scratch. 

-- So I figured it might be wise to open the case and look for anything obvious, like some sparks or some ugly black stains.

-- Aha! I found something. The heat sink and fan assembly didn't feel right -- it was too loose. Apparently one of the two fan bracket plastic hooks had failed, but unlike the previous failure where the hook went flying like a bullet, two sides failed and it pivoted along the third side. -- The net effect was that one side of the cpu was held too loosely, and the other side was held way too loosely. This puts a gradient onto the thermal conductance per unit area, leading to an asymmetric cpu failure. This sneaky little problem was making the computer freeze, not shut down.

-- A new part for $5 seemed to fix everything. I just smeared around that nasty grease with my finger.

----------

