# MCE Errors

## Myu

Hello everyone,

I've been getting a lot of "MCE Errors" thrown at me at boot and on every TTY for the past day. I installed app-admin/mcelog and added CONFIG_X86_MCELOG_LEGACY=y to my kernel to be able to look at them more easily, and it doesn't look good:

```
CPUID Vendor Intel Family 6 Model 58
mcelog: Trigger `cache-error-trigger' exited with status 1
mcelog: Trigger `cache-error-trigger' exited with status 1
mcelog: Cannot collect child 7516: No child processes
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 8
MISC 3022024086 ADDR 1031680
TIME 1509642356 Thu Nov  2 18:05:56 2017
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
Threshold based error status: yellow
Large number of corrected cache errors. System operating, but might lead
to uncorrected errors soon
MCA: corrected filtering (some unreported errors in same region)
Instruction CACHE Level-2 Instruction-Fetch Error
CPU 1 on socket 0 has large number of corrected cache errors in Level-2 Instruction
System operating correctly, but might lead to uncorrected cache errors soon
Cannot find sysfs cache for CPU 1Running trigger `cache-error-trigger'
STATUS cc56550000071152 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 8
MISC 3022024086 ADDR 1031680
TIME 1509642356 Thu Nov  2 18:05:56 2017
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
Threshold based error status: yellow
Large number of corrected cache errors. System operating, but might lead
to uncorrected errors soon
MCA: corrected filtering (some unreported errors in same region)
Instruction CACHE Level-2 Instruction-Fetch Error
CPU 0 on socket 0 has large number of corrected cache errors in Level-2 Instruction
System operating correctly, but might lead to uncorrected cache errors soon
Cannot find sysfs cache for CPU 0Running trigger `cache-error-trigger'
STATUS cc5654c000071152 MCGSTATUS 0
MCGCAP c09 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58
```
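
For anyone reading along, the STATUS word in the dump can be unpacked by hand to see where mcelog's text comes from. This is only a rough Python sketch, assuming the IA32_MCi_STATUS bit layout described in the Intel SDM. The names `decode_status`, `RRRR`, `TT`, `LL` and `THRESHOLD` are made up for illustration, and the threshold and corrected-count fields are only meaningful if MCG_CAP advertises them (the `c09` value in the log appears to).

```python
# Lookup tables for the "cache hierarchy" compound error code
# (bit pattern 000F 0001 RRRR TTLL in the low 16 bits of the status word).
RRRR = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
        5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TT = {0: "Instruction", 1: "Data", 2: "Generic"}
LL = {0: "L0", 1: "L1", 2: "L2", 3: "Generic"}
THRESHOLD = {0: "no tracking", 1: "green", 2: "yellow", 3: "reserved"}


def bit(value, n):
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into its main fields (assumed layout)."""
    code = status & 0xFFFF
    out = {
        "valid": bit(status, 63),
        "overflow": bit(status, 62),
        "uncorrected": bit(status, 61),
        "misc_valid": bit(status, 59),
        "addr_valid": bit(status, 58),
        "context_corrupt": bit(status, 57),
        "threshold": THRESHOLD[(status >> 53) & 0x3],
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": code,
    }
    # Only decode further if the code matches the cache hierarchy pattern.
    if (code & 0xEF00) == 0x0100:
        out["filtered"] = bit(code, 12)
        out["request"] = RRRR.get((code >> 4) & 0xF, "unknown")
        out["type"] = TT.get((code >> 2) & 0x3, "reserved")
        out["level"] = LL[code & 0x3]
    return out
```

Calling `decode_status(0xcc56550000071152)` on the first record reports overflow set, uncorrected clear, threshold "yellow" and an L2 instruction-fetch cache error, which lines up with what mcelog printed.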

Does anyone have experience with this kind of error? Does it mean my CPU is dying?

I've had two system freezes so far (once in a VM and once while starting Firefox).

Appreciate any insight  :Smile: 

----------

## eccerr0r

Yes, usually it's a bad sign if it was working fine before.

If it has always thrown MCE errors, you should look into a new BIOS or try another kernel (or another distribution), mostly to rule out kernel config issues.

If it just started doing it on a known stable configuration, check the fan for cleanliness and the heatsink/thermal compound, and stop overclocking or even underclock it. Otherwise you may well be looking at hardware replacements (motherboard, CPU, possibly PSU).

----------

## Myu

Thanks for the reply eccerr0r (username almost checks out !  :Mr. Green: )

It never threw this kind of error before. The thing is, sometimes it boots with the errors and sometimes without, even with no hardware change in between...

Indeed, some other posts suggested overheating issues or a bad PSU. I cleared out the dust, put another fan to work and rechecked all connections. The errors are still there, but it seems my system can run (at least for what I've tried so far: light browsing).

I'll look into underclocking and yes, maybe it's time to keep an eye on thrift store deals.

----------

## Myu

Damn, it's pretty crazy: htop and cat /proc/cpuinfo were showing 3 cores, then 2, then 1... instead of 4 (i5-3470).

Edit: OK, mcelog is disabling CPU cores due to L2 cache errors... makes sense.
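
A quick way to see which cores have been taken offline is to compare the kernel's "present" and "online" CPU lists in sysfs. A small Python sketch, assuming the usual `/sys/devices/system/cpu` layout; the helper names `parse_cpu_list` and `offline_cpus` are hypothetical:

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-1,3' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list yields no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def offline_cpus(sysfs=Path("/sys/devices/system/cpu")):
    """Return the CPUs that are present in the machine but not online."""
    present = parse_cpu_list((sysfs / "present").read_text())
    online = parse_cpu_list((sysfs / "online").read_text())
    return [cpu for cpu in present if cpu not in online]
```

On a 4-core machine where two cores were disabled, `offline_cpus()` would be expected to return something like `[2, 3]`; on a healthy one, an empty list.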

----------

## eccerr0r

I'd look into M/B or PSU issues and not the CPU if more than one core is failing... though it's still possible for a CPU failure, it's not likely.

----------

## Ant P.

Ouch, that's bad. Neat that it manages to keep running like that though...

----------

## krinn

 *eccerr0r wrote:*   

> I'd look into M/B or PSU issues and not the CPU if more than one core is failing... though it's still possible for a CPU failure, it's not likely.

 

Generally heat can lead to malfunctions, which means randomness. When it's always the same core and always the same error type, well, you'd do better to put your bets on a lottery ticket...

----------

## eccerr0r

Well, unless the clock distribution is at fault or all cores share the same L2, a bug that shows up in one core should stay in that core.

However, in that second case, or with any other shared resource, the cores should know the error is coming from the L2/shared resource and that it's not the core's fault... that is, if the detection logic is smart, it should be disabling L2 chunks instead.

Of course you likely would not get the same error each time, but it should happen on the same core if it were a normal chip problem.  Heating of the chip is actually a "normal" problem and the error should stay on the same core.  All bets are off if it was a pre-production chip that had other issues that are not "normal" problems...
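
To check whether the errors really do stay on one core, the decoded records can be tallied per CPU and bank. A minimal Python sketch, assuming mcelog text output with `CPU n BANK m` lines like the ones posted earlier in the thread; `tally` and `RECORD` are illustrative names only:

```python
import re
from collections import Counter

# Matches the per-record header line, e.g. "CPU 1 BANK 8".
# Summary lines such as "CPU 1 on socket 0 has ..." do not match.
RECORD = re.compile(r"^CPU (\d+) BANK (\d+)", re.MULTILINE)


def tally(log_text):
    """Count decoded MCE records per (cpu, bank) pair."""
    return Counter((int(cpu), int(bank)) for cpu, bank in RECORD.findall(log_text))
```

Feeding it the contents of the mcelog log file (the path depends on how mcelog is configured) gives a quick picture of whether one core dominates or the errors are spread around.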

----------

## Myu

 *Quote:*   

> Ouch, that's bad. Neat that it manages to keep running like that though...

 

Yes, pretty neat to see the CPU core hot-plugging functionality working flawlessly =)

 *Quote:*   

> I'd look into M/B or PSU issues and not the CPU if more than one core is failing... though it's still possible for a CPU failure, it's not likely.

 

I also think there's a good chance it's the PSU, but right now I lack another compatible one to test with.

 *Quote:*   

> Generally heat can lead to malfunctions, which means randomness. When it's always the same core and always the same error type, well, you'd do better to put your bets on a lottery ticket...

 

Indeed. The thing is, right now my temps are just fine.

After the cleaning, I was able to play The Witcher 3 for 2 hours in a Windows VM (via KVM/VFIO), which I guess is stressful enough for the hardware. No problems at all, but the errors are still there.

----------

## Myu

Ok so since then I got one dead motherboard... it just abruptly stopped working and never managed to POST again.

I purchased a second MB (different chipset but compatible) to try to salvage the hardware (it's an i5 and 16GB of RAM after all), but the MCE errors are still present after setting up the new MB with my old Gentoo install.

Time for another CPU, I guess?

----------

## NeddySeagoon

Myu,

If you have moved all the bits to a new motherboard, that does not leave much.

That assumes new, not just different.

The CPU cache has single bit error detection and correction, hence it can work correctly with one bit errors.

Two bit errors will always be detected but cannot be corrected, so the core will be shut down.
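
The correct-one, detect-two behaviour can be illustrated with a toy code. The Python sketch below uses an extended Hamming (8,4) code purely as an illustration; it is an assumption that this resembles the real thing only in principle, since actual cache ECC works on much wider words and its exact scheme is not described here. `encode` and `decode` are made-up names.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8  # index 0 is the overall parity bit, 1..7 the Hamming(7,4) positions
    word[3], word[5], word[6], word[7] = data
    word[1] = word[3] ^ word[5] ^ word[7]  # covers positions 1, 3, 5, 7
    word[2] = word[3] ^ word[6] ^ word[7]  # covers positions 2, 3, 6, 7
    word[4] = word[5] ^ word[6] ^ word[7]  # covers positions 4, 5, 6, 7
    word[0] = sum(word[1:]) % 2            # makes the total number of 1s even
    return word


def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    parity_ok = sum(word) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        # Odd number of flips: treat it as one flip and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is non-zero: two bits flipped.
        return "uncorrectable", None
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble
```

Flipping any single bit of `encode(n)` still decodes to `n` with status `corrected`, while flipping any two bits is reported as `uncorrectable` rather than silently returning wrong data.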

----------

## krinn

 *Myu wrote:*   

> Ok so since then I got one dead motherboard...

 

Might also be worth seeing some exorcist  :Smile: 

----------

## Myu

@NeddySeagoon

Yes indeed, I moved all the bits onto the new MB, but it's not a brand new one, since it's hard to find those new at a decent price these days (Intel Core 3rd Gen).

Most of the time the system works fine, correcting errors along the way according to mcelog (as you said, under a certain limit, all goes well), but I'm not keen on keeping it like that if it has a tendency to fry my hardware eventually  :Surprised: 

I have another test in mind: swap the CPU for another socket-compatible one I have lying around. If I stop getting MCEs, the i5 is probably dead; if not, another part is faulty.

@krinn

 :Mr. Green:  I know right, this is getting higher and higher on my troubleshooting list

----------

## NeddySeagoon

Myu,

Look at the Vcore regulator on the motherboard next to the CPU.  It takes in the 12 V from the PSU, via the dedicated connector with the black and yellow wires, and converts it to the voltages used by the CPU and RAM.  Look for bulging or leaking capacitors.

When they begin to fail, the CPU operating voltages are no longer properly regulated.

A motherboard swap and a CPU swap are both good tests.

You would be unlucky to get two failing Vcore regulators.

----------

## Myu

Hello Neddy,

From what I can tell, the VRMs on the MB look sane, nothing unusual about them, which is good.

I swapped the CPU and so far there's not a single MCE error in sight. I'll keep monitoring it, but it seems a clear indication that it's the CPU after all.

Thanks for your help, much appreciated as always  :Smile: 

----------

