# [SOLVED] MCE errors

## cfgauss

About every five minutes, I get a hardware MCE error. Here is /var/log/mcelog from my newly-installed mcelog.

Do I need to replace any hardware?

Any help interpreting this log will be gratefully received.

Motherboard: Abit IP35 Pro

CPU: Intel Core 2 Quad Q6600 Kentsfield Quad-Core 2.4 GHz

Memory: 8GB (4 x 2GB) 240-Pin DDR2

[SOLVED] krinn, below, correctly interpreted the error message as a CPU memory cache error rather than a DRAM error. I replaced the CPU with another Q6600 (2008-era CPUs are not terribly expensive today) and MCE errors disappeared. [/SOLVED]

Last edited by cfgauss on Wed May 31, 2017 1:48 pm; edited 3 times in total

----------

## roboto

Did you enable Intel MCE features in the kernel .config?

----------

## cfgauss

 *roboto wrote:*   

> Did you enable Intel MCE features in the kernel .config?

 

```
# grep -i mce /usr/src/linux/.config
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
# CONFIG_X86_MCE_AMD is not set
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
# CONFIG_MCE_AMD_INJ is not set
```

Also five hours of memtest86+ produced no memory errors.

----------

## Zucca

Based on the timestamps, you're getting these pretty often. :\ However, it looks like all the errors were corrected. Maybe you have some verbose/debug flag or switch on?

I had MCE errors that looked like memory errors, but memtest passed successfully every time. I think the problem was the power delivery for the CPU. The VRMs probably got too hot (resistance increased --> voltage dropped). I had too big a heatsink on my CPU, so the CPU fan barely spun. That caused the VRMs to heat up because of insufficient airflow.

Now that I have bought a new motherboard, everything works without MCE errors. Its VRMs also have a better heatsink.

So if you can, check if something is too hot on your system.

Do these errors appear only on high CPU load?

----------

## cfgauss

 *Zucca wrote:*   

> So if you can, check if something is too hot on your system.
> 
> Do these errors appear only on high CPU load?

 

With a lightly loaded CPU, coretemp from lm_sensors registers under 40°C for each core. The frequency of errors is the same with a lightly loaded or heavily loaded CPU.

----------

## Zucca

I'm definitely no expert in this area, but

```
MCG status:

MCi status:
```

... empty values do not make any sense.

I'd try to raise kernel loglevel and, if possible, also mcelog's.

Oh, and if you have any valuable data, now would be the time to backup if you haven't already.

----------

## cfgauss

 *Zucca wrote:*   

> I'd try to raise kernel loglevel and, if possible, also mcelog's.

 

Thanks. I'll try these changes. Where do I increase these two loglevels?

----------

## krinn

Alas for you, the reported error is not about your memory, but about the level 2 cache memory of your CPU.

Since all cores share that memory, the core number changes, but the error itself is the same.

I'm afraid it's time to check the RMA status of your CPU with Intel.
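
For readers who want to see how a status value can point at the CPU cache rather than DRAM, here is a minimal sketch. It assumes the compound error code layout described in Intel's Software Developer's Manual, where the low 16 bits of an MCi_STATUS value follow the pattern `000F 0001 RRRR TTLL` for cache hierarchy errors (LL is the cache level, TT the transaction type). The mcelog output from the first post is not reproduced in this thread, so the value `0136` in the usage comment is made up for illustration, and `decode_mca_code` is a hypothetical helper name.

```shell
# decode_mca_code is a hypothetical helper. Its argument is the 16-bit MCA
# error code in hex, i.e. the low four hex digits of an MCi_STATUS value.
decode_mca_code() {
    code=$((0x$1))
    # Cache hierarchy errors have the bit pattern 000F 0001 RRRR TTLL.
    # The mask 0xEF00 ignores F (bit 12) and checks the remaining upper bits.
    if [ $((code & 0xEF00)) -ne $((0x0100)) ]; then
        echo "not a cache hierarchy error"
        return 0
    fi
    # LL: bits 1-0 select the cache level.
    case $((code & 3)) in
        0) level="L0" ;;
        1) level="L1" ;;
        2) level="L2" ;;
        3) level="generic" ;;
    esac
    # TT: bits 3-2 select the transaction type.
    case $(( (code >> 2) & 3 )) in
        0) ttype="instruction" ;;
        1) ttype="data" ;;
        2) ttype="generic" ;;
        *) ttype="reserved" ;;
    esac
    echo "cache hierarchy error: level=$level type=$ttype"
}

# Usage with an illustrative (not real) code:
#   decode_mca_code 0136
```

mcelog normally performs this decoding itself and prints a readable description, so the sketch is only meant to show why the same error can be reported against different cores while still naming the same cache level.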

----------

## Zucca

 *cfgauss wrote:*   

> Where do I increase these two loglevels?

I don't currently know how to set mcelog's loglevel (or whether that is even possible). For the kernel, though, you can add loglevel=7, for example, to the kernel command line. Also make sure quiet is not there at the same time.
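
To see what the running kernel was actually booted with, something like the following sketch may help. It only inspects a command line string for the `loglevel=` and `quiet` parameters mentioned above; `check_cmdline` is a hypothetical helper name, not an existing tool.

```shell
# check_cmdline is a hypothetical helper. Pass it a kernel command line
# string, e.g. the contents of /proc/cmdline.
check_cmdline() {
    level=""
    quiet="no"
    # Split the command line on whitespace and look at each parameter.
    for arg in $1; do
        case "$arg" in
            loglevel=*) level="${arg#loglevel=}" ;;
            quiet)      quiet="yes" ;;
        esac
    done
    echo "loglevel=${level:-unset} quiet=$quiet"
}

# Usage on a running system:
#   check_cmdline "$(cat /proc/cmdline)"
```

The first number in /proc/sys/kernel/printk shows the console log level currently in effect, which is a quick way to confirm that a changed command line took hold after a reboot.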

----------

## cfgauss

 *krinn wrote:*   

> Alas for you, the reported error is not about your memory, but about the level 2 cache memory of your CPU.
> 
> Since all cores share that memory, the core number changes, but the error itself is the same.
> 
> I'm afraid it's time to check the RMA status of your CPU with Intel.

 

Thanks for this interpretation of mcelog. I'll look into getting a replacement CPU.

----------

## krinn

First thing to do: boot a live CD. The reported error is a hardware one, but software (mcelog) is handling it, and software can have bugs too. A live CD gives you a different environment from your own, which would confirm that the problem is not software related.

Don't trust krinn, he is a badass. Ask other guys what they think about your issue before spending money based on what some stupid random guy said in some random forum, or just because your neighbour's kid hacked my account to post that answer and get you (you should really be nicer to that neighbour's kid).

----------

## Juippisi

Did you recently update your kernel? I had these errors with the 4.10 series kernel as well. I almost panicked and threw my CPU away! But then I upgraded to 4.11 and the errors were gone...?

The errors started coming after 4.9.15, if I remember correctly, and then they occurred in every version until 4.11. I always switched back to an older kernel which didn't give these errors.

Hope it's the same for you! Running an i7-2700k here.

----------

## cfgauss

 *Juippisi wrote:*   

> Did you recently update your kernel? I had these errors with the 4.10 series kernel as well. I almost panicked and threw my CPU away! But then I upgraded to 4.11 and the errors were gone...?
> 
> The errors started coming after 4.9.15, if I remember correctly, and then they occurred in every version until 4.11. I always switched back to an older kernel which didn't give these errors.

 

Thanks for the suggestion. Unfortunately, I get the same errors at the same frequency under both 4.9.9 and 4.11.1.

----------

## cfgauss

 *krinn wrote:*   

> First thing to do: boot a live CD. The reported error is a hardware one, but software (mcelog) is handling it, and software can have bugs too. A live CD gives you a different environment from your own, which would confirm that the problem is not software related.

 

I get the same errors running under the Linux Rescue CD as I do with my Gentoo box under either kernel 4.9.9 or 4.11.1.

----------

## cyberhoffman

I've had a lot of MCE errors recently, and this is the only thing that helped me. I'm not sure there is any relation to your MCE errors, but check these options in your kernel config:

```
CONFIG_INTEL_PMC_CORE

CONFIG_INTEL_PCH_THERMAL
```

they should be turned on.

----------

## cyberhoffman

 *cyberhoffman wrote:*   

> 
> 
> they should be turned on.

 

If you have the proper devices, of course:

For CONFIG_INTEL_PMC_CORE:

```
vendor: 8086 ("Intel Corporation"), device: 9d21 ("Sunrise Point-LP PMC")
```

For CONFIG_INTEL_PCH_THERMAL:

```
vendor: 8086 ("Intel Corporation"), device: 8c24 ("8 Series Chipset Family Thermal Management Controller")
vendor: 8086 ("Intel Corporation"), device: 9c24 ("8 Series Thermal")
vendor: 8086 ("Intel Corporation"), device: 9ca4 ("Wildcat Point-LP Thermal Management Controller")
vendor: 8086 ("Intel Corporation"), device: 9d31 ("Sunrise Point-LP Thermal subsystem")
vendor: 8086 ("Intel Corporation"), device: a131 ("Sunrise Point-H Thermal subsystem")
```

----------

## cfgauss

 *cyberhoffman wrote:*   

> If you have proper devices of course:
> 
> For CONFIG_INTEL_PMC_CORE:
> 
> ...

 

How do you check to see if you have a device? E.g. how was the vendor: 8086... line produced?

----------

## krinn

These are vendor:product PCI codes. You can check which ones you have (8086 is Intel, so you must have at least some) with

```
lspci -n
```

I'm afraid PCH_THERMAL is related to temperature handling (overheating is also reported through MCE), so it might not help in your case.
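
As a rough sketch of that check, the numeric `lspci -n` output can be filtered for the vendor:device pairs listed earlier in this thread. The ID lists below are copied from that post; `find_ids` is a hypothetical helper name, and the sketch assumes each `lspci -n` line contains the pair in the form `8086:9d31`.

```shell
# IDs taken from the list earlier in this thread (all vendor 8086, Intel).
PMC_CORE_IDS="8086:9d21"
PCH_THERMAL_IDS="8086:8c24 8086:9c24 8086:9ca4 8086:9d31 8086:a131"

# find_ids is a hypothetical helper: it reads `lspci -n` output on stdin and
# prints every line containing one of the given vendor:device pairs.
find_ids() {
    # printf puts each ID on its own line; grep -F then treats every line
    # of the pattern as a separate fixed string to search for.
    grep -i -F "$(printf '%s\n' $1)"
}

# Usage on a running system:
#   lspci -n | find_ids "$PMC_CORE_IDS"
#   lspci -n | find_ids "$PCH_THERMAL_IDS"
```

No output means none of the listed devices is present, in which case the corresponding kernel option has nothing to drive.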

----------

## cfgauss

 *krinn wrote:*   

> These are vendor:product pci codes, you can check which ones you own (at least 8086 is intel, so you must have some) with
> 
> ```
> lspci -n
> ```
> ...

 

lspci indicates I don't have any of the INTEL_PMC_CORE or INTEL_PCH_THERMAL devices. And I believe you're correct that the errors are related instead to the CPU memory cache.

----------

