# mce Hardware error

## Kosmas

Hello,

I just found an MCE error in my dmesg output:

```
mce: [Hardware Error]: Machine check events logged
```

After running mcelog, I get the following:

```
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 43880014086 ADDR fef1ce80
TIME 1496394010 Fri Jun  2 12:00:10 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
PPIN ee0000000040110a
CPUID Vendor Intel Family 6 Model 142
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7
MISC 7880014086 ADDR fef1ce40
TIME 1496394010 Fri Jun  2 12:00:10 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
```
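For anyone who wants to cross-check mcelog's text against the raw register, here is a minimal sketch that decodes the `STATUS` value by hand. It assumes the architectural `IA32_MCi_STATUS` layout from the Intel SDM (flag bits 63 to 57, MCA error code in bits 15:0, with bit 12 as the "corrected filtering" indicator) and only handles the cache-hierarchy error format. The function and table names are made up for illustration and are not part of mcelog.

```python
# Sketch only: decodes the architectural part of an Intel IA32_MCi_STATUS value.
# Assumption: layout as described in the Intel SDM (machine-check architecture).

STATUS_FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # error overflow
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

# Cache-hierarchy error code format: 000F 0001 RRRR TTLL
REQUESTS = {0: "Generic error", 1: "Generic read", 2: "Generic write",
            3: "Data read", 4: "Data write", 5: "Instruction fetch",
            6: "Prefetch", 7: "Eviction", 8: "Snoop"}
TRANSACTIONS = {0: "Instruction", 1: "Data", 2: "Generic"}
LEVELS = {0: "Level-0", 1: "Level-1", 2: "Level-2", 3: "Generic"}


def decode_status(status):
    """Return the flags and, if applicable, the cache error fields."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    filtered = bool(mca_code & 0x1000)   # bit 12: corrected filtering
    base = mca_code & 0xEFFF             # error code with the F bit cleared

    cache = None
    if (base & 0xFF00) == 0x0100:        # matches 0000 0001 RRRR TTLL
        cache = {
            "request": REQUESTS.get((base >> 4) & 0xF, "Reserved"),
            "transaction": TRANSACTIONS.get((base >> 2) & 0x3, "Reserved"),
            "level": LEVELS[base & 0x3],
        }

    return {
        "flags": flags,
        "mca_code": mca_code,
        "model_code": (status >> 16) & 0xFFFF,
        "filtered": filtered,
        "cache": cache,
    }


if __name__ == "__main__":
    print(decode_status(0xEE0000000040110A))
```

For the `STATUS` shown above this gives the flags VAL, OVER, UC, MISCV, ADDRV and PCC, the filtering bit set, and a generic Level-2 cache error, which agrees with what mcelog printed.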

And the hardware (lspci) is the following:

```
00:00.0 Host bridge: Intel Corporation Device 5904 (rev 02)
00:02.0 VGA compatible controller: Intel Corporation Device 5916 (rev 02)
00:04.0 Signal processing controller: Intel Corporation Skylake Processor Thermal Subsystem (rev 02)
00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0 (rev 21)
00:15.1 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1 (rev 21)
00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
00:17.0 SATA controller: Intel Corporation Sunrise Point-LP SATA Controller [AHCI mode] (rev 21)
00:1c.0 PCI bridge: Intel Corporation Device 9d10 (rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #6 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Device 9d58 (rev 21)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
00:1f.3 Audio device: Intel Corporation Device 9d71 (rev 21)
00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Topaz XT [Radeon R7 M260/M265 / M340/M360 / M440/M445] (rev c3)
02:00.0 Network controller: Intel Corporation Wireless 3165 (rev 79)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8101/2/6E PCI Express Fast/Gigabit Ethernet controller (rev 07)
```

Does anyone have a clue as to what this error might be?

The hardware is a Dell Inspiron 15 5000 Series with an i7 (Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz).

Thank you in advance.

----------

## krinn

look at: https://forums.gentoo.org/viewtopic-t-1063672.html

but don't forget to read this

Keep in mind that parts of a computer may have different RMA status: check your RMA status with Dell, and check the RMA status of your CPU if the Dell RMA period is over.

----------

## Kosmas

Thanks so much krinn.

The laptop is still under warranty, but I have to run diagnostics before Dell will accept the problem.

----------

## krinn

Keep us posted. I don't have one myself, but other Dell users may like to see how Dell deals with your issue.

----------

## Kosmas

Hi krinn,

Just to update the forum, I got my hands on a second Dell Inspiron (mine has an i7, the other an i5), and I get the same hardware error.

It seems that it is not an actual error, but rather a problematic interpretation of the processor cache or something similar.

I will try to run the diagnostics on both laptops, and then if I get any kind of error, I will contact Dell.

Thanks again,

Kosmas.

----------

## krinn

You should know that the CPU itself is reporting an error. It's a CPU event: you don't ask the CPU whether it has an error, the CPU triggers the event when it has one.

I don't know about AMD, but Intel reports heat errors through MCE, and that's really good: a stressed CPU can fail when it is not used in normal conditions (heat, overclocking, unstable or too low/high voltages...). Because of unusual conditions the CPU may fail and trigger the error. That might still not mean the CPU is damaged, and getting it back to normal conditions may fix the issue.

The software part that could be buggy is the kernel thinking the CPU has reported an MCE when it didn't. That would be a huge bug, and many people would hit it.

Or it could be the interpretation of the error, which is done by mcelog and other utilities. It is quite possible that the type of an MCE is not interpreted correctly, because (without really knowing) AMD and Intel tend to use different codes everywhere, and an MCE error code on AMD may mean something different on Intel. The problem is that even if you are unsure about the type of the MCE, an MCE still did happen.

That's why I recommend using a live CD to see what's going on. First, and this is important, you must reboot, which resets the CPU and clears whatever bad context has disturbed it. Second, you are then using another kernel version, and it's a kernel that all users of that live CD run, which lowers the risk that a bug in that kernel would go unnoticed by others.

If the MCE still occurs with the live CD, it means your hardware still has the error even after a reboot, giving a sad but good clue that the error is persistent, either because the CPU is still in bad conditions or because it is damaged. But if you are not yourself putting the CPU in abnormal conditions, then some hardware is failing there.
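To compare the installed system with a live CD, it can help to pull just the MCE lines out of the kernel log on each. The following is a rough sketch, assuming the log text was saved beforehand (for example with `dmesg > dmesg.txt`). The helper name `find_mce_lines` and the sample text are made up for illustration; only the `mce: [Hardware Error]` line comes from this thread.

```python
import re
from typing import List

# Matches the prefix the kernel uses for machine check messages,
# as seen in the dmesg line quoted at the top of this thread.
MCE_PATTERN = re.compile(r"mce: \[Hardware Error\]")


def find_mce_lines(dmesg_text: str) -> List[str]:
    """Return every line of the given kernel log text that reports an MCE."""
    return [line for line in dmesg_text.splitlines() if MCE_PATTERN.search(line)]


if __name__ == "__main__":
    # Illustrative sample: the second line is a placeholder, not a real log line.
    sample = (
        "mce: [Hardware Error]: Machine check events logged\n"
        "placeholder: some unrelated kernel message\n"
    )
    for line in find_mce_lines(sample):
        print(line)
```

If the list is empty after booting the live CD but not on the installed system, that points toward the installed kernel or its configuration rather than persistently failing hardware.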

About your two computers: while I admit it's low probability and would be really bad luck, it might still be that both computers are indeed failing. I'm a bit surprised you have the same error on both, I must say. Or do you mean "same" because both report an MCE?

----------

## cyberhoffman

*krinn wrote:*

> look at: https://forums.gentoo.org/viewtopic-t-1063672.html

```
CONFIG_INTEL_PMC_CORE=y
CONFIG_INTEL_PCH_THERMAL=y
```

I'd say enabling these two options solves the issue.
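To check whether a given kernel already has these two options set, something like the sketch below could be used on the text of a kernel config. It is only an illustration: the helper name `check_options` is hypothetical, and where the config lives is an assumption about your setup (commonly `/usr/src/linux/.config` on Gentoo, or `/proc/config.gz` if the kernel was built with `CONFIG_IKCONFIG_PROC`).

```python
import re
from typing import Dict, Iterable, Optional

# The two options quoted in the post above.
WANTED = ["CONFIG_INTEL_PMC_CORE", "CONFIG_INTEL_PCH_THERMAL"]


def check_options(config_text: str, options: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each option to its value ('y', 'm', ...) or None if it is not set."""
    found = {}
    for line in config_text.splitlines():
        # Set options look like CONFIG_FOO=y; unset ones appear as
        # "# CONFIG_FOO is not set" and simply do not match here.
        match = re.match(r"^(CONFIG_\w+)=(.*)$", line.strip())
        if match:
            found[match.group(1)] = match.group(2)
    return {option: found.get(option) for option in options}


if __name__ == "__main__":
    # Illustrative config text, not taken from a real machine.
    sample = "CONFIG_INTEL_PMC_CORE=y\n# CONFIG_INTEL_PCH_THERMAL is not set\n"
    for option, value in check_options(sample, WANTED).items():
        print(option, "->", value if value is not None else "not set")
```

To run it against a real config, read the file into a string first (using `gzip.open` for `/proc/config.gz`) and pass that string in.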

----------

