# mce: [Hardware Error]: Machine check events logged

## Zucca

I was putting my PC into hibernate and instead of hibernating a moment or two later it just rebooted.

The boot phase stops at udev coldplugging and reboots again.

I added `systemd.unit=emergency.target` to the kernel command line and managed to enter the emergency shell. I mounted /var to read the journal, but as always there's nothing indicating what's wrong.

I'm currently running Memtest from live media... so far no errors detected.

I did, however, scrub my btrfs partition. No errors were found, but at some point while in the emergency shell, `mce: [Hardware Error]: Machine check events logged` was printed to the VC. I guess I'd need mcelog to get some sane logs of what happened, but as far as I know I haven't installed it.
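Short of mcelog, the kernel log itself can at least be searched for the tagged lines. A minimal sketch, assuming the log text has already been captured into a string (for example from `dmesg` or `journalctl -k`); the function name and the second sample line are made up for illustration:

```python
import re

# The kernel tags machine-check messages with "[Hardware Error]", e.g.
# "mce: [Hardware Error]: Machine check events logged".
HW_ERROR = re.compile(r"\[Hardware Error\]")


def find_hardware_errors(log_text):
    """Return every log line that carries the [Hardware Error] tag."""
    return [line.strip() for line in log_text.splitlines() if HW_ERROR.search(line)]


# The first line is the message from this thread; the second is an invented,
# unrelated line that only shows that other messages are filtered out.
sample = """\
mce: [Hardware Error]: Machine check events logged
usb 1-1: new high-speed USB device number 2 using ehci-pci
"""

for line in find_hardware_errors(sample):
    print(line)
```

This only finds the notice that something was logged. It does not decode the event, so a decoder is still needed to learn which bank or unit reported it.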

At this point I can only ask for some wild guesses about what's wrong. Has anyone experienced similar problems?

----------

## eccerr0r

Yes you'll need mcelog to see what events were logged but it still may or may not be clear what exactly happened.

Usually these problems are due to failing hardware, however; it could be anything from a bad northbridge to a bad CPU.  Overclocking can trigger this as well.

It also could be corruption from cosmic rays.

So indeed these are all wild guesses for now...

----------

## Zucca

I've stayed away from overclocking.

After memtest I'll try to run Fedora 24 live to see if it fails too.

----------

## Zucca

Well. Memtest found no errors. Fedora Live is working fine.

```
mcelog: ERROR: AMD Processor family 21: mcelog does not support this processor.  Please use the edac_mce_amd module instead.
: Success
CPU is unsupported
```

... and after modprobing that module, the results are exactly the same.   :Rolling Eyes:   My CPU is an FX-8350.

I'd hate to buy a new CPU or MB right now when I'm eagerly waiting for the AMD Zen release to see if it's worth it...   :Evil or Very Mad: 

EDIT: After being unable to find any errors I managed to boot my Gentoo system fine. As of right now I'm investigating all possible logs.

Also, if anyone knows how to get mcelog working with my setup, please share your ideas here.  :Smile: 

```
System:    Host: wren Kernel: 4.8.6-gentoo-wren x86_64 (64 bit) Desktop: i3 4.10.4
           Distro: Gentoo Base System release 2.2
Machine:   Mobo: Gigabyte model: GA-78LMT-USB3 6.0 v: SEx Bios: Award v: F2 date: 11/25/2014
CPU:       Quad core AMD FX-8350 Eight-Core (-HT-MCP-) cache: 8192 KB
           clock speeds: max: 4000 MHz 1: 1400 MHz 2: 1400 MHz 3: 2800 MHz 4: 1400 MHz 5: 1400 MHz
           6: 1400 MHz 7: 1400 MHz 8: 2100 MHz
Graphics:  Card-1: Advanced Micro Devices [AMD/ATI] RS780L [Radeon 3000]
           Card-2: Advanced Micro Devices [AMD/ATI] Fiji XT [Radeon R9 FURY X]
           Display Server: X.Org 1.18.4 driver: (unloaded: radeon)
           Resolution: 1920x1200@59.95hz, 1920x1080@60.00hz
           GLX Renderer: Gallium 0.4 on AMD FIJI (DRM 3.3.0 / 4.8.6-gentoo-wren, LLVM 3.7.1)
           GLX Version: 3.0 Mesa 12.0.1
Audio:     Card-1 Advanced Micro Devices [AMD/ATI] Device aae8 driver: snd_hda_intel
           Card-2 Advanced Micro Devices [AMD/ATI] RS780 HDMI Audio [Radeon (HD) 3000 Series]
           driver: snd_hda_intel
           Card-3 Advanced Micro Devices [AMD/ATI] SBx00 Azalia (Intel HDA) driver: snd_hda_intel
           Card-4 Logitech driver: USB Audio
           Sound: Advanced Linux Sound Architecture v: k4.8.6-gentoo-wren
Network:   Card: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller driver: r8169
           IF: ethmb0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Drives:    HDD Total Size: 1936.4GB (7.5% used) ID-1: /dev/sda model: SAMSUNG_SSD_830 size: 256.1GB
           ID-2: /dev/sdb model: KINGSTON_SV300S3 size: 120.0GB
           ID-3: /dev/sdc model: KINGSTON_SV300S3 size: 120.0GB
           ID-4: /dev/sdd model: KINGSTON_SV300S3 size: 480.1GB
           ID-5: /dev/sde model: KINGSTON_SV300S3 size: 480.1GB
           ID-6: /dev/sdf model: KINGSTON_SUV400S size: 480.1GB
Partition: ID-1: / size: 890G used: 119G (14%) fs: btrfs dev: /dev/sda3
           ID-2: /home size: 890G used: 119G (14%) fs: btrfs dev: /dev/sda3
           ID-3: /var size: 890G used: 119G (14%) fs: btrfs dev: /dev/sda3
           ID-4: /boot size: 496M used: 200M (43%) fs: ext2 dev: /dev/md1
           ID-5: swap-1 size: 18.72GB used: 0.00GB (0%) fs: swap dev: /dev/md5
RAID:      Device-1: /dev/md5 - active raid: 5 components: online: 6/6 - sda2 sdf2 sde2 sdd2 sdc2 sdb2
           Device-2: /dev/md1 - active raid: 1 components: online: 6/6 - sda1 sdf1 sde1 sdd1 sdc1 sdb1
Sensors:   System Temperatures: cpu: 23.2C mobo: N/A
           Fan Speeds (in rpm): cpu: N/A
Info:      Processes: 284 Uptime: 4 min Memory: 448.9/32041.1MB Client: Shell (codetag.sh) inxi: 2.2.19
```

EDIT02:

I'm unable to get mcelog up and running.

```
● mcelog.service - Machine Check Exception Logging Daemon
   Loaded: loaded (/usr/lib64/systemd/system/mcelog.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2016-11-02 20:02:41 EET; 4min 56s ago
  Process: 1612 ExecStart=/usr/sbin/mcelog --ignorenodev --daemon --foreground (code=exited, status=1/FAILURE)
 Main PID: 1612 (code=exited, status=1/FAILURE)

Nov 02 20:02:41 wren systemd[1]: Started Machine Check Exception Logging Daemon.
Nov 02 20:02:41 wren systemd[1]: mcelog.service: Main process exited, code=exited, status=1/FAILURE
Nov 02 20:02:41 wren systemd[1]: mcelog.service: Unit entered failed state.
Nov 02 20:02:41 wren systemd[1]: mcelog.service: Failed with result 'exit-code'.
Nov 02 20:02:41 wren mcelog[1612]: mcelog: AMD Processor family 21: Please load edac_mce_amd module.
Nov 02 20:02:41 wren mcelog[1612]: : Success
Nov 02 20:02:41 wren mcelog[1612]: CPU is unsupported
```

```
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_INTEL is not set
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_MCE_INJECT=m
CONFIG_EDAC_DECODE_MCE=y
# CONFIG_MCE_AMD_INJ is not set
```

... I should have all the required options, but do I really need that as a module?
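To see at a glance which of these options are built in (`=y`), built as modules (`=m`), or unset, the fragment can be parsed mechanically. A rough sketch, assuming the usual `.config` line syntax; the helper names `parse_kconfig` and `describe` are made up for this example:

```python
import re

# "CONFIG_FOO=y" / "CONFIG_FOO=m" / "CONFIG_FOO=<value>"
SET_LINE = re.compile(r"^(CONFIG_\w+)=(.*)$")
# "# CONFIG_FOO is not set"
UNSET_LINE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text):
    """Map each option name to 'y', 'm', another value, or 'n' when unset."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = SET_LINE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = UNSET_LINE.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def describe(value):
    """Turn a raw .config value into a readable label."""
    return {"y": "built in", "m": "module", "n": "not set"}.get(value, value)


# The fragment posted above, without the blank lines.
config = """\
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_INTEL is not set
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_MCE_INJECT=m
CONFIG_EDAC_DECODE_MCE=y
# CONFIG_MCE_AMD_INJ is not set
"""

for name, value in parse_kconfig(config).items():
    print(f"{name}: {describe(value)}")
```

An option reported as "built in" produces no `.ko` file, so a missing module file does not by itself mean the feature is absent.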

----------

## eccerr0r

I have an old AthlonXP that gets these MCEs occasionally.  I don't know what corruption it generates, but the machine still operates normally; I'd never notice until I look in the logs...

Also check your cooling, it may also affect it.

----------

## Zucca

 *eccerr0r wrote:*   

> Also check your cooling, it may also affect it.

 I suspect exactly that, since I had BOINC running just before I tried to hibernate the system.

I need to open the case again and see if I can replace the bridge heatsinks and maybe even the ones on the VRMs. VRMs also get hotter as the CPU draws more current.  And while I'm at it, I could also adjust the CPU fan so that the VRMs get more forced air.

I think the 125W (TDP) CPU I currently have in it is the maximum that the MB can handle.

So, yeah. It may just be an overheating issue, but it's definitely not CPU or GPU overheating.

For the next motherboard I'll be sure to check that it has bigger heat sinks for all the required components.

Oh and this... I hope they get this thing worked out with AMD.

----------

## dr_wulsen

my mce messages did also show up in dmesg... temperature limit... blah... temp normal. anything there?

and once i screwed up a machine that was quite old (p2) with acpi, but i guess that's not an issue nowadays...

are the acpi quirks enabled in your kernel, and was hibernation working before? if so, ignore my post please.

----------

## Zucca

 *dr_wulsen wrote:*   

> my mce messages did also show up in dmesg... temperature limit... blah... temp normal. anything there?

 I get nothing since my CPU isn't supported by mcelog (link in my previous post).

 *dr_wulsen wrote:*   

> and was hibernation working before? if so, ignore my post please.

 Hibernation was working. I think the problem was overheated VRMs.

Oops. I didn't ignore your post.   :Shocked: 

----------

## dr_wulsen

once i had a machine crashing due to hot and unstable vrms, a workaround was to put some paper/carton around the heatsink to force the fresh air through the vrm area.  best results were obtained with the fan flipped upside down, so it would take in the cool air from the bottom and pull it through the vrm area, effectively cooling them whilst the air was still cold enough to keep the cpu temp low.

this required putting paper/tape around the heatsink, leaving only the bottom few inches (which were redirected through the vrm area) free. that did the trick, may be worth a try if flipping the fan upside down is applicable?

----------

## Zucca

 *dr_wulsen wrote:*   

> once i had a machine crashing due to hot and unstable vrms, a workaround was to put some paper/carton around the heatsink to force the fresh air through the vrm area.

 That would definitely help in this case. My heat sink does not have any "air guides", so a lot of the cool air is actually hitting only the front of the heat sink. So... I'll try to block the sides of it and direct the air that's coming out of the CPU heat sink to the VRM heat sink.

(pictures linked are not from my setup)

----------

## dr_wulsen

nice cooling  :Smile: 

The trick with the paper/carton worked well with the old fans blowing from the top towards the board,

dunno if here it works well without taking away cooling from the CPU.

For my current setup which is also blowing through sideways, a solution was using a power supply with a big fan on the bottom, facing towards the CPU, which sucks the hot air out of the case quite efficiently, also cooling the PSU on my board.

Sure enough the CPU cooling is mounted to take air in from the bottom of the case and blow it directly into the PSU so the heat is getting transferred as quick as possible.

Just curious how it turns out to look after your modification and if it resolves the issue - please post pictures of it  :Smile: 

----------

## eccerr0r

I'm just using stock block heatsinks and fans with no heatpipes and it works fine...

The machines that I have heatpipes on get hotter than the ones that don't...

(Note: the only machine that I run heatpipes on are... laptops.)

----------

## Anon-E-moose

I'm running an 8320 overclocked to 8350 speed, and it worked fine for 99.99999% of the time,

but I did see an mce error 2 or 3 times over a period of a couple of years.

In my case I simply bumped the voltage up on the cpu, northbridge and memory by a minuscule amount and haven't had a problem since.

With most motherboards, with time the components age, and do sometimes need a slight bit more voltage to remain stable.

Not saying that is your problem, but it could be. 

As far as mce for AMD, you should have 2 modules edac_core and edac_mce_amd, if set up as modules.

For .config I have 

```
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_INTEL is not set
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
CONFIG_EDAC_DECODE_MCE=y
# CONFIG_AMD_MCE_INJ is not set
```

It worked the time I did use it, but it's been over a year, and I don't remember much about it now.

----------

## Zucca

 *eccerr0r wrote:*   

> I'm just using stock block heatsinks and fans with no heatpipes and it works fine...

  Before I switched to the FX series, my Phenom II had a TDP of only 95W. It stayed under 60°C with only one fan, a case fan. Back then my setup was practically silent.

 *Anon-E-moose wrote:*   

> As far as mce for AMD, you should have 2 modules edac_core and edac_mce_amd, if set up as modules.
> 
> For .config I have 
> 
> ```
> ...
> ```

 

Doesn't work here. :(

```
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_INTEL is not set
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_MCE_INJECT=m
CONFIG_EDAC_ATOMIC_SCRUB=y
CONFIG_EDAC_SUPPORT=y
CONFIG_EDAC=y
CONFIG_EDAC_LEGACY_SYSFS=y
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_DECODE_MCE=y
CONFIG_EDAC_MM_EDAC=m
CONFIG_EDAC_AMD64=m
CONFIG_EDAC_AMD64_ERROR_INJECTION=y
# CONFIG_EDAC_E752X is not set
# CONFIG_EDAC_I82975X is not set
# CONFIG_EDAC_I3000 is not set
# CONFIG_EDAC_I3200 is not set
# CONFIG_EDAC_IE31200 is not set
# CONFIG_EDAC_X38 is not set
# CONFIG_EDAC_I5400 is not set
# CONFIG_EDAC_I5000 is not set
# CONFIG_EDAC_I5100 is not set
# CONFIG_EDAC_I7300 is not set
# CONFIG_MCE_AMD_INJ is not set
```

```
mcelog: AMD Processor family 21: Please load edac_mce_amd module.
: Success
CPU is unsupported
```

... and I cannot find the module either:

```
/lib/modules/4.8.6-gentoo-wren/kernel/arch/x86/kernel/cpu/mcheck/mce-inject.ko
/lib/modules/4.8.6-gentoo-wren/kernel/drivers/edac
/lib/modules/4.8.6-gentoo-wren/kernel/drivers/edac/amd64_edac_mod.ko
/lib/modules/4.8.6-gentoo-wren/kernel/drivers/edac/edac_core.ko
```

Interestingly:

```
modprobe: ERROR: could not insert 'amd64_edac_mod': Cannot allocate memory
```
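A possible explanation for the missing file, worth verifying rather than taking as given: as far as I know, `edac_mce_amd` is built from `CONFIG_EDAC_DECODE_MCE`, which is set to `y` above, so the code would be compiled into the kernel and no `.ko` would exist. The kernel build records such built-in modules in `/lib/modules/<version>/modules.builtin`. A small sketch for checking a name against that file's contents; the helper names are made up, and the sample text is an invented excerpt, not taken from this system:

```python
from pathlib import PurePosixPath


def normalize(name):
    """Module names treat '-' and '_' as interchangeable."""
    return name.replace("-", "_")


def builtin_modules(modules_builtin_text):
    """Collect module names from modules.builtin (one .ko path per line)."""
    names = set()
    for line in modules_builtin_text.splitlines():
        line = line.strip()
        if line:
            # "kernel/drivers/edac/edac_mce_amd.ko" -> "edac_mce_amd"
            names.add(normalize(PurePosixPath(line).stem))
    return names


def is_builtin(module, modules_builtin_text):
    """True if the module is listed as compiled into the kernel."""
    return normalize(module) in builtin_modules(modules_builtin_text)


# Invented excerpt for illustration only; read the real file from
# /lib/modules/<kernel version>/modules.builtin.
sample = """\
kernel/drivers/edac/edac_mce_amd.ko
kernel/arch/x86/kernel/msr.ko
"""

print(is_builtin("edac_mce_amd", sample))
print(is_builtin("amd64_edac_mod", sample))
```

If the name shows up as built in, there is nothing to modprobe, and the "Please load edac_mce_amd module" message from mcelog would have to be explained some other way.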

----------

## Roman_Gruber

When my box logs mce events, it's because the thermal paste on the cpu needs to be changed again. That's usually after ~6 months, and it's been ~1 year now.

 *Quote:*   

> Nov  8 13:25:36 ASUS-G75VW kernel: CPU5: Package temperature above threshold, cpu clock throttled (total events = 8809)
> 
> Nov  8 13:25:36 ASUS-G75VW kernel: mce: [Hardware Error]: Machine check events logged
> 
> Nov  8 13:25:36 ASUS-G75VW kernel: mce: [Hardware Error]: Machine check events logged
> ...

 

You may turn on some verbose stuff in your kernel settings.

For my box, when I run i7z, I see that the multiplier is only 30x instead of 31x for all 4 cores at max on ivybridge. And temperatures are around 90 degrees centigrade.
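To check whether the MCE notices really coincide with thermal throttling, the two kinds of line can be matched by timestamp. A minimal sketch, assuming classic syslog-style lines like the ones quoted above, where the first 15 characters are the timestamp. The message text comes from an Intel box, so an AMD system may well print something different. The function names are made up for illustration:

```python
import re

# Matches e.g. "CPU5: Package temperature above threshold, cpu clock
# throttled (total events = 8809)".
THROTTLE = re.compile(
    r"CPU(\d+): (Core|Package) temperature above threshold, "
    r"cpu clock throttled \(total events = (\d+)\)"
)


def throttle_events(log_text):
    """Return {(cpu, scope): highest 'total events' counter seen}."""
    events = {}
    for match in THROTTLE.finditer(log_text):
        key = (int(match.group(1)), match.group(2))
        events[key] = max(events.get(key, 0), int(match.group(3)))
    return events


def timestamps_with_both(log_text):
    """Timestamps at which a throttle message and an MCE notice both appear."""
    throttled, mce = set(), set()
    for line in log_text.splitlines():
        stamp = line[:15]  # classic syslog timestamp, e.g. "Nov  8 13:25:36"
        if THROTTLE.search(line):
            throttled.add(stamp)
        if "[Hardware Error]" in line:
            mce.add(stamp)
    return sorted(throttled & mce)


# The three lines quoted above.
log = """\
Nov  8 13:25:36 ASUS-G75VW kernel: CPU5: Package temperature above threshold, cpu clock throttled (total events = 8809)
Nov  8 13:25:36 ASUS-G75VW kernel: mce: [Hardware Error]: Machine check events logged
Nov  8 13:25:36 ASUS-G75VW kernel: mce: [Hardware Error]: Machine check events logged
"""

print(throttle_events(log))
print(timestamps_with_both(log))
```

A shared timestamp is only a hint that the two are related, not proof.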

----------

## Zucca

 *Roman_Gruber wrote:*   

> For my box, when I run i7z, I see that the multiplier is only 30x instead of 31x for all 4 cores at max on ivybridge. And temperatures are around 90 degrees centigrade.

 My CPU temperature rarely exceeds 60°C.

Anyway, I made some air guides in hopes of getting better cooling. So far, idle temperatures have dropped by about 2°C. I'm also watching the VRM heatsink temperature with my multimeter's temp probe. Now @ 33°C.

Pictures: Before; VRM heatsink 1 (caps there for size comparison); VRM heatsink 2 (just because it looked artsy); An air guide for GPU; An air guide for CPU.

... So I hope this solves it... Although I doubt it. However, the cost was zero, so I only lost some time.

----------

## Roman_Gruber

Let's rephrase my answer.

There is a kernel option which enables verbosity. Then you may see such messages as I do. Then it's cooling related.

You just should enable the verbosity of that error message or better said "warning" message

60 degrees centigrade is not really that bad, as you already mentioned.

Not sure on how the mce differs regarding amd and intel

----------

## Zucca

 *Roman_Gruber wrote:*   

> Let's rephrase my answer.
> 
> There is a kernel option which enables verbosity. Then you may see such messages as I do. Then it's cooling related.
> 
> You just should enable the verbosity of that error message or better said "warning" message

 I've tried to find the kernel config option that enables the verbosity. No dice. I'll go through the edac and mce settings by hand and enable them all in hopes of getting the messages I want.

----------

## Zucca

The CPU air guide definitely made a huge difference. Now, with the CPU under full load, the temperature varies between 50 and 52°C. At first it climbed to around 58°C but eventually settled. The VRM heat sink stays at 49°C. Fingers crossed now...
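For keeping an eye on the CPU side without a probe, the kernel's hwmon sysfs interface can be polled. A minimal sketch, assuming a sensor driver is loaded and exposes `temp*_input` files under `/sys/class/hwmon` (values there are in millidegrees Celsius). The function name is made up, and the VRM heat sink has no such sensor here, so the multimeter is still needed for that:

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {(chip_name, sensor_file): degrees_celsius} for each temp*_input."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        # Fall back to the directory name (e.g. "hwmon0") if no name is exposed.
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millidegrees = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # unreadable or malformed sensor: skip it
            temps[(chip_name, sensor.name)] = millidegrees / 1000.0
    return temps


def format_temps(temps):
    """Render the readings as one 'chip sensor: value' line each."""
    return [f"{chip} {sensor}: {celsius:.1f} C" for (chip, sensor), celsius in temps.items()]
```

Calling `format_temps(read_hwmon_temps())` in a loop under load gives a rough picture of where the temperature settles.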

----------

## dr_wulsen

happy to hear it did have an effect, but i'm still curious how it turns out with the original issue. please keep us updated  :Smile: 

----------

