# Two months of debugging - unstable computer

## uraes

Long story short: I have been trying to stabilize my Gentoo installation for almost two months without success. Until now I thought the problem was in the Gentoo kernel (tried: 3.8.13, 3.10.7, 3.10.10, 3.11.0, 3.11.1), as I was unable to reproduce the instability on any other installation I tried: Win7, Ubuntu, Kubuntu, Mint, Fedora, Estobuntu. I removed all hard disks, the NVIDIA video card and the PCI Ethernet card, and ran memtest for a day. The live OSes showed nothing, but Gentoo kept crashing at random (and before the OS crashed, programs such as gkrellm, firefox and konsole sometimes crashed too): sometimes just the GUI froze, sometimes the keyboard lights were blinking, and sometimes I got random kernel traces on screen.

But now, after running Estobuntu for a full day (a movie was playing almost all the time, I shared an Ubuntu ISO over BitTorrent, etc., so the computer was busy the whole time), I got some new information. In dmesg there were:

```
[ 9026.127373] [Hardware Error]: Machine check events logged
[12869.614938] [Hardware Error]: Machine check events logged
```

The user was kicked out of the GUI session, and in mcelog I found this:

```
root@buntu:~# more /var/log/mcelog
mcelog: Unsupported new Family 6 Model 3c CPU: only decoding architectural errors
mcelog: failed to prefill DIMM database from DMI data
Kernel does not support page offline interface
mcelog: mcelog read: No such device
mcelog: Unsupported new Family 6 Model 3c CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 0
TIME 1380103761 Wed Sep 25 13:09:21 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
mcelog: Unsupported new Family 6 Model 3c CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 1
CPU 2 BANK 0
TIME 1380107608 Wed Sep 25 14:13:28 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
```

Can somebody explain this? I didn't get much help from Google, as the related posts all seemed to be about overclocking. My computer is not overclocked. The fans are running normally and there is no overheating.

Motherboard: Gigabyte GA-Z87X-UD3H s1150

Memory: 4x 8G DDR3 1600C11 Kingston

CPU: Intel Core i7-4770K 3.5G/8M

----------

## Hypnos

Post the output of emerge --info

----------

## uraes

My "emerge --info" output is here: http://pastebin.ca/2458674

I no longer think this is purely a Gentoo problem, as the mcelog in my first post was produced under Estobuntu (the Estonian version of Ubuntu), AND the Gentoo live CD (20121221) was also unstable: three hangs in 24 hours. Gentoo just seems to be more intense or active in some areas, so crashes can happen within 30 minutes.

----------

## ulenrich

CFLAGS=" -march=native -O2 -pipe "

CXXFLAGS=" -march=native -O2 -pipe "

Further you could try Gentoo~unstable release!

----------

## Hu

 *ulenrich wrote:*   

> CFLAGS=" -march=native -O2 -pipe "
> 
> CXXFLAGS=" -march=native -O2 -pipe "
> 
> Further you could try Gentoo~unstable release!

 The OP's current CFLAGS and CXXFLAGS are reasonable.  Adding -march=native might improve performance in some cases, but will not correct problems caused by failing hardware.  Suggesting that he switch to newer packages is also not helpful.  According to the mcelog output, there is a hardware fault.  The particular error claims to have been corrected, but there may be related errors that are not correctable.  The faulty component must be replaced.

----------

## Ant P.

It's happening on different cores, so either the CPU as a whole has some non-thermal issue like bad power, or you've got bad RAM.

----------

## uraes

The -march flag should make no difference, as this is the default. And the problem is no longer Gentoo-only, as I was able to reproduce it on another distro too. The question is: what does this mcelog output mean? My guess is that this is some weird hardware problem?

I'm also adding three images of traces I managed to capture when the computer hung (as you can see, they are random and in pretty "weird" places):

http://picpaste.com/IMG_5822_s-Kj5CxlzB.JPG

http://picpaste.com/IMG_5966_s-rH0bWC6U.JPG

http://picpaste.com/IMG_5976_r-W0ML5OzV.JPG

----------

## uraes

 *Ant P. wrote:*   

> It's happening on different cores, so either the CPU as a whole has some non-thermal issue like bad power, or you've got bad RAM.

 

I'm now trying to monitor temperatures as well, but shouldn't overheating affect the fans too, e.g. shouldn't they run at maximum speed if the CPU thinks it's too hot?

As for RAM: I just started a new run with two modules removed (so the computer has 16 GB now).

----------

## eccerr0r

As MCELOG says, this is a hardware error.

Check your chipset to make sure it's not overheating too.  Sometimes I wonder about Gigabyte boards, they don't have fans on their chipsets but that heatsink gets quite hot.  (I have a Gigabyte Z68AP-D3 and EP43-UD3L boards, neither have fans on the chipset)

Checking with RAM chips removed was a good idea.

Since you have a K-series chip, try to underclock to see if it helps, especially try to see what it does if you reduce BCLK from 100MHz.  Also possibly increasing DRAM and/or chipset voltage.

It's weird that an ubuntu doesn't work, though you should try a stock ubuntu if you can.  Their optimizations tend to allow any CPU to work.

----------

## uraes

Just got a call from warranty repair: the motherboard was broken and has been replaced with an MSI one. We'll see how it works :)

Thanks for every bit of advice.

----------

## lecbee

Hello,

I have pretty much the same error, many many times:

TIME 1383039501 Tue Oct 29 10:38:21 2013

MCG status:

MCi status:

Corrected error

Error enabled

MCA: Unknown Error 5

STATUS 90000040000f0005 MCGSTATUS 0

MCGCAP c09 APICID 0 SOCKETID 0

CPUID Vendor Intel Family 6 Model 60

Hardware event. This is not a software error.

MCE 0

CPU 1 BANK 0

TIME 1383039550 Tue Oct 29 10:39:10 2013

MCG status:

MCi status:

Corrected error

Error enabled

MCA: Unknown Error 5

STATUS 90000040000f0005 MCGSTATUS 0

MCGCAP c09 APICID 2 SOCKETID 0

CPUID Vendor Intel Family 6 Model 60

This is on CentOS 6.4 x86-64

Motherboard: ASUS B85M-E s1150

Memory: 2x 4G DDR3 Crucial

CPU: Intel Core i7-4770 3.4G/8M

The mcelog on CentOS is not up to date. I recompiled it, and the "MCA: Unknown Error 5" is in fact an "MCA: Internal parity error", as you can see in this commit:

https://kernel.googlesource.com/pub/scm/utils/cpu/mce/mcelog/+/bec51ee686f29abd48c6ee4b67cff72135e80156%5E!/

Anyway that doesn't help to fix the error.
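To see where "Internal parity error" comes from, here is a minimal Python sketch that splits the STATUS and MCGCAP values from the logs in this thread into bit fields. The bit positions follow the architectural layout of IA32_MCi_STATUS and IA32_MCG_CAP as I understand it from the Intel SDM; the function names and dictionary keys are made up for this example and are not part of mcelog.

```python
# Sketch only: field positions are my reading of the Intel SDM layout,
# and the helper names below are hypothetical (not an mcelog API).

def decode_mci_status(status):
    """Split an IA32_MCi_STATUS value into its architectural fields."""
    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # VAL: the register holds a logged error
        "overflow": bit(62),         # OVER: an earlier error was overwritten
        "uncorrected": bit(61),      # UC: clear means the error was corrected
        "enabled": bit(60),          # EN: reporting was enabled for this error
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),  # PCC
        # Bits 52:38 hold a corrected-error count when MCG_CAP bit 10 is set.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


def decode_mcg_cap(cap):
    """Split an IA32_MCG_CAP value into the fields used above."""
    return {
        "bank_count": cap & 0xFF,
        "cmci_supported": bool((cap >> 10) & 1),
        "threshold_status_supported": bool((cap >> 11) & 1),
    }


status_fields = decode_mci_status(0x90000040000F0005)
cap_fields = decode_mcg_cap(0xC09)

for name, value in {**status_fields, **cap_fields}.items():
    print(name, value)
```

For the value in these logs, the valid and enabled bits are set, the uncorrected bit is clear, the MCA error code is 0x0005 and the model-specific code is 0x000f, which matches the "Corrected error" and "Error enabled" lines that mcelog prints.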

@uraes

Now that you have your new motherboard, did that fix the problem?

----------

## kheper

I'm getting the same errors as the previous poster with a Xeon E3-1275v3 (Haswell), but only when I'm running VirtualBox with IO-APIC enabled and a FreeBSD/OpenBSD guest compiling ports. If I disable IO-APIC, there are no errors. It also happens under VMware Player with FreeBSD. I have yet to see these errors while not running a VM, and they don't happen while running a Linux VM: for example, I ran emerge world on a Gentoo VM (over 400 packages) without a single error, and I did it twice to be sure. After weeks of uptime there has been no MCE event of this kind outside a VM context; memtest and prime95 torture tests run without errors, and I compiled various things on the Linux host with no errors. I'm using kernel 3.12.

----------

## l3u

I'm also seeing machine check events when running a 32-bit Windows SBS 2003 virtual machine with qemu on my Xeon E3 Haswell system. I found a thread about this on the VMware forums: https://communities.vmware.com/thread/452344. But even after changing the qemu machine, I only got fewer machine check events; they did not go away.

I filed a bug about this in qemu's bug tracker: https://bugs.launchpad.net/qemu/+bug/1307225. Perhaps somebody who experiences the same problem wants to confirm it.

----------

## pa1983

 *uraes wrote:*   

> Just got a call from warranty repair: the motherboard was broken and has been replaced with an MSI one. We'll see how it works :)
> 
> Thanks for every bit of advice.

 

The same happened to me on a Tyan K8WE board with dual Opteron 280 and 8x1 GB PC3200 ECC/REG. I was surfing when the system locked up. I rebooted and was greeted by a kernel crash saying it was a hardware error, not a software error. After some testing I discovered that one memory channel on the board had failed.

I removed both DIMMs in that channel and the kernel booted. If I added them back, I got the same errors you had. I got my hands on a second K8WE board, and both CPU and RAM worked in that. That board died, though, after the capacitors started leaking while I had it in storage, so in the end I never really got around to putting it back together other than for testing. I ended up getting new components.

----------

## l3u

But in contrast to the virtualization issue, this has been a real hardware problem …

----------

## hp3325

This is Intel erratum HSD131.  From http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf:

 *Quote:*   

> 
> 
> HSD131. Spurious Corrected Errors May be Reported
> 
> Problem: Due to this erratum, spurious corrected errors may be logged in the IA32_MC0_STATUS
> ...

 

Ideally, the benign check events would be filtered in the kernel. At least in FreeBSD, the problem has already been addressed:

http://svnweb.freebsd.org/base?view=revision&revision=269052

```
/*
 * Skip spurious corrected parity errors generated by desktop Haswell
 * (see HSD131 erratum) unless reporting is enabled.
 * Note that these errors also have been observed with D0-stepping,
 * while the revision 014 desktop Haswell specification update only
 * talks about C0-stepping.
 */
if (rec->mr_cpu_vendor_id == CPU_VENDOR_INTEL &&
    rec->mr_cpu_id == 0x306c3 && rec->mr_bank == 0 &&
    rec->mr_status == 0x90000040000f0005 && !intel6h_HSD131)
	return (1);

return (0);
```
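The same signature check can be run over a text mcelog like the ones pasted in this thread. Below is a rough Python sketch, assuming each record ends with a "CPUID Vendor" line as it does in those pastes. mcelog's text output does not show the stepping, so this matches on family 6 / model 60 rather than on the exact CPUID 0x306c3 that the FreeBSD code tests, which makes it a looser check. `parse_records` and `looks_like_hsd131` are hypothetical names, not part of any tool.

```python
import re

# Status value reported for the spurious events (taken from the logs above).
HSD131_STATUS = 0x90000040000F0005

SAMPLE = """\
MCE 0
CPU 1 BANK 0
TIME 1380103761 Wed Sep 25 13:09:21 2013
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
MCE 1
CPU 2 BANK 0
TIME 1380107608 Wed Sep 25 14:13:28 2013
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
"""


def parse_records(text):
    """Collect cpu, bank, status and CPU identity for each record in the text."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if m := re.match(r"CPU (\d+) BANK (\d+)", line):
            current["cpu"], current["bank"] = int(m[1]), int(m[2])
        elif m := re.match(r"STATUS ([0-9a-fA-F]+)", line):
            current["status"] = int(m[1], 16)
        elif m := re.match(r"CPUID Vendor (\S+) Family (\d+) Model (\d+)", line):
            current["vendor"] = m[1]
            current["family"], current["model"] = int(m[2]), int(m[3])
            # Assumption: the CPUID line is the last line of a record.
            records.append(current)
            current = {}
    return records


def looks_like_hsd131(rec):
    """True if the record carries the signature described in the erratum."""
    return (
        rec.get("vendor") == "Intel"
        and rec.get("family") == 6
        and rec.get("model") == 60
        and rec.get("bank") == 0
        and rec.get("status") == HSD131_STATUS
    )


records = parse_records(SAMPLE)
for rec in records:
    print(rec["cpu"], hex(rec["status"]), looks_like_hsd131(rec))
```

A record that matches is only consistent with the erratum; as the motherboard replacement earlier in the thread shows, a machine can log this signature and still have a genuine hardware fault.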

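To stop the kernel from handling corrected errors, use the mce=ignore_ce kernel boot option. Note that this disables polling and CMCI for all corrected errors, not only the spurious HSD131 ones. On Ubuntu, set it in /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="mce=ignore_ce"

then run update-grub and reboot. For Red Hat, append the following to the kernel line in /boot/grub/grub.conf:

mce=ignore_ce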
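After rebooting, it is worth checking that the option actually reached the kernel. Here is a small Python sketch that looks for `mce=ignore_ce` in a kernel command line string; on a running Linux system that string can be read from /proc/cmdline. The function names are made up for this example, and the sample command lines are invented.

```python
def mce_options(cmdline):
    """Return the value of every mce= parameter on a kernel command line."""
    return [tok.split("=", 1)[1] for tok in cmdline.split() if tok.startswith("mce=")]


def ignores_corrected_errors(cmdline):
    """True if the command line carries the mce=ignore_ce option."""
    return "ignore_ce" in mce_options(cmdline)


# Invented sample command lines; on a real system use:
#     cmdline = open("/proc/cmdline").read()
with_option = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet mce=ignore_ce"
without_option = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet"

print(ignores_corrected_errors(with_option))
print(ignores_corrected_errors(without_option))
```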

----------

