# System occasionally hangs w/ no trace.  HW problem?

## john.newman

Over the past month I've been working to get this PC updated a bit.  It's been a long list of mini projects.  Everything is going well, except ...   :Razz: 

I sadly get a full system hang from time to time.  Everything stops.  If I have music playing it will stutter.  Even alt+sysrq doesn't work; I have to go to the power button.   :Mad:   There seems to be no pattern for when this happens, idle or varying workloads.  It's sporadic and occasional - twice yesterday, the day before smooth, once the day before that, etc.

If I had a trace to look at that would be helpful. Can anyone explain how I can go about trying to get a trace (if possible)? I do not have a serial console.  

I've changed a lot since it was stable - so it could be a hardware problem, new bios version problem, bios setting problem (there's a lot going on here), or new kernel problem.  The previous "build" of this machine was on 3.12 and was fine.  With the new build, I've been on 4.0, 4.1, 4.2, 4.3 and they all hang.  dmesg looks quite good and otherwise everything is functioning very well.  The long smartmon test passes on each drive. There is no dust, and the machine is properly watercooled with good airflow and only touches 43C under load. Bios overclock type settings are all stock.

But I did have this show up in MCELOG two days ago at a random time, but there was no crash at all that particular day.  I ran memtest overnight and there were 0 errors.  I've switched the DIMMs in bank 6 and 7 so that may provide future info if the slot or DIMM is bad.

```
Hardware event. This is not a software error.
MCE 0
CPU 8 BANK 7
MISC 123a4100180a87
TIME 1449180273 Thu Dec  3 17:04:33 2015
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCA: BUS Level-3 Generic Generic Other-transaction Request-did-not-timeout Error
QPI
STATUS 8800004000310e0f MCGSTATUS 0
MCGCAP 1000c14 APICID 20 SOCKETID 1
CPUID Vendor Intel Family 6 Model 45
Hardware event. This is not a software error.
MCE 1
CPU 8 BANK 7
MISC 123a4200180a19
TIME 1449180279 Thu Dec  3 17:04:39 2015
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCA: BUS Level-3 Generic Generic Other-transaction Request-did-not-timeout Error
QPI
STATUS 8800004000310e0f MCGSTATUS 0
MCGCAP 1000c14 APICID 20 SOCKETID 1
CPUID Vendor Intel Family 6 Model 45
Hardware event. This is not a software error.
MCE 2
CPU 8 BANK 7
MISC 12194000105e05 ADDR 25a2700
TIME 1449180292 Thu Dec  3 17:04:52 2015
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: BUS Level-3 Generic Generic Other-transaction Request-did-not-timeout Error
QPI
STATUS 8c00004000310e0f MCGSTATUS 0
MCGCAP 1000c14 APICID 20 SOCKETID 1
CPUID Vendor Intel Family 6 Model 45
```
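To see whether these reports move after swapping hardware, the records can be tallied by socket, CPU and bank. The following is a minimal Python sketch, assuming the log keeps the layout shown here (a `CPU n BANK m` line, followed later in the same record by a line containing `SOCKETID k`). `tally_mcelog` and `SAMPLE` are hypothetical names, and the sample is trimmed from the first two records. Note that BANK in mcelog output is the index of the machine-check register bank that reported the event, which is not the same thing as a DIMM slot number.

```python
import re
from collections import Counter


def tally_mcelog(text):
    """Count mcelog records by (socket, cpu, machine-check bank)."""
    counts = Counter()
    cpu = bank = None
    for line in text.splitlines():
        # Each record carries one "CPU n BANK m" line ...
        m = re.match(r"\s*CPU (\d+) BANK (\d+)", line)
        if m:
            cpu, bank = int(m.group(1)), int(m.group(2))
            continue
        # ... and, later in the same record, a line with "SOCKETID k".
        m = re.search(r"SOCKETID (\d+)", line)
        if m and cpu is not None:
            counts[(int(m.group(1)), cpu, bank)] += 1
            cpu = bank = None
    return counts


# Trimmed excerpt of the first two records (assumption: layout unchanged).
SAMPLE = """\
Hardware event. This is not a software error.
MCE 0
CPU 8 BANK 7
STATUS 8800004000310e0f MCGSTATUS 0
MCGCAP 1000c14 APICID 20 SOCKETID 1
Hardware event. This is not a software error.
MCE 1
CPU 8 BANK 7
STATUS 8800004000310e0f MCGSTATUS 0
MCGCAP 1000c14 APICID 20 SOCKETID 1
"""

print(tally_mcelog(SAMPLE))  # Counter({(1, 8, 7): 2})
```

If the same (socket, CPU, bank) triple keeps appearing after the DIMMs have been moved, that would suggest the reports do not follow the memory modules.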

Do I have an actual RAM issue and memtest is not finding it?  Is there a good CPU / PSU tester program from linux?  And again how can I see a trace if there is one while this hang happens?  It would be great to fix this.  Any suggestions are very appreciated.  I vaguely remember having this same problem in the very distant past with a different board, but I was able to fix it, I want to say it was something with an ALSA module....

----------

## Logicien

There is no reason for the moment to suspect that the RAM or the CPU is related to the freeze, because Gentoo was stable with version 3 of the Linux kernel. Unless the system freezes with any operating system, it looks to be a software problem.

Check your logs, including dmesg, from before the system freeze to see if there are warnings, errors, oopses, etc. that can give you hints about the cause. Some kernel parameters stop freezes in some situations.

You can try compiling your kernel using the default options of Genkernel. Custom kernels are tough to configure well, and something missing and/or something extra can make them dysfunctional. Taking a .config file from version 3 and using it with version 4 of the Linux kernel requires running make oldconfig.

It's important for you to know whether version 4 of the Linux kernel is the cause, and/or your kernel configuration, and/or something else. With Funtoo, you can install a precompiled Linux kernel from Debian, Ubuntu and other distributions. This can give you information depending on how they behave.

If the hardware has changed a lot since Gentoo was stable, it may be a good idea to review your make.conf configuration and run emerge --emptytree @world to recompile all your packages, but this takes a long time.

When a freeze occurs, the first thing I do is run the CPU on the performance governor with no frequency scaling and disable any power saving in the hardware. I am thinking about disabling swap space too. No overclocking is a rule.
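As a rough illustration of the governor step: on kernels with cpufreq enabled, each CPU exposes a `scaling_governor` file under `/sys/devices/system/cpu/cpuN/cpufreq/`. Below is a minimal Python sketch, assuming that sysfs layout and root privileges. `set_governor` and its `cpu_root` parameter are hypothetical names, introduced here so the function can be pointed at a scratch directory for a dry run.

```python
from pathlib import Path


def set_governor(governor="performance", cpu_root="/sys/devices/system/cpu"):
    """Write `governor` into every cpuN/cpufreq/scaling_governor under cpu_root.

    Returns the list of files that were written. On a real system this
    needs root, and the governor must be offered by the running kernel.
    """
    changed = []
    # "cpu[0-9]*" matches cpu0, cpu1, ... but skips cpufreq/ and cpuidle/.
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(cpu_root).glob(pattern)):
        path.write_text(governor + "\n")
        changed.append(str(path))
    return changed
```

Calling `set_governor()` as root applies the setting until the next reboot. Whether `performance` is available depends on the kernel configuration and the cpufreq driver in use.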

Remember that Adobe Flash with hardware acceleration turned on in its configuration is good at making the system freeze. Do the freezes occur only when using X, or in the console too?

----------

## john.newman

Thanks for the reply.  Of course since posting that the machine was up and stable all weekend.  It will happen again though.

I've had the same gentoo install since around 2004.  I've upgraded the kernel with oldconfig probably a dozen times.  I've changed disks and motherboards in place three or four times, but really it's been the same "install" the entire time.  The config I have for this board is pretty good I think, I've gone through it with a fine tooth comb quite a bit.  I've been doing my own kernel builds for almost 15 years so I'm mostly comfortable with what's in there.  I'm fairly certain it is not a kernel config issue.  Nor do I really suspect a severe issue in kernel 4.x.  But I haven't ruled that out yet.

I suspect a hardware problem as I had the entire machine apart for cleaning, and I may have damaged something sensitive.  I was careful but it's possible something broke.  Also the MCELOG errors are really hinting at something ... but memtest found nothing in 7 hours.  Maybe memtest for 3 days will.  

I think what I need to try next is somehow BOOT a certain other OS (ewww) and do some stress testing there to see what happens.  I'll have to find an old HD and throw something together.  That will rule a lot of things out very quickly.

But I swear I had this same problem with my previous board back around 2011 and was able to fix it.  I vaguely remember it was something with ALSA.  It seems to happen more often with sound playing but also with sound off.  I do not have flash installed.  I wish I could remember what that was before....    :Evil or Very Mad: 

I have cat /proc/kmsg open all the time and when the machine hangs ... it hangs.  I'll have to set up kdump, which I had a long time ago, to see if any trace is available.

----------

## szatox

Just a hint that may or may not be useful... Remove kernel modules you don't use. Support for hardware you don't have _should_ not hurt; however, my wifi suddenly started working again after I dropped a bunch of drivers for other, unrelated devices. This clearly means some interference occurred in some place I did not think of.

In your case it could have been ALSA. Or, perhaps, a driver for some chipset you don't have.

----------

## Tassie_Tux

 *john.newman wrote:*   

> But I did have this show up in MCELOG two days ago at a random time, but there was no crash at all that particular day.  I ran memtest overnight and there were 0 errors.  I've switched the DIMMs in bank 6 and 7 so that may provide future info if the slot or DIMM is bad.

 

I am experiencing similar machine check errors using similar processors. The good news is that I believe these errors are (fortunately) 'false positives' relating to a known hardware fault with our processors. The bad news is that they therefore do not explain your system hang issue.

The following is a lengthy explanation as to how I came to this conclusion. In brief, an appropriate BIOS update could suppress these machine check errors.

My situation

I have just assembled a dual Xeon E5-2670 based system and have moved my existing install to this system. This and the previous system are both using the Sandy Bridge microarchitecture (CFLAGS="-march=sandybridge") so for the migration to work I simply tweaked my kernel for the changed hardware. It worked! Having recently upgraded glibc and gcc I took the upgrade as an opportunity to test the new hardware with a rebuild of everything (emerge -e @world). It has been during periods of heavy CPU load that I see these errors via mcelog. They seem to be identical to yours:

```
Hardware event. This is not a software error.
MCE 0
CPU 8 BANK 7
MISC 1df87b000d9eff
TIME 1460296528 Sun Apr 10 23:55:28 2016
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCA: BUS error: 1 8 Level-3 Generic Generic Other-transaction Request-did-not-timeout
Running trigger `bus-error-trigger'
QPI
STATUS 8800004000310e0f MCGSTATUS 0
MCGCAP 1000c14 APICID 20 SOCKETID 1
CPUID Vendor Intel Family 6 Model 45
```

My system does not hang, compilations do not fail and I am not noticing any performance degradation. The fact that these errors are 'corrected errors' is somewhat reassuring but not enough to eliminate my anxiety!

Having no success with web/forum searches, I reviewed the Xeon E5 documentation from Intel. The document Intel Xeon E5 Product Family Specification Update (February 2016 Revision 019) lists errata relating to these processors. There was one erratum on page 41 that drew my attention:

BT65. Intel QPI Interface Calibration May Log Spurious Bus and Interconnect Error Machine Checks

Problem:

The Intel QPI interface Physical Layer performs calibration across all 20 of the lanes and reports the success or failure of the calibration process. Due to this erratum, the processor may detect spurious errors during the calibration of the Intel QPI interface. The bus and interconnect errors are reported with the IA32_MCi_STATUS.MCACOD (bits [15:0]) with a value of 0000_1xx0_0000_1111 (where x is zero or one).

Implication:

The processor may log spurious bus and interconnect error machine check reports during Intel QPI calibration.

Workaround:

It is possible for the BIOS to contain a workaround for this erratum. A BIOS code change has been identified and may be implemented as a workaround for this erratum.

Status:

For the affected steppings, see the Summary Table of Changes.

Section 15.9 of the document Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 - System Programming Guide relates to the interpretation of machine check error codes ("Interpreting The MCA Error Codes"). The pattern in erratum BT65 (0000_1xx0_0000_1111, with two free bits) covers four binary codes, among them 0000_1000_0000_1111 and 0000_1110_0000_1111.

Using the tables under section 15.9 for Bus and Interconnect errors I interpret the first code (0000_1000_0000_1111) as meaning

Compound Error Code: 'Bus and Interconnect Error'

Correction Report Filtering: 'Normal' (which only applies to corrected errors)

Level: 'Generic'

Request: 'Generic Error'

Bus and Interconnect Errors: PP='Local processor originated request', T='Request did not time out', II='Other transaction'

with the second binary code having identical meaning except for PP='Generic'.

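My interpretation of these BT65 codes is that they are consistent with what I am seeing in /var/log/mcelog. The low 16 bits of the logged STATUS value 8800004000310e0f are 0e0f, which is 0000_1110_0000_1111 in binary, the second code above: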

```
MCA: BUS Level-3 Generic Generic Other-transaction Request-did-not-timeout Error
QPI
```
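The same check can be done mechanically on the STATUS values from the logs. The following is a minimal Python sketch, assuming the bus and interconnect code layout 000F_1PPT_RRRR_IILL from the Intel manual section cited above and the BT65 pattern 0000_1xx0_0000_1111. The names `matches_bt65` and `decode_bus_error` are hypothetical, and only the request type seen in these logs (RRRR = 0000) is given a name.

```python
# Field meanings for bus and interconnect codes (000F_1PPT_RRRR_IILL).
PP = {0: "local processor originated request",
      1: "local processor responded to request",
      2: "local processor observed error as third party",
      3: "generic"}
T = {0: "request did not time out", 1: "request timed out"}
II = {0: "local memory access", 1: "reserved", 2: "I/O", 3: "other transaction"}
LL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic"}
RRRR = {0: "generic error"}  # other request types are left numeric here

BT65_MASK = 0xF9FF   # every MCACOD bit except the two "x" bits (10:9)
BT65_VALUE = 0x080F  # 0000_1000_0000_1111, i.e. the pattern with x = 0


def matches_bt65(status):
    """True if MCACOD (bits 15:0 of MCi_STATUS) fits 0000_1xx0_0000_1111."""
    return (status & 0xFFFF & BT65_MASK) == BT65_VALUE


def decode_bus_error(status):
    """Split a bus and interconnect MCi_STATUS value into named fields."""
    mcacod = status & 0xFFFF
    # Bits 15:13 must be 000 and bit 11 must be 1 for this code family.
    if (mcacod & 0xE800) != 0x0800:
        raise ValueError("not a bus and interconnect error code")
    r = (mcacod >> 4) & 0xF
    return {
        "mcacod": f"{mcacod:016b}",
        "uncorrected": bool((status >> 61) & 1),  # UC flag
        "misc_valid": bool((status >> 59) & 1),   # MISCV flag
        "addr_valid": bool((status >> 58) & 1),   # ADDRV flag
        "pp": PP[(mcacod >> 9) & 3],
        "timeout": T[(mcacod >> 8) & 1],
        "request": RRRR.get(r, f"request type {r}"),
        "transaction": II[(mcacod >> 2) & 3],
        "level": LL[mcacod & 3],
    }


for status in (0x8800004000310E0F, 0x8C00004000310E0F):
    print(hex(status), matches_bt65(status), decode_bus_error(status))
```

Both STATUS values that appear in this thread fit the BT65 pattern and decode to a corrected error with PP, level and request all generic, which agrees with the mcelog text.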

My processor model and stepping (C2) is in the list of processors affected by this hardware bug.

This information, together with the fact that these MCEs do not correspond with any obvious problem (e.g. system hangs), gives me confidence that these MCEs relate to BT65 and are therefore 'false positives'.

I have yet to try a BIOS update as a way of suppressing these errors. My preference would be to instead investigate a way to configure mcelog so that they are not logged.   :Cool: 

----------

