# MCEs and random system freezes [resolved]

## Perennial

Hi all,

Since last Monday (just after I had joined these forums :Twisted Evil: ) my system has been behaving oddly: it freezes at random. It happens in different environments (during boot, in X, without X) and under different loads (compiling, just browsing, even with only bash open).

The only way out of these freezes seems to be a hard reset, as the keyboard stops working (not even numlock responds), and so does the mouse. Even the blinking cursor stops blinking! So it is a full freeze, and hence I have no message about what happens at that very moment....

I started to try different kernels, as I always have a plethora of kernels available (gentoo-dev-sources-2.6.8,2.6.10,2.6.11)... 

But to no avail: They all suddenly started to give this behaviour. 

And moreover in dmesg I now always find a continuous stream of messages like:

```
MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.

Bank 1: d000000000000171

MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.

Bank 1: 9000000000000171

MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.

Bank 1: 9000000000000171

MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.

Bank 1: 9000000000000171

MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.

Bank 1: 9400000000000151

MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.

Bank 1: 9000000000000171

MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.

Bank 1: 9000000000000171

MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.

Bank 1: 9400000000000151

```

I had never seen these before. So I think the system lockups and these MCEs are related....
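With a continuous stream like this, it can help to see how many distinct status values actually occur. The following is only a rough sketch: the function name `tally_mce_banks` is made up for illustration, and it assumes the `Bank N: <16 hex digits>` line format shown in the dmesg excerpt above.

```python
import re
from collections import Counter

# Matches lines such as "Bank 1: 9000000000000171" (format assumed from
# the dmesg excerpt in this post).
BANK_LINE = re.compile(r"Bank\s+(\d+):\s*([0-9a-fA-F]{16})")


def tally_mce_banks(dmesg_text):
    """Count how often each (bank, status) pair appears in dmesg output."""
    counts = Counter()
    for match in BANK_LINE.finditer(dmesg_text):
        bank = int(match.group(1))
        status = int(match.group(2), 16)
        counts[(bank, status)] += 1
    return counts
```

Feeding it the saved output of `dmesg` and iterating over `tally_mce_banks(text).most_common()` lists the most frequent status values first, which makes it easier to pick the values worth decoding.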

MCE is Machine check exception. I have it enabled in all my kernels.

My CPU is an AMD Duron. Here is what /proc/cpuinfo has to say about it:

```
processor       : 0

vendor_id       : AuthenticAMD

cpu family      : 6

model           : 3

model name      : AMD Duron(tm) processor

stepping        : 0

cpu MHz         : 750.417

cache size      : 64 KB

fdiv_bug        : no

hlt_bug         : no

f00f_bug        : no

coma_bug        : no

fpu             : yes

fpu_exception   : yes

cpuid level     : 1

wp              : yes

flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr pni syscall mmxext 3dnowext 3dnow

bogomips        : 1482.75

```

There is a tool to decode MCEs, called parsemce, but there seems to be no real documentation for it. You can find the source code by googling for parsemce.c

I now have this tool, but I don't know how to use it properly. Here is the usage info:

```
usage:  parsemce [options]

  options:  -V <version number>

            -e <MCE status code>

            -b <bank number>

            -s <bank status code>

            -a <bank address>

            -f <filename, with MCE dump inside>

            -i <get MCE dump from stdin>

```

The thing is, it seems to require an MCE status code, and I don't have one. IMHO the numbers I have in dmesg are bank status codes. So if I pick 0 for the MCE status, I get something like:

```
./parsemce -e 0 -b 1 -s 9000000000000171 -a 0

Status: (0) Restart IP invalid.

parsebank(1): 9000000000000171 @ 0

        External tag parity error

        Error enabled in control register

        Memory heirarchy error

        Request: Generic error

        Transaction type : Instruction

        Memory/IO : Reserved

```

But as 0 was just a guess, I thought I'd try some different number to see how it would influence the result. Here is what I got when I picked 1:

```
./parsemce -e 1 -b 1 -s 9000000000000171 -a 0

Status: (1) Restart IP valid.

parsebank(1): 9000000000000171 @ 0

        External tag parity error

        Error enabled in control register

        Memory heirarchy error

        Request: Generic error

        Transaction type : Instruction

        Memory/IO : Reserved

```

Which is the complete opposite!

If I use the number from dmesg as the MCE status, I get:

```
./parsemce -e  9000000000000171

Status: (-8070450532247928463) Restart IP valid.
```

Which is obviously wrong, so I ask: How does one use this tool correctly?
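The odd negative number looks like the same 64-bit value printed as a signed integer. A small sketch to check that (the helper name `as_signed_64` is made up here; the idea that `-e` only looks at the lowest bit is an assumption, though it fits all three runs shown above):

```python
def as_signed_64(value):
    """Reinterpret an unsigned 64-bit value as two's-complement signed."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value & (1 << 63) else value


status = 0x9000000000000171
print(as_signed_64(status))  # -8070450532247928463, the number parsemce printed
print(status & 1)            # 1: lowest bit set, like the "-e 1" run
```

So the value given to `-e` was probably read correctly and merely displayed as signed; it is still a bank status, not the status that `-e` expects.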

Has anybody had any success in decoding MCEs?

Thanks in advance.

Oh, BTW, I know I have some bad bits in my memory DIMMs; that is why I have my kernels patched with the badram patch.

So memtest86 does give errors, but these are errors I already know about. I thought that perhaps a new memory error had shown up, but so far I haven't been able to track one down. Also, when I previously ran a kernel without the badram patch on this bad memory, my system would not freeze but would get into a loop of Oopses, which was sometimes finite and sometimes not....

Any hint is greatly appreciated.

P.

Phew, I managed to type all this without my system locking up  :Very Happy:

Last edited by Perennial on Sat Mar 19, 2005 11:26 am; edited 2 times in total

----------

## shadow255

My first thought on seeing your problem is that you may have a different hardware problem.  System hard-locks quite frequently are the result of CPU overheating.  Have you been monitoring your motherboard and CPU sensors output to see what sort of temperatures your components are reaching?  After such a hard-lock, I'd try checking in the CMOS setup program for your system's BIOS to see what temperatures and fan speeds are being reported if you don't have lm-sensors configured.  Good luck, whatever you find.

----------

## Qdot

```
parsebank(1): 9000000000000171 @ 0

        External tag parity error

        Error enabled in control register

        Memory heirarchy error

        Request: Generic error

        Transaction type : Instruction

        Memory/IO : Reserved 
```

-e specifies the status code, and it's either valid or invalid, nothing fancy.

The error number 9000000000000171 gives you the same answer in both cases, and it should be so.
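To see why the three distinct values in the dmesg log are so similar, the top flag bits can be pulled apart by hand. This is a sketch under assumptions: the bit positions follow the generic x86 machine-check bank status layout from the vendor manuals, which may not match every detail of this particular Duron, and the names `STATUS_FLAGS` and `decode_status` are made up for illustration.

```python
# Flag bits in the upper part of a 64-bit bank status value.
# Assumption: generic x86 machine-check layout; model-specific details
# of a given CPU may differ.
STATUS_FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was not corrected
    (60, "EN"),     # reporting enabled in the control register
    (59, "MISCV"),  # misc register valid
    (58, "ADDRV"),  # address register valid
    (57, "PCC"),    # processor context corrupt
]


def decode_status(status):
    """Return (names of set flags, low 16-bit error code)."""
    flags = [name for bit, name in STATUS_FLAGS if status & (1 << bit)]
    return flags, status & 0xFFFF


for value in (0x9000000000000171, 0xD000000000000171, 0x9400000000000151):
    flags, code = decode_status(value)
    print(f"{value:016x}: flags={flags} code={code:#06x}")
```

Under these assumptions the `d...` value differs from the `9...` value only by the overflow flag, and the `94...` value additionally has a valid address and a different low error code (`0x151` rather than `0x171`). None of them has the uncorrected flag set, which fits the "non fatal, correctable" wording in dmesg.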

I'd say it's a memory parity problem, either due to overclocking or a bad RAM chip/bus, and for some reason the system tries to access the bad part of it.

I'd recommend testing with known-good RAM for a day or so, and then trying to widen the badram patterns by (at least) +- your L2 cache size (or, if you really know better, the maximum refill size...). It's quite common with the badram patch that, although the kernel will not attempt to access the bad part directly, a cache refill starting just below the bad area may enter it and cause weird side effects.

----------

## Perennial

Hi,

 *shadow255 wrote:*   

> System hard-locks quite frequently are the result of CPU overheating. Have you been monitoring your motherboard and CPU sensors output to see what sort of temperatures your components are reaching?

I have always monitored my CPU temperature and fan speeds with sys-apps/lm_sensors.

Temps never seem to go beyond 48C; even under load they only reach something like 45-46C.

I also use sys-apps/athcool, to cool the cpu down when not used. 

So I doubt it may be overheating, but I am concerned that it may be a defect in the cpu (as I don't have a backup cpu...). 

Hence my strong interest in decoding the MCEs. 

Thanks for your reply anyway.

 *Qdot wrote:*   

> I'd say it's a memory parity problem, either due to overclocking or a bad RAM chip/bus, and for some reason the system tries to access the bad part of it.

Aha, so a memory problem after all.

 *Qdot wrote:*   

> and then trying to widen the badram patterns by (at least) +- your L2 cache size (or, if you really know better, the maximum refill size...)

 Interesting! I didn't know this. Thank you for your hint! 

So what should I take? Currently my memory config is 1x128MB and 1x256MB of SDRAM.

I believe all the errors I have had are located on the 256MB DIMM, which should come after the good 128MB one in the linear address space, if all is well.

The badram patterns I have are (only 2 thus far  :Smile: ):

```
badram=0x0e2e51e8,0xfffffffc,0x0e78d4b8,0xfffffffc
```

So you suggest increasing these masks? What should I take then? Is the 64K reported in /proc/cpuinfo the L2 cache size?

Something like:

```
badram=0x0e2e51e8,0xfffffff8,0x0e78d4b8,0xfffffff8
```

perhaps?
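To get a feeling for how much memory each mask actually excludes, the covered range can be computed directly. This is a sketch under assumptions: it takes the usual reading of a badram pair, where an address counts as bad when it agrees with the pattern in every bit position where the mask has a 1, and it only handles masks that are a run of ones followed by zeros. The function name `badram_range` is made up for illustration.

```python
def badram_range(addr, mask, width=32):
    """Return (first, last, size) of the region matched by one badram pair.

    Assumes the mask is a run of ones followed by zeros, so the matched
    addresses form one contiguous block.
    """
    full = (1 << width) - 1
    wildcard = ~mask & full      # bits that are allowed to vary
    first = addr & mask          # lowest matching address
    return first, first | wildcard, wildcard + 1


for mask in (0xFFFFFFFC, 0xFFFFFFF8, 0xFFFF0000):
    first, last, size = badram_range(0x0E2E51E8, mask)
    print(f"mask {mask:#010x}: {first:#010x}-{last:#010x} ({size} bytes)")
```

Read this way, going from `0xfffffffc` to `0xfffffff8` only widens each hole from 4 to 8 bytes. Covering a cache-sized neighbourhood of 64K would need a mask like `0xffff0000`.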

----------

## Perennial

Hi all,

I finally managed to get my system booting again after it had crashed this morning. It took me more than 20 tries; it now always locked up during boot.

I had even set the badram masks much wider (mask=0xffff0000), but to no avail.

I kept on getting these MCEs and lockups, and now even oopses during boot.

As a final resort I decided to underclock my CPU and relax all memory timings. 

And it wasn't until then that I was able to boot it up ... Weird. 

It had been working fine in the old config for more than 4 years....

I still wonder: were these MCEs caused by a faulty CPU or a faulty DIMM, or is that impossible to tell?

If there is anybody out there who is able to tell me how to decode these MCEs correctly, I'm still all ears. 

As for now I'll see how long the system will keep up. I'll keep you posted.

Thanks for all the support thus far.

----------

## Perennial

I think these MCEs were caused by my CPU. As I said earlier I had to underclock my cpu to get my system stable again. Very odd indeed. Then the MCEs were gone along with the instability.

I had also been having memory errors for quite some time (that is why I used badram).

I was able to get rid of those as well, by underclocking my memory!

Seems my hardware is thinking about taking it easy?!

Hopefully I won't have to lower the speeds any further, as compiling already takes noticeably longer. :Crying or Very sad:

My system has been stable since I underclocked it; I have not experienced a single freeze since.

Marking thread [resolved].

----------

## Fourbissime

Weird, I got almost exactly the same kind of problem, starting the same day. Thanks for your output on this; maybe it will help me troubleshoot my system.

----------

## Yanaar

On my brand new amd64 box, the reason for the MCEs was the IDE cable.

Accidentally pressing (instead of the SysRq combination) ALT-GR and the key right next to Print (Scroll Lock?? I'm German  :Laughing:  ) revealed something which could be an image of the RAM just before it dies, and ide and ide-dma were always mentioned in it.

Changing the cable solved my problem, but it took me nearly 2 months to find that out.

Maybe this helps someone who's searching for MCE   :Laughing:

----------

