# [Solved] MCE ECC error on the NB

## jpsollie

While idle, the computer does not log anything unusual.

When under load, the computer logs the following on a regular basis:

```
[173100.111033] mce: [Hardware Error]: Machine check events logged
[173100.111039] [Hardware Error]: Corrected error, no action required.
[173100.113189] [Hardware Error]: CPU:8 (15:1:2) MC4_STATUS[-|CE|MiscV|-|AddrV|-|CECC]: 0x9c67400040080a13
[173100.113449] [Hardware Error]: Error Addr: 0x00000008bfe59e80
[173100.113583] [Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.
[173100.113874] EDAC MC1: 1 CE on mc#1csrow#1channel#1 (csrow:1 channel:1 page:0x8bfe59 offset:0xe80 grain:0 syndrome:0x40ce)
[173100.113875] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
```

This looks like a hardware error. Okay. But does the error occur in DRAM or in the L3 cache?

If it is in DRAM: is bank 15 the defective DRAM slot? -> That DRAM slot is on the 2nd CPU, but cpu8 is on the first CPU :-/

If it is in the CPU: I should replace CPU0 to replace the L3 cache, right?

Thanks

[Moderator edit: added [code] tags to preserve output layout. -Hu]
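For readers decoding a similar log: the EDAC line already reports the error address split into `page` and `offset`, and the bracketed flags after `MC4_STATUS` mirror bits of the raw status value. The sketch below redoes that arithmetic in Python. It assumes 4 KiB pages (a shift of 12) and uses only the architectural x86 machine-check status bit positions; the function names are made up for illustration. As far as I know, the `(15:1:2)` field is the CPU family:model:stepping in hex rather than a bank number, but treat that as an assumption too.

```python
# Sketch under stated assumptions: 4 KiB pages, architectural MCi_STATUS bits.
PAGE_SHIFT = 12  # assumption: 4 KiB pages

# Architectural x86 MCi_STATUS flag bit positions.
STATUS_FLAGS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57,
}

def split_address(addr):
    """Split a physical address into (page, offset)."""
    page = addr >> PAGE_SHIFT
    offset = addr & ((1 << PAGE_SHIFT) - 1)
    return page, offset

def decode_status(status):
    """Return {flag_name: bool} for the architectural status flags."""
    return {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}

page, offset = split_address(0x00000008bfe59e80)
print(hex(page), hex(offset))  # 0x8bfe59 0xe80

flags = decode_status(0x9c67400040080a13)
# UC is clear, so the error was corrected ("CE" in the log).
print(flags["UC"], flags["MiscV"], flags["AddrV"])  # False True True
```

The page and offset match the values in the `EDAC MC1` line, which is a quick sanity check that the MCE report and the EDAC report describe the same event.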

----------

## mike155

I'm not an expert, but here are my recommendations:

1) Collect additional data. The commands below will help you:

```
dmidecode --type memory
edac-util -v
```

Look at the number of corrected/uncorrected errors in the output of 'edac-util -v'.

If all values are 0, your DIMMs are probably fine, but your CPU may have a problem. 

If you get nonzero values for one DIMM, it's probably the DIMM which is faulty. 

If you get nonzero values for multiple (or even all) DIMMs, either your mainboard or the memory controller may be faulty.  

2) Look at the files in /sys/devices/system/edac/mc and subdirectories:

```
cd /sys/devices/system/edac/mc
tree .
```

The files and their contents will give you additional information.

3) emerge memtest86 and/or memtest86+, boot one of those images and run the memory tests. But make sure to TURN OFF ECC in memtest86/memtest86+. Otherwise memory errors will be corrected and the programs won't show you errors.

4) If you replace the faulty DIMM: don't throw it away. Label it as faulty and keep it. You can use it to test whether ECC works on a machine. Sometimes developers ask for faulty DIMMs to test their hardware and software.
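For step 2, the counters can also be collected with a few lines of Python instead of reading each file by hand. This is only a sketch: it assumes the counter files have names ending in `ce_count` or `ue_count` (for example `ce_count`, `ue_count`, `ch1_ce_count`), which is what the legacy csrow layout uses on the kernels I have seen; other kernels may lay things out differently. The function name `read_edac_counters` is made up for this example.

```python
from pathlib import Path

def read_edac_counters(root="/sys/devices/system/edac/mc"):
    """Return {relative_path: count} for every *ce_count / *ue_count file under root."""
    base = Path(root)
    counters = {}
    for path in sorted(base.rglob("*_count")):
        # Keep only corrected/uncorrected error counters (assumed naming).
        if not path.name.endswith(("ce_count", "ue_count")):
            continue
        try:
            counters[str(path.relative_to(base))] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Unreadable or non-numeric file: skip it.
            continue
    return counters

if __name__ == "__main__":
    # Print only the counters that are not zero.
    for name, value in read_edac_counters().items():
        if value:
            print(name, value)
```

On a healthy machine this prints nothing. Any path it does print points at the memory controller, csrow and channel to look at first.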

Good Luck!

----------

## jpsollie

Hi Mike,

The dmidecode command does not show anything error-specific. Are you also using version 3.1?

Anyway, edac-util did:

```
linuxserver backup # edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: mc#1csrow#0channel#0: 0 Corrected Errors
mc1: csrow0: mc#1csrow#0channel#1: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: mc#1csrow#1channel#0: 0 Corrected Errors
mc1: csrow1: mc#1csrow#1channel#1: 64 Corrected Errors
mc1: csrow2: 0 Uncorrected Errors
mc1: csrow2: mc#1csrow#2channel#0: 0 Corrected Errors
mc1: csrow2: mc#1csrow#2channel#1: 0 Corrected Errors
mc1: csrow3: 0 Uncorrected Errors
mc1: csrow3: mc#1csrow#3channel#0: 0 Corrected Errors
mc1: csrow3: mc#1csrow#3channel#1: 0 Corrected Errors
mc2: 0 Uncorrected Errors with no DIMM info
mc2: 0 Corrected Errors with no DIMM info
mc2: csrow0: 0 Uncorrected Errors
mc2: csrow0: mc#2csrow#0channel#0: 0 Corrected Errors
mc2: csrow0: mc#2csrow#0channel#1: 0 Corrected Errors
mc2: csrow1: 0 Uncorrected Errors
mc2: csrow1: mc#2csrow#1channel#0: 0 Corrected Errors
mc2: csrow1: mc#2csrow#1channel#1: 0 Corrected Errors
mc2: csrow2: 0 Uncorrected Errors
mc2: csrow2: mc#2csrow#2channel#0: 0 Corrected Errors
mc2: csrow2: mc#2csrow#2channel#1: 0 Corrected Errors
mc2: csrow3: 0 Uncorrected Errors
mc2: csrow3: mc#2csrow#3channel#0: 0 Corrected Errors
mc2: csrow3: mc#2csrow#3channel#1: 0 Corrected Errors
mc3: 0 Uncorrected Errors with no DIMM info
mc3: 0 Corrected Errors with no DIMM info
mc3: csrow0: 0 Uncorrected Errors
mc3: csrow0: mc#3csrow#0channel#0: 0 Corrected Errors
mc3: csrow0: mc#3csrow#0channel#1: 0 Corrected Errors
mc3: csrow1: 0 Uncorrected Errors
mc3: csrow1: mc#3csrow#1channel#0: 0 Corrected Errors
mc3: csrow1: mc#3csrow#1channel#1: 0 Corrected Errors
mc3: csrow2: 0 Uncorrected Errors
mc3: csrow2: mc#3csrow#2channel#0: 0 Corrected Errors
mc3: csrow2: mc#3csrow#2channel#1: 0 Corrected Errors
mc3: csrow3: 0 Uncorrected Errors
mc3: csrow3: mc#3csrow#3channel#0: 0 Corrected Errors
mc3: csrow3: mc#3csrow#3channel#1: 0 Corrected Errors
```

So this means the module on memory controller 1 (which belongs to CPU 0, as Bulldozer has 2 MCs on one CPU), csrow 1, channel 1 is faulty, right?

If so, I'll buy 2 new DDR3 modules (one as a spare) and swap them.

Thank you for the information!
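With this much output, the one nonzero counter is easy to miss. A small Python sketch like the following picks out the channels with corrected errors and applies the rule of thumb given earlier in the thread. The regular expression is written against the line format shown in this post only, the function names are invented for the example, and treating each (mc, csrow, channel) triple as one DIMM is a simplification, since a dual-rank module can span more than one csrow. Uncorrected-error lines are ignored here.

```python
import re

# Matches per-channel lines such as:
# "mc1: csrow1: mc#1csrow#1channel#1: 64 Corrected Errors"
CHANNEL_RE = re.compile(
    r"mc#(\d+)csrow#(\d+)channel#(\d+):\s+(\d+)\s+Corrected Errors"
)

def nonzero_corrected(report):
    """Return {(mc, csrow, channel): count} for channels with count > 0."""
    hits = {}
    for line in report.splitlines():
        m = CHANNEL_RE.search(line)
        if m:
            mc, csrow, channel, count = map(int, m.groups())
            if count > 0:
                hits[(mc, csrow, channel)] = count
    return hits

def verdict(hits):
    """Rule of thumb from the thread; a heuristic, not a diagnosis."""
    if not hits:
        return "no corrected errors: DIMMs probably fine"
    if len(hits) == 1:
        return "one channel affected: that DIMM is the likely culprit"
    return "several channels affected: suspect mainboard or memory controller"

sample = """\
mc1: csrow1: mc#1csrow#1channel#0: 0 Corrected Errors
mc1: csrow1: mc#1csrow#1channel#1: 64 Corrected Errors
mc1: csrow2: 0 Uncorrected Errors
"""
hits = nonzero_corrected(sample)
print(hits)           # {(1, 1, 1): 64}
print(verdict(hits))  # one channel affected: that DIMM is the likely culprit
```

To run it on a live system, the text could be taken from `edac-util -v` (for example piped in via standard input) instead of the hard-coded `sample`.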

----------

## jpsollie

I ordered and installed a new RAM module, and the errors disappeared. Problem solved.

----------

