# MCE reporting ECC corrected errors, memtest can't find any

## MorpheuS.Ibis

Hi,

one of my machines (a 2-slot Opteron server) is producing a lot of EDAC errors like this one:

```
[Hardware Error]: CPU:0  MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc00c00000080813
[Hardware Error]:        MC4_ADDR: 0x00000004148e14d0
[Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
EDAC amd64 MC0: CE ERROR_ADDRESS= 0x4148e14d0
EDAC MC0: CE page 0x4148e1, offset 0x4d0, grain 0, syndrome 0x1, row 7, channel 0, label "": amd64_edac
[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
```
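For readers who want to relate the numbers in the log to each other, here is a minimal Python sketch. It assumes 4 KiB pages and decodes only the architectural flag bits of the MCA status register (bits 63 down to 57); the AMD-specific fields such as `CECC` and the syndrome are deliberately left out. The helper names `split_address` and `decode_flags` are made up for this illustration and are not part of any kernel or EDAC tool.

```python
# Values copied verbatim from the log above.
MC4_STATUS = 0xDC00C00000080813
MC4_ADDR = 0x00000004148E14D0

PAGE_SHIFT = 12  # assumption: 4 KiB pages

# Architectural MCA status flags and their bit positions in MCi_STATUS.
STATUS_FLAGS = {
    "Val": 63,    # register contains a valid error
    "Over": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error (0 means the error was corrected, "CE")
    "En": 60,     # error reporting was enabled
    "MiscV": 59,  # MISC register is valid
    "AddrV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
}


def split_address(addr, page_shift=PAGE_SHIFT):
    """Split a physical address into (page frame number, offset within page)."""
    return addr >> page_shift, addr & ((1 << page_shift) - 1)


def decode_flags(status):
    """Return a dict mapping each flag name to True/False."""
    return {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}


page, offset = split_address(MC4_ADDR)
flags = decode_flags(MC4_STATUS)
print(f"page {page:#x}, offset {offset:#x}")
print(flags)
```

The page and offset computed this way match the `CE page 0x4148e1, offset 0x4d0` line, and `UC` comes out clear while `Over` is set, which agrees with the `Over|CE` part of the status string.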

weirdly, memtest did not reveal anything, with ECC turned either on or off

then i tried to swap all the modules connected to one CPU with the modules connected to the other CPU, but the messages did not change (the errors are still reported on "node 0")

my conclusion would be that one of the CPUs (or the motherboard) in the system is shot. would that be the correct assessment?

----------

## JanR

Hi,

*Quote:*

> EDAC MC0: CE page 0x4148e1

That's a correctable error, so memtest in ECC mode cannot find it: the data it reads back has already been corrected. With ECC disabled, however, it SHOULD find it, unless the problem is in one of the ECC bits (these are not accessed with ECC disabled).
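To illustrate why a corrected error is invisible to a memory test, here is a toy sketch using a Hamming(7,4) code. This is only an illustration: real ECC DIMMs protect 64 data bits with 8 check bits and the memory controller uses a different, larger code, so the syndrome values here are not comparable to the `syndrome 0x1` in the log. The function names are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def raw_data(codeword):
    """Read the data bits without any checking (like ECC disabled)."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


def hamming74_decode(codeword):
    """Return (corrected data bits, syndrome).

    A non-zero syndrome is the 1-based position of the flipped bit.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 | (s2 << 1) | (s4 << 2)
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return raw_data(c), syndrome


data = [1, 0, 1, 1]
stored = hamming74_encode(data)

# Case 1: a data bit (position 5) flips in "memory".
bad_data_bit = list(stored)
bad_data_bit[4] ^= 1
fixed, syndrome_data = hamming74_decode(bad_data_bit)
print("ECC on :", fixed == data, "syndrome", syndrome_data)
print("ECC off:", raw_data(bad_data_bit) == data)

# Case 2: a check bit (position 1) flips instead.
bad_check_bit = list(stored)
bad_check_bit[0] ^= 1
_, syndrome_check = hamming74_decode(bad_check_bit)
print("check-bit fault, ECC off still reads good data:",
      raw_data(bad_check_bit) == data, "syndrome", syndrome_check)
```

In case 1 the decoder hands back the original data, so a test pattern compare passes and only the syndrome records that something went wrong; with checking disabled the same fault shows up as bad data. In case 2 the data bits are untouched, so a test with ECC disabled sees nothing at all, which is the exception mentioned above.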

I guess you executed memtest for a sufficient time (hours to days)?

As a next step I would swap the CPUs. If the error then moves to the other node, it is the memory controller inside the CPU. If not, it could be the mainboard or (IMO more likely) a DRAM socket (have you checked and cleaned them?).
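To see whether the errors follow the CPU after a swap, it can help to compare the per-controller corrected-error counters before and after, rather than reading the kernel log by eye. Below is a minimal sketch that reads the `ce_count` files the EDAC subsystem exposes under `/sys/devices/system/edac/mc/`. It assumes the EDAC driver is loaded and that each `mcN` directory corresponds to one memory controller (as the `MC0` / `node 0` pairing in the log suggests). The function names are hypothetical helpers, not part of any existing tool.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ce_counts(root=EDAC_ROOT):
    """Return {controller name: corrected-error count} for each mc* directory."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # EDAC not available on this machine
    for mc_dir in sorted(root.glob("mc*")):
        counter = mc_dir / "ce_count"
        if counter.is_file():
            counts[mc_dir.name] = int(counter.read_text().strip())
    return counts


def controllers_with_new_errors(before, after):
    """List controllers whose corrected-error count grew between two snapshots."""
    return [mc for mc, n in after.items() if n > before.get(mc, 0)]


# Prints an empty dict on machines without EDAC support.
print(read_ce_counts())
```

Taking one snapshot, letting the machine run under load for a while, and taking a second snapshot gives two dictionaries to pass to `controllers_with_new_errors`. If the controller that accumulates errors changes after the CPU swap, that points at the CPU; if it stays the same, the board or socket is the more likely suspect.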

Greetings,

Jan

----------

## MorpheuS.Ibis

AFAIK, memtest has a feature to detect corrected ECC errors. I left it running for several complete tests (~2 days) with this feature both enabled and disabled.

I have not tried running memtest with ECC disabled in the BIOS, nor swapping the CPUs, nor using different DIMM slots (only half are populated, so that would be easy). I'll try that the next chance I get.

thanks for the tips

----------

## JanR

Hi,

such problems are really hard to track down.

We once had an old machine (dual P3 1400 in an IBM eServer) that crashed the hard way (complete destruction of the file system). I used that opportunity to move it to Gentoo (it was Debian before) and restored the backup of our data. The machine was fine for some weeks, then it crashed again. Next I moved the disks into a spare machine of the same type (only with less memory and only one CPU) and started testing the faulty one.

Memtest found errors after approx. 5 days of running (!).

Next I swapped the memory (we had a third machine of this type, so I could use parts from there) - the problem stayed (again after several days of memtest).

Then the CPUs (although the P3 has no on-chip MC, there could be a problem with the cache) - the problem stayed.

The CPUs and memory all tested good in the third machine.

Altogether this was two months of testing with the final conclusion "something on the chipset or mainboard". The machine was 8 years old at that time, so this was more of academic interest than really needed...

Greetings,

Jan

----------

