# mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: d4000

## slick

During large rsync jobs (more than 5 TB, millions of miscellaneous files) with a lot of I/O, I sometimes (but seldom) get the messages below. What is happening here?

It happens only with rsync, not with cp or mv. The filesystem is ZFS on top of plain dm-crypt.

```
[52925.283857] mce: [Hardware Error]: Machine check events logged
[52925.283862] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: d400008000910091
[52925.283867] mce: [Hardware Error]: TSC 0 ADDR 3e734f1e8
[52925.283872] mce: [Hardware Error]: PROCESSOR 0:406d8 TIME 1557098412 SOCKET 0 APIC 0 microcode 121
[52925.283875] mce: [Hardware Error]: Machine check events logged
[52925.283877] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: d400008000910091
[52925.283879] mce: [Hardware Error]: TSC 0 ADDR 3e734f1e8
[52925.283883] mce: [Hardware Error]: PROCESSOR 0:406d8 TIME 1557098412 SOCKET 0 APIC 2 microcode 121
```

Searching the web, I found some commands to analyse it, but I can't understand what the output is telling me.

```
# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

MCE records summary:
   10 MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error errors
```

```
# ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

MCE events:
1 2019-05-04 08:26:35 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccd309a, cpuid=0x000406d8, bank=0x00000005
2 2019-05-04 08:26:35 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccd309a, cpu=0x00000001, cpuid=0x000406d8, apicid=0x00000002, bank=0x00000005
3 2019-05-04 09:54:47 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccd4546, cpu=0x00000002, cpuid=0x000406d8, apicid=0x00000004, bank=0x00000005
4 2019-05-04 09:54:47 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccd4546, cpu=0x00000003, cpuid=0x000406d8, apicid=0x00000006, bank=0x00000005
5 2019-05-04 12:56:22 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error Error_enabled, n_errors=2, mcgcap=0x00000806, status=0xd400008000910091, addr=0x3e734f1c0, walltime=0x5ccd6fd6, cpu=0x00000001, cpuid=0x000406d8, apicid=0x00000002, bank=0x00000005
6 2019-05-04 13:22:19 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1d0, walltime=0x5ccd75ea, cpu=0x00000001, cpuid=0x000406d8, apicid=0x00000002, bank=0x00000005
7 2019-05-04 15:42:24 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccd96bf, cpu=0x00000002, cpuid=0x000406d8, apicid=0x00000004, bank=0x00000005
8 2019-05-05 11:14:31 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=1, mcgcap=0x00000806, status=0x9400004000910091, addr=0x3e734f1c0, walltime=0x5ccea977, cpu=0x00000004, cpuid=0x000406d8, apicid=0x00000008, bank=0x00000005
9 2019-05-06 01:20:12 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error Error_enabled, n_errors=2, mcgcap=0x00000806, status=0xd400008000910091, addr=0x3e734f1e8, walltime=0x5ccf6fac, cpuid=0x000406d8, bank=0x00000005
10 2019-05-06 01:20:12 +0200 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error Error_enabled, n_errors=2, mcgcap=0x00000806, status=0xd400008000910091, addr=0x3e734f1e8, walltime=0x5ccf6fac, cpu=0x00000001, cpuid=0x000406d8, apicid=0x00000002, bank=0x00000005
```

Is my memory broken, or is this just a notice that ECC corrected an error? (Yes, it's ECC RAM.)

The CPU is:

```
# cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family   : 6
model      : 77
model name   : Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz
stepping   : 8
microcode   : 0x121
cpu MHz      : 2599.865
cache size   : 1024 KB
physical id   : 0
siblings   : 8
core id      : 0
cpu cores   : 8
apicid      : 0
initial apicid   : 0
fpu      : yes
fpu_exception   : yes
cpuid level   : 11
wp      : yes
flags      : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes rdrand lahf_lm 3dnowprefetch cpuid_fault epb tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat
bugs      : cpu_meltdown spectre_v1 spectre_v2
bogomips   : 4799.73
clflush size   : 64
cache_alignment   : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

... 8 Cores
```

----------

## mike155

*Quote:*

> Is my memory broken, or is this just a notice that ECC corrected an error? (Yes, it's ECC RAM.)

Looks like a memory error which was corrected by ECC logic. I would replace the faulty DIMM as soon as possible.

What does edac-util tell you?

```
edac-util -v
```
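
The "corrected by ECC" reading can be checked against the raw `status` values in the log. Below is a minimal Python sketch that splits them using the architectural IA32_MCi_STATUS bit layout described in the Intel SDM. The names `FLAGS`, `MEM_TRANSACTION` and `decode_status` are made up for this illustration, and the model-specific bits are extracted but not interpreted.

```python
# Bit positions follow the architectural IA32_MCi_STATUS layout (Intel SDM).
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten (overflow)
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register valid
    58: "ADDRV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
}

# MMM field of a memory-controller error code (0000 0000 1MMM CCCC).
MEM_TRANSACTION = {0: "generic", 1: "read", 2: "write",
                   3: "address/command", 4: "scrub"}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its main fields."""
    mca_code = status & 0xFFFF
    result = {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory-controller errors match the pattern 0000 0000 1MMM CCCC.
    if mca_code & 0xFF80 == 0x0080:
        result["transaction"] = MEM_TRANSACTION.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        result["channel"] = None if channel == 0xF else channel  # 0xF = unspecified
    return result


# The two status values that appear in the ras-mc-ctl output.
for raw in (0x9400004000910091, 0xD400008000910091):
    print(hex(raw), decode_status(raw))
```

Both values decode with UC clear (so the error was corrected), a memory read on channel 1, and a corrected-error count of 1 and 2 respectively. That agrees with the `Corrected_error`, `RD_CHANNEL1_ERR` and `n_errors` fields printed by ras-mc-ctl; the second value additionally has OVER set, which ras-mc-ctl shows as `Error_overflow`.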

----------

## bunder

Are you overclocking your memory? One thing you could try is turning off XMP in the BIOS.

----------

## slick

*mike155 wrote:*

> *Quote:*
>
> > Is my memory broken, or is this just a notice that ECC corrected an error? (Yes, it's ECC RAM.)
>
> Looks like a memory error which was corrected by ECC logic. I would replace the faulty DIMM as soon as possible.
> ...

How do I identify the broken RAM module? There are 4 installed.

Freshly installed, edac-util reports nothing. Do I have to wait for the next crash first?

```
# edac-util -v
edac-util: Error: No memory controller data found.
```
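
For context on that message: the Linux EDAC core exposes each registered memory controller under `/sys/devices/system/edac/mc/`, and, as far as I know, that is where edac-util gets its data. The sketch below is only an illustration (the function name `read_edac_counts` is made up here); it lists the controllers found there with their corrected and uncorrected counters. An empty result would suggest that no EDAC driver has registered a controller on this machine, which is an assumption about the cause, not something the output above proves.

```python
from pathlib import Path

# Standard sysfs location where the Linux EDAC core exposes memory controllers.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_ROOT) -> dict:
    """Return {controller: (corrected, uncorrected)} for every mc* directory."""
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("no EDAC memory controller registered")
    for name, (ce, ue) in found.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```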

*bunder wrote:*

> Are you overclocking your memory? One thing you could try is turning off XMP in the BIOS.

Not overclocked. Everything is at defaults as far as possible.

----------

## NeddySeagoon

slick,

Boot into memtest86 and run a few cycles.

You must boot into it, as running a memory test through the kernel's memory manager will only tell you that you have a fault, not where.

You need several cycles. The same error at the same address indicates that it's probably a RAM error.

Random errors only tell you that it's a memory subsystem error.
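
The same "same address or random?" check can be applied to the ten records already logged by ras-mc-ctl. The following is a rough Python sketch, assuming the `addr=` values are physical addresses and using the 64-byte line size from `clflush size` in /proc/cpuinfo; `ADDRS` and `group_by_line` are names invented for this example.

```python
from collections import Counter

CACHE_LINE = 64  # from "clflush size : 64" in /proc/cpuinfo

# addr= values of the ten ras-mc-ctl records, in logged order.
ADDRS = [
    0x3E734F1C0, 0x3E734F1C0, 0x3E734F1C0, 0x3E734F1C0, 0x3E734F1C0,
    0x3E734F1D0, 0x3E734F1C0, 0x3E734F1C0, 0x3E734F1E8, 0x3E734F1E8,
]


def group_by_line(addrs, line_size=CACHE_LINE):
    """Count records per cache-line-aligned base address."""
    return Counter(addr - addr % line_size for addr in addrs)


for base, hits in group_by_line(ADDRS).items():
    print(f"line {base:#x}: {hits} record(s)")
```

All ten records fall into the single 64-byte line starting at 0x3e734f1c0, so the logged errors are not scattered. Note that several records share a timestamp and differ only in the reporting CPU, so they are probably the same event logged more than once.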

----------

