# [SOLVED] Machine Check Exception on new Opteron server

## humbletech99

I set up a new amd64 Gentoo server yesterday on an Opteron, but within a few hours of it being up I got a "Machine Check Exception" and the thing froze up. I had to go to the local console to see this and then had to hard reboot the machine. It wasn't really doing much at the time other than compiling a couple of things. The server is a dual-CPU dual-core machine (4 cores in total) with 8GB RAM and 12 SCSI disks + 2 SATA disks for the OS.

The error from the console is below:

```
HARDWARE ERROR

CPU 2: Machine Check Exception:                                    4 Bank 4:  f615200133000813

TSC 5ac60e50b6a ADDR 1d251ec00

This is not a software problem!

Run through mcelog --ascii to decode and contact your hardware vendor

Kernel panic - not syncing: Machine check
```

I have been googling around since yesterday but haven't found anything conclusive.

I've tried running mcelog and got the following:

```
# mcelog --k8 /dev/mcelog

MCE 0

HARDWARE ERROR. This is *NOT* a software problem!

Please contact your hardware vendor

CPU 0 4 northbridge TSC a4d0cd72d5a8

ADDR 23c400000

  Northbridge GART error

       bit61 = error uncorrected

  TLB error 'generic transaction, level generic'

STATUS a40000000005001b MCGSTATUS 0

MCE 1

HARDWARE ERROR. This is *NOT* a software problem!

Please contact your hardware vendor

CPU 0 4 northbridge TSC a56b2eba7649

ADDR 23c400000

  Northbridge GART error

       bit61 = error uncorrected

  TLB error 'generic transaction, level generic'

STATUS a40000000005001b MCGSTATUS 0

MCE 2

HARDWARE ERROR. This is *NOT* a software problem!

Please contact your hardware vendor

CPU 0 4 northbridge TSC a60591585bda

ADDR 23c400000

  Northbridge GART error

       bit61 = error uncorrected

  TLB error 'generic transaction, level generic'

STATUS a40000000005001b MCGSTATUS 0

MCE 3

HARDWARE ERROR. This is *NOT* a software problem!

Please contact your hardware vendor

CPU 0 4 northbridge TSC a69ff2a635e8

ADDR 23c400000

  Northbridge GART error

       bit61 = error uncorrected

  TLB error 'generic transaction, level generic'

STATUS a40000000005001b MCGSTATUS 0

MCE 4

HARDWARE ERROR. This is *NOT* a software problem!

Please contact your hardware vendor

CPU 0 4 northbridge TSC a73a53f42ca9

ADDR 23c400000

  Northbridge GART error

       bit61 = error uncorrected

  TLB error 'generic transaction, level generic'

STATUS a40000000005001b MCGSTATUS 0

MCE 5

HARDWARE ERROR. This is *NOT* a software problem!

Please contact your hardware vendor

CPU 0 4 northbridge TSC a7d4b6934fdf

ADDR 23c400000

  Northbridge GART error

       bit61 = error uncorrected

  TLB error 'generic transaction, level generic'

STATUS a40000000005001b MCGSTATUS 0

MCE 6

HARDWARE ERROR. This is *NOT* a software problem!

Please contact your hardware vendor

CPU 2 4 northbridge TSC a86f17e0a6a8

ADDR 191b0b000

  Northbridge Chipkill ECC error

  Chipkill ECC syndrome = c12f

       bit46 = corrected ecc error

       bit62 = error overflow (multiple errors)

  bus error 'local node response, request didn't time out

      generic read mem transaction

      memory access, level generic'

STATUS d417c000c1080a13 MCGSTATUS 0

MCE 7

HARDWARE ERROR. This is *NOT* a software problem!

Please contact your hardware vendor

CPU 0 4 northbridge TSC a86f17e0c311

ADDR 23c400000

  Northbridge GART error

       bit61 = error uncorrected

  TLB error 'generic transaction, level generic'

STATUS a40000000005001b MCGSTATUS 0
```
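For anyone who wants to check the decode by hand, here is a minimal Python sketch that pulls the flag bits named in the output above (bit46, bit61, bit62) out of a raw STATUS value. The function name `decode_nb_status` is made up for illustration. The valid bit (63), the uncorrected-ECC bit (45) and the syndrome layout are assumptions about the K8 bank 4 register, not something the output states; the syndrome layout at least reproduces the `c12f` that mcelog printed. The syndrome field only means something for ECC errors.

```python
from typing import Dict


def decode_nb_status(status: int) -> Dict[str, object]:
    """Extract a few flags and the ECC syndrome from a bank 4 STATUS value."""

    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    # Assumed layout: high syndrome byte in bits 31..24, low byte in bits 54..47.
    syndrome = (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF)
    return {
        "valid": bit(63),            # assumption: architectural VAL bit
        "overflow": bit(62),         # "bit62 = error overflow" in mcelog
        "uncorrected": bit(61),      # "bit61 = error uncorrected" in mcelog
        "corrected_ecc": bit(46),    # "bit46 = corrected ecc error" in mcelog
        "uncorrected_ecc": bit(45),  # assumption: K8 northbridge UECC bit
        "syndrome": f"{syndrome:04x}",
    }


# STATUS values copied from the mcelog output and the console panic above.
for raw in ("d417c000c1080a13", "a40000000005001b", "f615200133000813"):
    print(raw, decode_nb_status(int(raw, 16)))
```

If the bit 45 assumption holds, the value from the console panic would be an uncorrected ECC error rather than a GART error, which would point at memory for the freeze itself.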

Does anybody know anything about this?

----------

## Keruskerfuerst

What exact type of AMD processor and mainboard do you have?

----------

## humbletech99

The full specs from the purchase order say 

Processor     :    Dual AMD Opteron 275 2.2 GHz DUAL CORE, s940 1MB cache 64 bit (2 way)95watt 

Motherboard :    Tyan K8SRE,S2892G3NR,nForce4,1xPCI-e x16,1xPCI-e x4, 4 3xPCI-X,S-ATAII Raid, 2xGB, 8xDimm 

```
 # cat /proc/cpuinfo

processor       : 0

vendor_id       : AuthenticAMD

cpu family      : 15

model           : 33

model name      : Dual Core AMD Opteron(tm) Processor 275

stepping        : 2

cpu MHz         : 2200.000

cache size      : 1024 KB

physical id     : 0

siblings        : 2

core id         : 0

cpu cores       : 2

fpu             : yes

fpu_exception   : yes

cpuid level     : 1

wp              : yes

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy

bogomips        : 4422.95

TLB size        : 1024 4K pages

clflush size    : 64

cache_alignment : 64

address sizes   : 40 bits physical, 48 bits virtual

power management: ts fid vid ttp

processor       : 1

vendor_id       : AuthenticAMD

cpu family      : 15

model           : 33

model name      : Dual Core AMD Opteron(tm) Processor 275

stepping        : 2

cpu MHz         : 2200.000

cache size      : 1024 KB

physical id     : 0

siblings        : 2

core id         : 1

cpu cores       : 2

fpu             : yes

fpu_exception   : yes

cpuid level     : 1

wp              : yes

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy

bogomips        : 4420.51

TLB size        : 1024 4K pages

clflush size    : 64

cache_alignment : 64

address sizes   : 40 bits physical, 48 bits virtual

power management: ts fid vid ttp

processor       : 2

vendor_id       : AuthenticAMD

cpu family      : 15

model           : 33

model name      : Dual Core AMD Opteron(tm) Processor 275

stepping        : 2

cpu MHz         : 2200.000

cache size      : 1024 KB

physical id     : 1

siblings        : 2

core id         : 0

cpu cores       : 2

fpu             : yes

fpu_exception   : yes

cpuid level     : 1

wp              : yes

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy

bogomips        : 4420.53

TLB size        : 1024 4K pages

clflush size    : 64

cache_alignment : 64

address sizes   : 40 bits physical, 48 bits virtual

power management: ts fid vid ttp

processor       : 3

vendor_id       : AuthenticAMD

cpu family      : 15

model           : 33

model name      : Dual Core AMD Opteron(tm) Processor 275

stepping        : 2

cpu MHz         : 2200.000

cache size      : 1024 KB

physical id     : 1

siblings        : 2

core id         : 1

cpu cores       : 2

fpu             : yes

fpu_exception   : yes

cpuid level     : 1

wp              : yes

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy

bogomips        : 4420.41

TLB size        : 1024 4K pages

clflush size    : 64

cache_alignment : 64

address sizes   : 40 bits physical, 48 bits virtual

power management: ts fid vid ttp
```
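Since mcelog reports errors by logical CPU number, it can help to map those numbers back to sockets. Below is a small Python sketch that does this from `/proc/cpuinfo` text; the function name `cpu_topology` and the trimmed `SAMPLE` string are illustrative only, with the values taken from the listing above. It assumes every processor entry has `physical id` and `core id` fields, as this machine's does.

```python
from typing import Dict, List, Tuple

# Trimmed copy of the fields that matter from the listing above.
SAMPLE = """\
processor       : 0
physical id     : 0
core id         : 0
processor       : 1
physical id     : 0
core id         : 1
processor       : 2
physical id     : 1
core id         : 0
processor       : 3
physical id     : 1
core id         : 1
"""


def cpu_topology(cpuinfo_text: str) -> Dict[int, Tuple[int, int]]:
    """Map logical processor number -> (physical id, core id)."""
    records: List[Dict[str, str]] = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or anything without a "key : value" shape
        key, value = key.strip(), value.strip()
        if key == "processor":
            records.append({})  # a "processor" line starts a new entry
        if records:
            records[-1][key] = value
    return {
        int(r["processor"]): (int(r["physical id"]), int(r["core id"]))
        for r in records
    }


if __name__ == "__main__":
    for cpu, (socket, core) in sorted(cpu_topology(SAMPLE).items()):
        print(f"CPU {cpu}: socket {socket}, core {core}")
```

On this box, CPU 2 (where the Chipkill ECC error was logged) comes out as core 0 of the second socket. On a live system the same function can be fed the contents of `/proc/cpuinfo` directly.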

```
# lspci

00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)

00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)

00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)

00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)

00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)

00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)

00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)

00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)

00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)

00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)

00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)

00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration

00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map

00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller

00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control

00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration

00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map

00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller

00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control

01:06.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)

01:08.0 Ethernet controller: Intel Corporation 82557/8/9 [Ethernet Pro 100] (rev 10)

08:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)

08:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)

08:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)

08:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)

09:02.0 RAID bus controller: 3ware Inc 7xxx/8xxx-series PATA/SATA-RAID (rev 01)

09:03.0 PCI bridge: IBM PCI-X to PCI-X Bridge (rev 03)

0a:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID (rev 02)

0b:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)

0b:09.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
```

----------

## Keruskerfuerst

 Northbridge Chipkill ECC error

  Chipkill ECC syndrome = c12f

       bit46 = corrected ecc error

       bit62 = error overflow (multiple errors)

  bus error 'local node response, request didn't time out

      generic read mem transaction

      memory access, level generic'

STATUS d417c000c1080a13 MCGSTATUS 0

MCE 7 

memory error

MCE 0

HARDWARE ERROR. This is *NOT* a software problem!

Please contact your hardware vendor

CPU 0 4 northbridge TSC a4d0cd72d5a8

ADDR 23c400000

  Northbridge GART error

       bit61 = error uncorrected

  TLB error 'generic transaction, level generic'

STATUS a40000000005001b MCGSTATUS 0 

mainboard error

HARDWARE ERROR. This is *NOT* a software problem!

Please contact your hardware vendor

CPU 2 4 northbridge TSC a86f17e0a6a8

ADDR 191b0b000

  Northbridge Chipkill ECC error

  Chipkill ECC syndrome = c12f

       bit46 = corrected ecc error

       bit62 = error overflow (multiple errors)

  bus error 'local node response, request didn't time out

      generic read mem transaction

      memory access, level generic'

STATUS d417c000c1080a13 MCGSTATUS 0 

memory error

I think you should begin by replacing the mainboard, then check the memory and, if necessary, replace the modules.

After that, there are also the CPUs.

Maybe the power supply is defective. <--- you should check this first

Last edited by Keruskerfuerst on Fri Sep 22, 2006 2:16 pm; edited 2 times in total
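The split above between memory errors and GART errors can also be tallied automatically when the log gets long. This is a rough Python sketch that parses text in the same shape as the `mcelog --k8` output earlier in the thread and counts records per CPU and error type. The function name `parse_mcelog` and the shortened `SAMPLE` are illustrative, and the parsing rules are guessed from the output shown here, so other mcelog versions may need different patterns.

```python
import re
from collections import Counter
from typing import Dict, List

# Two records shortened from the output posted earlier in the thread.
SAMPLE = """\
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
CPU 0 4 northbridge TSC a4d0cd72d5a8
ADDR 23c400000
  Northbridge GART error
       bit61 = error uncorrected
STATUS a40000000005001b MCGSTATUS 0
MCE 6
CPU 2 4 northbridge TSC a86f17e0a6a8
ADDR 191b0b000
  Northbridge Chipkill ECC error
       bit46 = corrected ecc error
STATUS d417c000c1080a13 MCGSTATUS 0
"""


def parse_mcelog(text: str) -> List[Dict[str, str]]:
    """Split decoded mcelog text into one dict per 'MCE n' record."""
    records: List[Dict[str, str]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if re.fullmatch(r"MCE \d+", line):
            records.append({})  # each "MCE n" header starts a new record
            continue
        if not records:
            continue  # ignore anything before the first record
        rec = records[-1]
        cpu = re.match(r"CPU (\d+) ", line)
        if cpu:
            rec["cpu"] = cpu.group(1)
        elif line.startswith("ADDR "):
            rec["addr"] = line.split()[1]
        elif line.startswith("Northbridge "):
            rec["kind"] = line[len("Northbridge "):]
        elif line.startswith("STATUS "):
            rec["status"] = line.split()[1]
    return records


if __name__ == "__main__":
    tally = Counter(
        (r.get("cpu", "?"), r.get("kind", "unknown")) for r in parse_mcelog(SAMPLE)
    )
    for (cpu, kind), count in sorted(tally.items()):
        print(f"CPU {cpu}: {kind} x{count}")
```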

----------

## humbletech99

oh come on! The memory and the mobo can't both be defective, it's a new machine. I'd bet the mobo is slightly defective instead, but I've read messages on a kernel mailing list about a guy who changed his mobo with another one of the exact same model and the same thing happened, it only stopped when he changed to a different brand of mobo, which would seem to indicate a subtle defect in design.

I haven't had this problem a second time despite crunching all the disks simultaneously and compiling a fair amount of software as well...

perhaps it was a one-off fluke and I'll be ok....

or perhaps that's wishful thinking. This is supposed to be an important server when it goes into production (which should be any day now)

EDIT: actually, the GART to which the error refers was using a virtual IOMMU, since I hadn't switched the IOMMU function on in the BIOS. I had something in the log complaining about this, so I switched on the IOMMU today (see below).

```
kern-warning   2006-09-21 14:34:34   Checking aperture...

kern-warning   2006-09-21 14:34:34   CPU 0: aperture @ 0 size 32 MB

kern-warning   2006-09-21 14:34:34   No AGP bridge found

kern-warning   2006-09-21 14:34:34   Your BIOS doesn't leave a aperture memory hole

kern-warning   2006-09-21 14:34:34   Please enable the IOMMU option in the BIOS setup

kern-warning   2006-09-21 14:34:34   This costs you 64 MB of RAM

kern-warning   2006-09-21 14:34:34   Mapping aperture over 65536 KB of RAM @ 4000000
```

Therefore the error must have occurred entirely in RAM. So hopefully if there is a real hardware problem then it will be in the RAM which is more easily replaced.

I am going to leave it running memtest86 over the weekend to try to see if the memory is ok.

----------

## Keruskerfuerst

I had a mainboard in my computer, which was defective from the beginning.

----------

## feld

 *humbletech99 wrote:*   

> oh come on! The memory and the mobo can't both be defective, it's a new machine. 

 

I had a bad stick of ram in my first batch when I built my Opteron machine. I've had situations where both were bad, too. It happens more than you might think.

----------

## humbletech99

yeah I know, I've had loads of hardware problems both at home and at work over time. anyway, back to this stupid machine check exception. I will run memtest86 this weekend and feed the results back to the hardware supplier. I think I will have to have them come round to replace the ram at least and possibly one processor.

----------

## OldTango

 *humbletech99 wrote:*   

> yeah I know, I've had loads of hardware problems both at home and at work over time. anyway, back to this stupid machine check exception. I will run memtest86 this weekend and feed the results back to the hardware supplier. I think I will have to have them come round to replace the ram at least and possibly one processor.

I have an older Tyan Tiger S2875 mobo with dual Opteron 246s on it. I get the exact same memory errors as you are receiving. I have run memtest on this system for 2 days running different tests, with ECC off and ECC on. With ECC off, absolutely zero errors were reported. With ECC on, 2 errors were reported on the 3rd time through. Both errors were on bank0 and both were corrected. They were not reported again on 6 more consecutive passes.

I assume you have solved the GART errors?

As for the memory errors, it is very possible you have some heat issues, which is what is happening in my case. When I removed the side cover of my PC the errors disappeared. I took a few steps to improve cooling and that has helped a great deal with these errors. I only receive them now when the system has been on for a few hours and is being loaded heavily; however, the system never locks or crashes as a result of these errors.

My mobo is poorly designed and the CPUs sit too close together and too close to the bank0 RAM slot, making it difficult to get CPU coolers that will fit and do their job. This is where most of the heat is generated.

A poor or bad power supply can also cause these errors.

This is a guess on my part, but the first 2 items on many checklists for these errors are heat and power.

----------

## humbletech99

Actually it was due to the RAM, which is what I first suspected. Memtest didn't show any errors. I could only force the issue under heavy load. After changing the memory the issue disappeared and hasn't recurred for the last 2 months, so I think it's safe to say that was the problem.

We changed the memory for a different brand as well.

----------

