# Hardware Error?

## Zepp

My computer crashed several times today, in both Windows and Linux, so I was already suspecting a hardware error. Well, it crashed again in Linux, and when I tried to shut it down and bring it back up I got this:

```
...
HARDWARE ERROR
CPU 0: Machine Check Exception         4 Bank 4: b200000000070f0f
TSC 1b51cc9213
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel Panic - not syncing: Machine check
```

Umm, what is broken? Did my computer just go up in a ball of flames?

----------

## Akkara

It is possible the machine just died.

But usually there's a cause.

Was there anything different going on than usual?  Warmer than usual?  A lot more humid?  Power spike?  Recently changed any hardware?  Bumped into the case?  Dropped something into the case?  Ventilation still OK?  Dust accumulation?

Make sure all the connectors are fully inserted and try wiggling them.  Try re-seating the RAM, and if it still machine-checks, reseating the processor.

----------

## Zepp

 *Akkara wrote:*   

> It is possible the machine just died.
> 
> But usually there's a cause.
> 
> Was there anything different going on than usual?  Warmer than usual?  A lot more humid?  Power spike?  Recently changed any hardware?  Bumped into the case?  Dropped something into the case?  Ventilation still OK?  Dust accumulation?
> ...

 

It's been a bit warmer but nothing significant, and it's connected to a UPS so I hope it wasn't a surge. I haven't changed anything, and there were no big bumps or anything really.

I ran memtest86 last night; it had completed 6 passes when I got up and is still going, with no failures.

----------

## Zepp

Hmm, so I am trying just the second stick of RAM now and it hasn't MCE'd yet; it's been about 30 minutes. What are the odds it passed memtest86 7 full passes but still had a bad stick of memory?

----------

## Akkara

 *Quote:*   

> What are the odds it passed memtest86 7 full passes but still had a bad stick of memory?

 

I've seen that happen.  Try the user-level memory tester, memtester (it's in portage).  Run it as root, and have it test all but a few hundred MB of RAM (leave out roughly whatever a freshly booted system plus X, or whatever else you run, ends up using).
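
A rough sketch of how that size could be worked out, in Python. It only builds the command line; it does not run memtester. The 300 MB default headroom and the function name are assumptions for illustration, and the `memtester <size>M <loops>` form is the usual invocation as far as I know, so check `man memtester` on your system.

```python
import re

def memtester_command(meminfo_text, headroom_mb=300, loops=1):
    """Build a memtester command that leaves `headroom_mb` of RAM untested.

    `headroom_mb` is a guess at what the running system needs; adjust it.
    """
    # /proc/meminfo reports sizes in kB, e.g. "MemTotal:  2097152 kB"
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found in meminfo text")
    total_mb = int(match.group(1)) // 1024
    test_mb = total_mb - headroom_mb
    if test_mb <= 0:
        raise ValueError("headroom is larger than total memory")
    return f"memtester {test_mb}M {loops}"

# Made-up sample for a 2 GB machine; on a real box, pass in the
# contents of /proc/meminfo instead.
sample = "MemTotal:        2097152 kB\nMemFree:          512000 kB\n"
print(memtester_command(sample, headroom_mb=300, loops=2))
```

If memtester refuses to lock that much memory, increasing the headroom is the first thing to try.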

Edit/addendum: also it might not be a bad stick.  Are you overclocking anything?  Try turning it down some.

Or mobo capacitors might be starting to age and the Vcore or Vmemory regulator could be getting iffy.

----------

## Zepp

 *Akkara wrote:*   

>  *Quote:*   What are the odds it passed memtest86 7 full passes but still had a bad stick of memory? 
> 
> I've seen that happen.  Try the user-level memory tester, memtester (it's in portage).  Run it as root, and have it test all but a few hundred MB of RAM (leave out roughly whatever a freshly booted system plus X, or whatever else you run, ends up using).
> 
> Edit/addendum: also it might not be a bad stick.  Are you overclocking anything?  Try turning it down some.
> ...

 

I don't overclock anything.

----------

## eccerr0r

Just for reference, a machine check exception is raised when the processor finds itself in a state it cannot recover from. A lot of the time it is due to an error in the cache tags or in one of the parity-protected processor state tables on the CPU.  Usually this happens due to bad power or cooling, a cosmic ray strike, CPU failure (overclocking falls into this bucket), or *really* poorly written software.  You should definitely try to rule out what you can; when an MCE occurs, the machine is usually stuck in a state where you can't do much post-mortem analysis without special tools.

I'd definitely start out by checking power supply, and motherboard capacitors... how old is the m/b?

Usually memory without ECC or parity protection would not produce an MCE, as the hardware would have no way to determine whether a fatal bit flip occurred.  However, there may be a chipset flag that tells the CPU to take a machine check, depending on how the CPU was programmed to behave. This is kind of iffy, though: on a commodity machine not much is protected from random bit flips, and one of those could propagate into the CPU and surface as a fatal error.
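
To make the status word from the first post (`b200000000070f0f`) a little less opaque, here is a minimal Python sketch that splits it into the flag bits that the x86 machine-check architecture defines for a bank status register. The function name is made up for illustration, and the upper 16 bits of the low word are only reported raw, since their meaning depends on the CPU vendor and model. It is not a substitute for running the log through mcelog.

```python
def decode_mci_status(status):
    """Split a 64-bit machine-check bank status value into its common fields."""
    flag_bits = {
        "VAL": 63,    # register holds a valid error
        "OVER": 62,   # an earlier error was overwritten
        "UC": 61,     # error was not corrected by hardware
        "EN": 60,     # reporting of this error was enabled
        "MISCV": 59,  # the bank's MISC register holds extra information
        "ADDRV": 58,  # the bank's ADDR register holds an address
        "PCC": 57,    # processor context may be corrupt
    }
    flags = {name: bool((status >> bit) & 1) for name, bit in flag_bits.items()}
    error_code = status & 0xFFFF             # bits 15:0, MCA error code
    vendor_code = (status >> 16) & 0xFFFF    # bits 31:16, vendor/model specific
    return flags, error_code, vendor_code

flags, error_code, vendor_code = decode_mci_status(0xB200000000070F0F)
print("flags set:", [name for name, is_set in flags.items() if is_set])
print("MCA error code:", hex(error_code))
print("vendor-specific code:", hex(vendor_code))
```

For this value the set flags come out as VAL, UC, EN and PCC, i.e. a valid, uncorrected error with possibly corrupt processor context, which fits the kernel panicking instead of logging and carrying on. It does not by itself say which component is at fault.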

----------

## Zepp

 *eccerr0r wrote:*   

> Just for reference, a machine check exception is raised when the processor finds itself in a state it cannot recover from. A lot of the time it is due to an error in the cache tags or in one of the parity-protected processor state tables on the CPU.  Usually this happens due to bad power or cooling, a cosmic ray strike, CPU failure (overclocking falls into this bucket), or *really* poorly written software.  You should definitely try to rule out what you can; when an MCE occurs, the machine is usually stuck in a state where you can't do much post-mortem analysis without special tools.
> 
> I'd definitely start out by checking power supply, and motherboard capacitors... how old is the m/b?
> 
> Usually memory without ECC or parity protection would not produce an MCE, as the hardware would have no way to determine whether a fatal bit flip occurred.  However, there may be a chipset flag that tells the CPU to take a machine check, depending on how the CPU was programmed to behave. This is kind of iffy, though: on a commodity machine not much is protected from random bit flips, and one of those could propagate into the CPU and surface as a fatal error.

 

The entire computer was purchased in April 2006. I looked at the capacitors on the motherboard and didn't notice any that looked damaged, but beyond that I am not sure how to test whether it is the motherboard or the PSU.

----------

