# Not sure what's broken (probably hardware)

## ziller

(mods feel free to move this to otw if necessary)

I'm having some issues with my new hardware and I'm not quite sure what's broken. Maybe some of you are more familiar with this kind of stuff and can easily point me where to go next...

Anyway, I recently put together a new box (rather simple, nothing fancy) and it ran okay for about a month or so. Then I decided to double its ram (from 4gb to 8gb), replace my HD with an SSD and go for the 2.6.30 kernel, all at once. Did that and it ran smoothly for about a week, until I one day discovered the system totally frozen, screens blank, nothing worked. So I rebooted, it ran for about a day, died again and wouldn't boot properly anymore. All I got was

"Bad page state in process 'swapper'" 

messages all over the place. Thought it was my kernel, so I booted back to the .29, that used to work okay, same thing.

So I booted to memtest86+, which reported about 2 million (no kidding) errors within 10 seconds and then died :/

Thought it was my RAM, so I unplugged the computer and started pulling out ram blocks to find out which one was dead, booting to memtest86+ each time. Curiously, every time I had completely shut down the machine and then re-run the test, I could never reproduce the memory errors! Only after the box had been running for a while did it start to fail... In the end I ended up pulling out some random blocks and it ran again, sort of.

This was a week ago. This weekend, it started failing in brand new ways:

- firefox segfaulted thrice, openoffice crashed and pulled down the whole box with it

- kernel complained about having run out of memory (despite 8gb ram and 2gb swap) and processes started dying

- x died and i couldn't log in at tty anymore (was greeted with a call trace and then thrown out again)
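
To get a rough picture of how often each of these symptoms shows up, the kernel log can be scanned for their signatures. Below is a minimal Python sketch. The search strings are assumptions based on the messages quoted in this thread, not a complete list of kernel error formats, and `count_signatures` is a hypothetical helper name.

```python
import re
from collections import Counter

# Substrings to look for. These are assumptions taken from the messages
# quoted in this thread, not an exhaustive list of kernel error formats.
SIGNATURES = {
    "bad_page": re.compile(r"bad page state", re.IGNORECASE),
    "oom": re.compile(r"out of memory", re.IGNORECASE),
    "segfault": re.compile(r"segfault", re.IGNORECASE),
    "panic": re.compile(r"kernel panic", re.IGNORECASE),
}


def count_signatures(log_text):
    """Count how many log lines match each signature."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Feed it the contents of whatever file your syslog daemon writes kernel messages to (the location depends on your setup). A rising count of "Bad page state" lines between crashes would suggest memory corruption starts well before the final freeze.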

So I tried various things:

- cranked up my fans (maybe something was overheating?)

- tried a dozen different ram combinations, even running with one block only

- underclocked my ram and reverted back to the configuration i had when i first built the thing

Every time memtest86+ reported errors at first reboot after a crash but was okay after pulling the plug.

And it ran for about 24 hours until I was greeted with a dead machine this morning. Tried to reboot it quickly, but all I got was a kernel panic "not syncing [something] pid 1: swapper tainted [something] [call trace]" and then nothing. Shut it down and left for work.

My next approach is to try with another ram vendor, if that doesn't work I don't really know what to do. Ship some parts back and claim they're faulty? Not even sure if the problem is ram, mobo or cpu!

Anyone got any ideas or recommendations?

----------

## snIP3r

hi!

the problem you described can have numerous causes, but the cpu and the motherboard are two hot candidates. it can also be the power supply. i would suggest swapping the motherboard first and then the cpu, if that is possible for you. it would also be a good idea to monitor the power supply's voltages.

HTH

snIP3r

----------

## ziller

*sigh*, yeah that's what I was afraid it might be. I'll probably start with getting a new mobo, since it's cheaper than the cpu. Just annoying that I can't really send anything back and demand a refund, since I don't know what's broken.

I've been monitoring the voltages, actually, but haven't spotted any deviation. Bit difficult to know for sure, though, since the system becomes completely unresponsive at the time of a crash...
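
One way around the "system is dead at the time of the crash" problem is to log readings continuously and force every line to disk, so the last values before a freeze survive. Below is a minimal Python sketch. It assumes the sensors are exposed through the kernel's hwmon sysfs files (`in*_input` in millivolts, `temp*_input` in millidegrees Celsius). The directory layout differs between kernels and boards, so the glob patterns are a guess, and `read_hwmon` and `log_once` are hypothetical names.

```python
import glob
import os
import time


def read_hwmon(base="/sys/class/hwmon"):
    """Return {path: raw_value} for voltage and temperature inputs.

    Raw values are millivolts (in*_input) and millidegrees C (temp*_input).
    """
    readings = {}
    # Assumed layouts: files directly under hwmonN/ or under hwmonN/device/.
    patterns = (
        "hwmon*/in*_input",
        "hwmon*/temp*_input",
        "hwmon*/device/in*_input",
        "hwmon*/device/temp*_input",
    )
    for pattern in patterns:
        for path in sorted(glob.glob(os.path.join(base, pattern))):
            try:
                with open(path) as fh:
                    readings[path] = int(fh.read().strip())
            except (OSError, ValueError):
                continue  # sensor not readable right now, skip it
    return readings


def log_once(logfile, base="/sys/class/hwmon"):
    """Append one timestamped line of readings and force it to disk."""
    readings = read_hwmon(base)
    fields = " ".join(
        f"{os.path.relpath(path, base)}={value}"
        for path, value in readings.items()
    )
    with open(logfile, "a") as fh:
        fh.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {fields}\n")
        fh.flush()
        os.fsync(fh.fileno())  # make sure the line survives a hard freeze
    return readings
```

Calling `log_once` in a loop with a short `time.sleep` between calls gives a trail to read after the next crash. If a rail sags or a temperature climbs in the last minutes before the log stops, that narrows things down.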

----------

## MaximeG

Hi,

When memtest reports errors, the first thing to check is your RAM (which you've done), then the MB and CPU. After that, the HDD and other components should be checked, in that order.

I'd recommend checking these components before replacing them. If you don't have spare hardware, or a friend with spare hardware, it's worth the money to ask a local store to have a look at it. Moreover, these stores often won't charge for the check if you buy the replacement parts from them.

Regards,

Maxime

----------

## pdr

From what I've read the power draw difference between an SSD and HD isn't much, and neither is the power for 2 more sticks of ram, so since it ran OK for a month I doubt it is the power supply (unless coincidentally it is starting to go bad).

Make sure you didn't knock loose either the CPU or chipset heat sinks, and make sure their fans are running.

The memory swapper seems to be the one flipping out - try disabling swap in /etc/fstab - maybe the kernel does not like swapping to the new drive (might be the way SSDs move stuff around to even out the wear).
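
For reference, disabling swap in /etc/fstab just means commenting out the line whose filesystem type (third field) is `swap`. `swapoff -a` turns it off on the running system. A small Python sketch of the fstab edit, operating on the file's text so nothing is touched until you write it back yourself; `comment_out_swap` is a hypothetical helper name:

```python
def comment_out_swap(fstab_text):
    """Return fstab text with every active swap entry commented out."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # fstab fields: device, mount point, fs type, options, dump, pass
        is_active = bool(fields) and not fields[0].startswith("#")
        if is_active and len(fields) >= 3 and fields[2] == "swap":
            out.append("#" + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Lines that are already commented out are left alone, so running it twice gives the same result.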

----------

## miike.f

Hi,

I had similar problems (random freezes, blank screen when booting) with my box and upgrading the BIOS completely solved them. BIOS problems are quite rare but some odd hardware configurations might trigger them. So try flashing your BIOS before you tear the system apart. If you aren't dual booting you probably need to create a custom freedos boot disc or USB stick to be able to flash the BIOS as the flashing programs usually require DOS or Windows.

//miike

----------

## MaximeG

Mm, dunno.

Have you moved your stuff from HDD to SSD without really telling Gentoo?

Have you tried with another distribution/OS?

Regards,

Maxime

----------

## ziller

 *pdr wrote:*   

> From what I've read the power draw difference between an SSD and HD isn't much, and neither is the power for 2 more sticks of ram, so since it ran OK for a month I doubt it is the power supply (unless coincidentally it is starting to go bad).
> 
> Make sure you didn't knock loose either the CPU or chipset heat sinks, and make sure their fans are running.
> 
> The memory swapper seems to be the one flipping out - try disabling swap in /etc/fstab - maybe the kernel does not like swapping to the new drive (might be the way SSDs move stuff around to even out the wear).

 

About power consumption - that's what I thought as well.

About heatsinks... everything appears to be okay. My system's passive anyway, only got a couple of low-spinning case fans so nothing to fail there. And unless my sensors are completely lying, there are no temperature variations in cpu (stable at about +32), system (at about +42) or gpu (at about +60).

Disabling the swap was the first thing I tried (thought kernel might not like my SSD at all), didn't help. In the end I put it back since I saw a couple of "running out of memory" messages...

 *miike.f wrote:*   

> I had similar problems (random freezes, blank screen when booting) with my box and upgrading the BIOS completely solved them. BIOS problems are quite rare but some odd hardware configurations might trigger them. So try flashing your BIOS before you tear the system apart. If you aren't dual booting you probably need to create a custom freedos boot disc or USB stick to be able to flash the BIOS as the flashing programs usually require DOS or Windows. 

 

Good idea, will do that. 

I think my Asus board actually has some fancy ez-flash thingy that lets me flash it from a USB stick... think I'll try how that works.

----------

## ziller

 *MaximeG wrote:*   

> Mm, dunno.
> 
> Have you moved your stuff from HDD to SSD without really telling Gentoo?
> 
> Have you tried with another distribution/OS?
> ...

 

Well, my process was pretty much like this:

- boot to live cd, partition ssd, copy everything except /dev, /proc and such to ssd

- chroot to ssd, make sure stuff works more or less

- shut down, unplug hdd, plug ssd in place of hdd

- boot to live cd, chroot, reinstall grub, switch to noop scheduler

- reboot, recompile kernel (the upgrade to 2.6.30 that i mentioned)

Since then I've had a chance to run a couple of emerge worlds and everything seemed to run fine.
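
For anyone wanting to verify the scheduler step: on Linux the active I/O scheduler is the name shown in square brackets in `/sys/block/<device>/queue/scheduler`. A minimal Python sketch for reading it; the device name `sda` is only a placeholder, not something taken from my setup:

```python
import re


def active_scheduler(sysfs_text):
    """Return the scheduler name shown in [brackets], or None if absent."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


def read_active_scheduler(device="sda"):
    """Read the active scheduler for a block device ("sda" is a placeholder)."""
    with open(f"/sys/block/{device}/queue/scheduler") as fh:
        return active_scheduler(fh.read())
```

If this still reports the old default after a reboot, the switch to noop was not made persistent.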

Haven't tried other OS. Could theoretically clean a few gb of the drive and dual boot to Arch, see how it'd handle...

I'd assume, though, that the memtest errors point to a hardware (rather than software) failure :/

Last edited by ziller on Mon Jul 06, 2009 12:42 pm; edited 1 time in total

----------

## MaximeG

Yeah.

Worth the try before changing hardware.

----------

## ziller

An update. Flashed the new BIOS a week ago and it ran (with all 8gb of ram) nonstop for a week until I ran into the same issues as before: firefox segfaults, processes dying, complete system freeze and memtest giving a million errors.

I noticed that there was one ram/slot combination I hadn't yet tried so I reconfigured and memtest was happy again. The box is now running, but I wonder for how long...

When things start going down again, I think I'll try another RAM vendor next (long shot, but maybe all of my ram blocks are dead??). Next thing will be a replacement mobo.

Oh and thought this might be worth mentioning, I've got an Asus P5Q-E board and Corsair 8500C5DF RAM (x4)

...and thanks for all the replies!

----------

## pdr

Check in BIOS that it has set the right voltage and timing for your ram - I've had the autodetect muck up before (only on 1 motherboard). Although I have to say that it running for a week and then crashing is troubling. Did you try memtest during the week, or only after things started failing? Anything extraordinary going on (e.g. you emerge --sync once a week and that is when it freaked out, or you do weekly backups then)?
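
If rebooting into memtest during the week is a hassle, a crude check can be run while the system is up. Below is a minimal Python sketch of a fill-and-verify pattern test. It is far weaker than memtest86+: the OS decides which physical pages back the buffer, and small buffers may never leave the CPU cache, so a clean result proves little. `pattern_test` is a hypothetical name.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill a buffer with each byte pattern, read it back, count mismatches."""
    buf = bytearray(size_bytes)
    errors = 0
    for value in patterns:
        buf[:] = bytes([value]) * size_bytes      # write pass
        errors += size_bytes - buf.count(value)   # read-back pass
    return errors
```

Running something like `pattern_test(512 * 1024 * 1024)` every day and noting when the count first becomes non-zero would show whether the corruption builds up gradually or appears only at crash time.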

----------

## ziller

Hm, yeah. I already once set the frequency manually, didn't help and it's back on auto now... voltage I didn't try yet at all, but will do that.

Only ran memtest after things crashed miserably. Interestingly, though, from my previous runs I've noticed a strange pattern: it runs for about a week, then fails; I unplug it and shuffle ram until memtest is happy; then it runs for a few days and fails again, and this repeats at shortening intervals. Unplugging the whole thing for a few hours seems to reset the "cycle" and it runs for a week.

Can't really figure out what would trigger the weekly failure, except that saturday and sunday are the days I actually use it the most (otherwise it just idles with an open vnc session from work from monday to friday). It nearly always crashes while I'm not sitting at the computer, though. Nothing fancy running...a few desktop apps.

If the thing still runs when I get home today, I might let memtest run fully (takes hours to complete with 8gb =/) and see what it says when nothing's crashed yet.

----------

## depontius

Have you tracked your various temperatures?  If the fails are happening on the days when you're using it the hardest, could overheating be a problem?  You do have lm_sensors installed, don't you?

----------

## ziller

 *depontius wrote:*   

> Have you tracked your various temperatures?  If the fails are happening on the days when you're using it the hardest, could overheating be a problem?  You do have lm_sensors installed, don't you?

 

Yep, see my post above. Everything's fine within the parameters and there's practically no deviation even under moderate use.

...and when I say "use the most", it means that I might check slashdot a bit more frequently than usual. It's in use over vnc every day from work anyway, there's not much more going on on weekends. It's not like I'd be running BOINC or recompiling world for fun from friday to sunday :)

----------

## pdr

You are more tolerant than I. This is (past) the point where I would just buy new hardware. The only time I've seen failures-over-time has been heat related; keep in mind though that it can be indirect - a diode getting old and heat-intolerant in the power supply can lead to crashing because of the dirty power.

However since this started when you replaced/installed a hard drive, that implies a mechanical problem - you jostled something. But it is pretty unlikely that a mechanical problem (heat sink askew, wire loose) would run fine for a week before, unjostled, flaking out again.

I have to say that on some motherboard reviews on New Egg, I HAVE seen complaints (including on a board I bought) that it worked with 4GB but failed with 8. If you haven't, might want to google around and see if it is an issue with your board.

----------

## ziller

Well, googling did reveal a bunch of complaints on various forums, for instance this one sounds all too familiar... http://www.overclock.net/intel-motherboards/483320-asus-p5q-e-ram-slots-going.html

I'll try once more, keying in the voltages and frequencies manually but basically I've given up all hope already.  And reading the opinions online, I'm starting to mistrust P5Q-E and its ram handling capabilities. Afraid that if I buy the same board again, all this might repeat in a few weeks...

Need to find a more reliable replacement.

----------

## pdr

I buy almost all my stuff (and I buy WAY too much - currently replacing my server micro-atx with a Shuttle) from NewEgg.com - even if you don't buy from them the customer reviews are tremendously helpful. For example if you've got your heart set on a P45 board, the best rated (with 959 reviews) is "GIGABYTE GA-EP45-UD3R LGA 775 Intel P45 ATX Intel Motherboard - Retail". And right along with that is that it did not get 5 "eggs", so might want to reconsider the chipset. However, even if something is listed as 4 eggs, want to go read the reviews and find out what they are complaining about - I've had some where they were complaining about the Windows drivers/software/etc - who cares?

----------

## ziller

Actually, I do the same, except that I use a local vendor. The P5Q-E did look good according to customer reviews (ignoring complaints from vista users :D), but maybe I should have looked a bit deeper - Asus support forum revealed quite a few ram issues for the board.

Though, on the other hand, if you look hard enough you can probably find issues for any board, no matter how good it is.

On topic though, I tried keying in voltages and frequencies manually, according to Corsair's specs, and pushing northbridge voltage up according to recommendations from Asus support for running 8gb of ram. Result: memtest runs okay right after boot, but the whole box crashed in a record-breaking 3 hours.

I pulled out half of the ram and it's now running with 4gb (and manual settings), seemingly stable. I expect it to crash in a week, but that should be long enough to find a proper replacement board =)

----------

