# Bad memory, or other?

## Drone1

Here's the long and short of it.

System was a stable Athlon64 mythbox, but I decided to make it my desktop. Expanded to 2GB of RAM from 1GB, added a few more HDs I had lying around, upgraded the power supply to compensate, and then set it back up with 2.6.26. The system now hard locks on large emerges (kde-meta, 300+ packages... I got a little USE-flag happy....). I can leave the system running for days without errors or lockups. It's only when I put an emerge load on it that the locking occurs, and magic SysRq is of no use.

Ran memtest86 for 24 hours and it ONLY came up with this:

```
WallTime   Cached  RsvdMem  MemMap    Cache  ECC  Test  Pass  Errors  ECC Errs
24:45:00   1919M   212M     e820-Std  on     off  Std   30    23      0

Tst  Pass  Failing Address      Good      Bad       Err-Bits  Count  Chan
8    0     0003234df88 - 803MB  00000000  00000000  00000001  23
```

Other than the 1GB upgrade, the additional HDs (the original SATA HD from the mythbox retains its primary position, with no other HDs mounted), and the power supply upgrade, is that one bad bit enough to produce the problem I'm seeing? Everything else appears normal, i.e. log checks, etc.

Help is appreciated.

----------

## poly_poly-man

feels like processor overheating or a dead PSU to me... more likely the latter.

poly-p man

----------

## cyrillic

 *Drone1 wrote:*   

> is that 1 bad bit enough to produce the problem I'm seeing? 

 

Yes.

In my experience, gcc performs a much better stress test than memtest86 does, so you probably have many marginal (bad under load) bits.

One thing you could try is underclocking your RAM in the BIOS.

----------

## poly_poly-man

 *cyrillic wrote:*   

>  *Drone1 wrote:*   is that 1 bad bit enough to produce the problem I'm seeing?  
> 
> Yes
> 
> In my experience, gcc performs a much better stress-test than memtest86 does, so you probably have many marginal (bad under load) bits.
> ...

 

well... since the CPU IS the memory controller, there's also the chance that it's overheating / not getting enough power (those are mutually exclusive, btw  :Very Happy:  ), and the CPU is at fault, not the memory.

Check everything, tho.

poly-p man

----------

## Akkara

Try memtester.  It is a user-level app used on a running system (run it as root) that tests memory "live".  It is in portage.  My experience is that it finds bad memory better and faster than memtest86+.
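For reference, a typical invocation looks like this (the size and pass count here are illustrative; leave some headroom so the rest of the system isn't starved while it runs):

```
# install from portage, then run as root so it can lock the pages it tests
emerge sys-apps/memtester
memtester 1500M 5     # test ~1.5GB of the 2GB, 5 passes
```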

Regarding only one bit being off: what might be happening is that the added RAM slows the signals on the bus just enough to be marginal.  If this is the cause, you can try bumping the CAS timing up by one (or, if the mobo lets you, 1/2), or downclocking the RAM by 5-10%.

----------

## Drone1

I was able to get kde-meta, and the list of others, emerged, but not in the way you'd expect.

Begin Small Confession: There was one other change made in moving from the mythbox to the desktop setup. I intentionally left it out (I apologize for the misdirection), because 1) I was in denial, as it was my first hunch as to what was causing the issues, and 2) it would probably lead to the discovery of either faulty hardware or incompatible/incomplete driver/software support in Linux.

The change in question was enabling onboard video. In the BIOS, setting 'Init Device' to 'Onboard' from 'PCIx' (an auto option does not exist, which I find strange...) and giving 'VideoFB' a RAM setting makes the system unstable during emerges. I know this for a fact, as after switching it back to 'PCIx', the emerge ran through the night and finished as I was leaving for work this morning...

So, if I enable onboard video and it uses shared memory (as onboard video does), that leads to a hard lock during emerge.

Questions:

1) How do I go about troubleshooting this so I can utilize the onboard video? I'm not going to run KDE, then disable the onboard video every time I need to emerge... F that.

2) I feel the bad memory bit isn't an issue at this point, since it completed that rather large emerge (it took 8 hours....), but my coworker informed me I could add a 'mem' variable to the kernel's append line in lilo, which would drop/ignore/step over the use of that bit.

3) Since I don't know exactly what's going on between the shared memory for the video and Linux's use/allocation of it, is there a way to restrict which memory block the onboard video uses? Is what I'm asking even possible... would it even help... am I going in the right direction here???
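On question 2, a hedged sketch of what that lilo.conf append line could look like (the image path and label are illustrative; the address comes from the memtest86 output above, rounded down to a 4K page; depending on bootloader/kernel version the `$` in `memmap=` may need escaping):

```
# /etc/lilo.conf fragment (illustrative image/label)
image = /boot/vmlinuz
        label = gentoo
        read-only
        # Option A: cap usable RAM below the failing address (~803MB),
        # sacrificing everything above it:
        #append = "mem=800M"
        # Option B: reserve just the bad 4K page around 0x3234df88,
        # keeping the rest of the 2GB usable:
        append = "memmap=4K$0x3234d000"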

I'm willing to do whatever it takes to solve this problem, 'cause well..... I'm not buying new hardware for a while, and this is slightly uncharted territory for me, so I'll learn something new.

Thoughts?

----------

## cyrillic

 *Drone1 wrote:*   

> So, if I enable onboard video and it uses shared memory (as onboard video does), that leads to a hard lock during emerge. 

 

In my mind, this pretty much proves that you are dealing with a bad RAM issue.

----------

## pappy_mcfae

I'll second that.  Pull the new stick of RAM, and retry.

Blessed be!

Pappy

----------

## jcat

 *Drone1 wrote:*   

> 
> 
> 2) I feel the bad memory bit isn't an issue at this point, since it completed that rather large emerge (it took 8 hours....)

 

Bad memory is always an issue.  I would advise pulling out the new RAM.  If you're going to upgrade the memory, maybe consider replacing all of it so you have a properly matched pair.  That would reduce the chance of issues.

Cheers,

jcat

----------

## Drone1

Actually, both sets of sticks are matched pairs. The 1st set was bought when the system was built mid-'06; the 2nd set was bought in January. ALL are the same model, brand, timing, capacity, etc... Going through memory tests right now to see which stick(s), or set, is bad. I'll have to see if I can RMA them...

One thing that did bug me when I first got them, though... With the 1st pair in, the memory came up at the correct speed. Tried the 2nd set by itself, and it also came up at the correct speed. Install both sets, using both banks of the mobo..., and the BIOS (set to autodetect) clocked them at a lower speed...

So, individually they run at the correct speed, but once both banks are used, they're seen at a slower rate? Would that be a BIOS issue?

----------

## jcat

Would be worth updating the BIOS if there is a newer version available.  Definitely worth a try.

Cheers,

jcat

----------

## Akkara

 *Quote:*   

> So, individually they run at correct speed, but once both banks are used, they're seen at a slower rate?

 

That can happen.  Driving two sets of RAM loads the bus down and slows the signals.  Depending on how close to the edge the timing was to begin with, and how much drive the controller has, it can definitely slow things down.

The memory error you saw might suggest it isn't being slowed down enough.  Also, some brands of sticks can load the bus more than others.

There are sometimes obscure BIOS settings such as 'memory drive termination voltage' and similar that might help.  (Be careful with these, however; more isn't always better.)

----------

