# SMP unstable, why?

## jingo

Hi, 

I bought an extra CPU (Pentium II 400 MHz) for my ASUS P2B-D motherboard today. The machine has been running Red Hat 7.3 as a stable server for almost 2 years in single-processor operation. But now that I have installed the second CPU and started installing Gentoo, it hangs, and bzip2 often gives an error, e.g. when decompressing the stage3 tarball.

I tried "under"clocking the bus speed to 83 MHz, but that doesn't seem to help.

Any suggestions? Please!

Jingo

----------

## Paradigmbreak

Same problem.  I have a dual Athlon 1900 system that I previously ran RH and Suse on.  It is currently stuck (looks hung) in the untar phase of stage3 at e2fsck.  Was about to burn a new disk figuring it was a medium error.  Perhaps not.  Maybe I will start from stage 1 instead.

----------

## timfreeman

I've got an SMP system that works really well with Gentoo and anything else. What kernel are you using? Are the chips identical? What kind of RAM do you use? Has it been checked?

I guess me not having similar problems (that have been fixed) is useless to you, but at least I may be able to tell you what works for me ..

look around too, there was a lot here about SMP when I was shopping for a motherboard.

----------

## id10t

No problem with either Gentoo or Slack on a dual P2-450 and now on a dual AMD 1.2 GHz "Palomino" core setup.

----------

## ronmon

Often, an SMP system will make hardware problems more evident, especially with RAM. I had a similar problem with a VP6 about a year and a half ago. Slack and W2K ran without any problems, but it crapped out during my first Gentoo install in a way that sounds similar to yours. It turned out that one stick of RAM was bad so I yanked it and that did the trick.

Check freshmeat for an app called 'memtest86'. It is extremely thorough and will tell you if that's your problem.

----------

## timfreeman

that is exactly what happened to me

I took some advice here and left memtest86 running one day while I was out for 5 or 6 hours, and discovered an intermittent RAM error!

It's a nice program and a very good idea to run once in a while on your computers. It fixed an intermittent compiling problem for me (though that wasn't on the SMP box).

(Plus, if the RAM is junk, you might convince yourself to buy bigger sticks to replace it.)

----------

## jingo

I managed to install Gentoo using the single-processor kernel on the LiveCD. Compiling my own kernel with SMP seemed to work... but it froze again!   :Sad: 

How important is the stepping? I found that my new P2 400 MHz has stepping 2, whereas my old P2 400 MHz has stepping 1!

Right now I am running memtest86. It has found a few errors so far... bad. Could this be because the steppings are different?
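
A minimal sketch for comparing the steppings that Linux reports, assuming an x86 Linux system where `/proc/cpuinfo` lists `cpu family`, `model` and `stepping` for each processor. The function names `parse_cpuinfo` and `steppings_match` are made up for illustration.

```python
import os


def parse_cpuinfo(text):
    """Return one dict per processor from /proc/cpuinfo-style text."""
    cpus = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line separates one processor entry from the next.
            if current:
                cpus.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        cpus.append(current)
    return cpus


def steppings_match(cpus):
    """True when all processors report the same family, model and stepping."""
    ids = {(c.get("cpu family"), c.get("model"), c.get("stepping")) for c in cpus}
    return len(ids) <= 1


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        cpus = parse_cpuinfo(f.read())
    for c in cpus:
        print(c.get("processor"), c.get("model name"), "stepping", c.get("stepping"))
    print("identical:", steppings_match(cpus))
```

This only shows what the kernel sees; it says nothing about whether a given pair of steppings is supported together.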

----------

## timfreeman

(I don't know about the stepping) 

I'm not sure if it could have anything to do with your CPUs. Does anyone know? I think SMP just stresses the RAM in different ways, so you may have found your problem (the RAM)! A lot was fixed after I replaced the offending stick.

----------

## timfreeman

I'm interested in the stepping thing, so I queried my favorite search engine ( :Rolling Eyes: ) for "different stepping" smp .. and for "mixed stepping" smp

A lot of people report that Intel states that stepping codes, cache sizes, and related things shouldn't theoretically have an effect.

But people have also reported problems (or at least that they think it's the problem).  

here's a website on the cautious side:

http://new.linuxnow.com/docs/content/SMP-HOWTO-html/SMP-HOWTO-3.html

there's a lot out there on the subject.   but I'd definitely replace that RAM if it's giving you ANY trouble..

----------

## ronmon

I'm on my third SMP box in four years and I've never even thought about using unmatched procs. I don't doubt what timfreeman dug up, but I was unaware that it was possible to do. It is surely not an optimum configuration and is likely to be the root of your problem.

By using a non-SMP kernel you are effectively shutting off the unmatched processor and that's why it works. Just a guess, but when you boot the SMP kernel you probably get a bunch of 'EIP' messages as it dies.

----------

## jingo

I tried a different RAM stick. This one should be healthy, but memtest86 crashed during the test with an "unexpected interrupt" error!?

I have been able to run for quite a long time now; I emerged for about 3-4 hours without problems, with MAKEOPTS="-j4" and using distcc too, which should make both processors and my desktop computer compile... it worked!

But of course it suddenly hung.

No, I am not getting any messages at all when it crashes, it simply hangs.

----------

## seigen

I have an ASUS A7M266-D.

Very recently memtest would easily pass the memory when I had slots 1 and 3 filled (slot 1 being the one closest to the CPU). I waited until at least one test was done.

However, when I actually tried to compile anything, it would generally segfault after a short time.

After moving the memory to positions 2 and 3 (of 4 total) everything is fine  :Wink: .

I'm guessing that memtest only uses one CPU and thus doesn't test what happens when the other CPU is accessing the same memory.  The wire lengths are a little different when that CPU accesses memory, I guess, and that affects the reliability somehow (again, guessing).

At any rate if you have a dual cpu machine and can't think of anything else I suggest trying just one stick of your best memory in every slot you have.  At the very least, it can't hurt..  (If you get one stable, then try adding another and repeating..)

Currently I am running 2.6.0-test5-mm1 with a Ti 4200 video card and it seems pretty stable.  The latest nvidia binaries do compile, but mplayer playback was corrupt and seemed to cause instability, so I switched to the nv driver that comes with XFree86.  (I suspect the instability of the nvidia binaries is at least partly due to the ASUS A7M266-D's design, and I can't really recommend the board to anyone.  The board wouldn't even boot with one GeForce FX card.)

----------

## timfreeman

Please keep updating, this is interesting.  

I only ever had a problem on a non-SMP machine. I'd like to know if there is any way to deduce whether 2 CPUs would crash where 1 CPU would not, given the same RAM sticks and configuration. (I mean besides trying: the reasons, or the hardware models/companies that are tried and true.)

----------

## ronmon

I also have an A7M266-D. For stability's sake, I only have one stick of RAM installed.

But it's 1GB ECC registered Samsung.  :Smile:  Solid as a rock.

----------

## jingo

According to http://users.erols.com/chare/mixed.htm there shouldn't be any problems related to mixing my two processors (stepping dA1 and dB0).

memtest86 is running right now, testing my RAM stick in DIMM slot 3. I will test all four DIMM slots.

If memtest passes, how can I test to be sure it is stable under an SMP kernel?
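
One rough way to exercise memory from several processors at once under a running SMP kernel is sketched below. It is an illustration only and far less thorough than memtest86: each worker process fills a buffer with a known pattern, copies it repeatedly, and checks that the SHA-256 digest never changes. The names `worker` and `run` and the buffer sizes are assumptions chosen for this example.

```python
import hashlib
import multiprocessing as mp
import os


def worker(args):
    """Fill a buffer with a seeded pattern, copy it repeatedly, count mismatches."""
    seed, size_mb, rounds = args
    block = hashlib.sha256(str(seed).encode()).digest()  # 32-byte pattern
    buf = bytearray(block * (size_mb * 1024 * 1024 // len(block)))
    expected = hashlib.sha256(buf).hexdigest()
    errors = 0
    for _ in range(rounds):
        copy = bytes(buf)  # forces a full read and write of the buffer
        if hashlib.sha256(copy).hexdigest() != expected:
            errors += 1
    return errors


def run(procs=None, size_mb=16, rounds=5):
    """Run one worker per CPU in parallel and return the total mismatch count."""
    procs = procs or os.cpu_count() or 2
    jobs = [(i, size_mb, rounds) for i in range(procs)]
    with mp.Pool(procs) as pool:
        return sum(pool.map(worker, jobs))
```

To try it, call `print(run())` from inside an `if __name__ == "__main__":` guard so that one worker starts per CPU. Any non-zero result on a box that is clean with a single processor would suggest a hardware problem rather than a Gentoo one, though a clean run proves very little.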

----------

## jingo

Been testing for a few hours now.

It doesn't seem to matter which dimm-slot is used for RAM.

No problem with a single processor, but with both processors attached, I get "unexpected interrupt - halting" from memtest after 5 min at most.

I have absolutely no clue anymore. I raised my bus speed to 100 MHz again, no change.   :Confused: 

----------

## timfreeman

any luck?

----------

## jingo

nope...

Stopped testing... but I will do some more testing after my exams this winter.

----------

## Tuna

I had a problem with an SMP machine hanging under heavy load. Probably this was connected to the aic7xxx SCSI driver. I tried different 2.4 kernels without luck. Now it has been running without any hiccups since I installed the 2.6 test kernel (3 weeks ago) on the machine...

----------

## timfreeman

Just a thought, jingo: have you upgraded the BIOS lately?  I read that this board also had some ACPI compliance problems (though why that would show up in memtest is beyond me).

Also I was looking around and found something interesting about this board.  Here the steppings are the same, but the speeds are different, weird. 

 *Quote:*   

> What about mixing Celeron and Pentium II processor ?
> 
> A system using a "re-enable" Celeron processor and a Pentium II processor with the same steppings may theorically work. 
> 
> Alexandre Charbey as made such a system: 
> ...

 

http://ouray.cudenver.edu/~etumenba/smp-howto/SMP-HOWTO-4.html

----------

## jingo

 *Tuna wrote:*   

> I had a problem with an SMP machine hanging under heavy load. Probably this was connected to the aic7xxx SCSI driver. I tried different 2.4 kernels without luck. Now it has been running without any hiccups since I installed the 2.6 test kernel (3 weeks ago) on the machine...

 

I am using the aic7xxx SCSI driver.

Which 2.6 release would you recommend? This machine should run as a server!

It is quite a new BIOS. I am not too happy about trying out beta versions of the BIOS, which the newest one is. There is very little information about the differences between BIOS versions... no changelog!

I will give the 2.6 kernel a try soon.

----------

## jingo

I have been giving the 2.6 kernel a try with my SMP config. It seemed more stable than when running 2.4.20.

Still, I get lots of segmentation faults, especially while compiling.

I turned preemption off in the kernel and set the SMP parameter to optimize for 2 CPUs.

During my testing the kernel didn't hang like 2.4.20 did, but the segmentation faults rendered the config unusable.

What to try next? Are there any tools for locating what the problem might be?
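
Short of dedicated tools, one cheap check is to repeat a deterministic job and compare the results, since bzip2 errors and seemingly random segfaults both suggest that identical input is not always producing identical output. Below is a minimal sketch using Python's standard `bz2` and `hashlib` modules; `make_payload` and `round_trip_failures` are illustrative names, and the payload size and iteration count are arbitrary.

```python
import bz2
import hashlib


def make_payload(size=1 << 20):
    """Build a deterministic byte string of the requested size."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:size])


def round_trip_failures(payload, iterations=10):
    """Compress and decompress repeatedly; count runs that differ from the first."""
    reference = hashlib.sha256(bz2.compress(payload)).hexdigest()
    failures = 0
    for _ in range(iterations):
        try:
            packed = bz2.compress(payload)
            same_stream = hashlib.sha256(packed).hexdigest() == reference
            same_data = bz2.decompress(packed) == payload
        except (OSError, ValueError):
            # bz2 reports a corrupt or truncated stream with these exceptions.
            same_stream = same_data = False
        if not (same_stream and same_data):
            failures += 1
    return failures


if __name__ == "__main__":
    print("failures:", round_trip_failures(make_payload()))
```

Running two copies at the same time under the SMP kernel, and then one copy under the single-processor kernel, gives a rough comparison; failures that only appear with both CPUs busy would point toward hardware.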

----------

## Spooky Ghost

Have you tried using the new processor by itself?  Run the system with some CPU- and memory-intensive stuff to see if it still has problems.  If that works, try adding the old processor in the second slot and see what happens.  If your board isn't picky about which slot/socket is used when running UP, try each one individually.  Also, how are the cooling and the power supply?  Perhaps the extra processor adds just a bit too much heat or draws too much extra power.

----------

