# Random segfaults

## milothurston

I previously had some problems after a glibc upgrade, and the eventual fix was to rebuild the entire system from a stage 3 install using my old world file. Most of it runs except for some things that cause random segfaults: psql, wish, sed, postmaster and cupsd. There may be others, but I haven't run into them yet.

Considering that I rebuilt this system, have since done an emerge -e on the packages affecting the programs above, and checked the machine's memory with memtest86 (it passed), I find this very odd. Can anyone offer suggestions? If I can't fix these, I'll have to wipe all but /home and convert the machine to Debian, as I need those things working soon. This is something I'd rather avoid, as you might guess. ;)

----------

## wrc1944

Have you investigated your CPU temps and cooling system? Hot-running systems can cause random problems like this. It might be as simple as re-applying the thermal "grease" between the heatsink and CPU (it does eventually dry out and lose effectiveness), and/or blowing accumulated dust out of the fans, etc. I'd use high-quality stuff like Arctic Silver, not the factory stock junk most systems come with, or stock "pads."

You could quickly check temps by rebooting into the BIOS and looking under "PC health," or something similar (assuming your system has a normal BIOS with normal settings). Otherwise, install gkrellm, and set it up under "built-in sensors" to check on fans and temps. Your kernel will need the i2c stuff enabled as modules (for your system), and those modules listed in your /etc/modules.autoload.d/kernel-2.x file, so they load on boot.

On AMD systems, CPU temps over 48-50C can cause problems, no matter what AMD says. Intel is probably similar.

----------

## milothurston

Thanks. Temperature does not seem to be a problem at the moment, but I will keep an eye on it as this is an Athlon system. It also seems that though some of the affected programs appeared to work before, none are now working, so I wonder if there is a corrupted library somewhere or other that I can't locate.

----------

## wrc1944

If temps are OK, that's not it. When you say you ran "emerge -e on the packages affecting the programs", do you mean emerge -e system? I'd try an emerge -e system, then emerge the affected programs again afterwards. Have you done a "revdep-rebuild -p" to check on consistency? Before you do emerge -e system, what's your output of emerge --info? Have you checked the Gentoo bugzilla site for these problems? It's a good resource for fixing stuff.

This problem can probably be fixed pretty quickly.

----------

## milothurston

To restore the system, I built binary packages (using a stage 3 install) for all the system packages, and then installed the packages on my broken system. After that, I did an emerge -e system and re-merged everything in the world file.

Some programs still failed to work. So, I tried emerge -e on each package, e.g. "emerge -e postgresql". Also, I ran ldd on some of them, determined which packages owned the libraries ldd identified, and then did an emerge -e on those packages, too. I've not done a revdep-rebuild yet.

emerge --info is:

```
Gentoo Base System version 1.6.13
Portage 2.0.51.22-r3 (default-linux/x86/2005.0, gcc-3.3.6, glibc-2.3.4.20041102-r1, 2.6.14-gentoo-r1 i686)
=================================================================
System uname: 2.6.14-gentoo-r1 i686 AMD Athlon(tm) MP 2000+
dev-lang/python:     2.2.3-r5, 2.3.5-r2, 2.4.2
sys-apps/sandbox:    1.2.12
sys-devel/autoconf:  2.13, 2.59-r6
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r1
sys-devel/binutils:  2.15.92.0.2-r10
sys-devel/libtool:   1.5.20
virtual/os-headers:  2.6.11-r2
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-march=athlon-mp -Os -pipe -fomit-frame-pointer"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3.1/share/config /usr/kde/3.2/share/config /usr/kde/3.3/env /usr/kde/3.3/share/config /usr/kde/3.3/shutdown /usr/kde/3.4/env /usr/kde/3.4/share/config /usr/kde/3.4/shutdown /usr/kde/3/share/config /usr/lib/X11/xkb /usr/lib/mozilla/defaults/pref /usr/share/config /usr/share/texmf/dvipdfm/config/ /usr/share/texmf/dvips/config/ /usr/share/texmf/tex/generic/config/ /usr/share/texmf/tex/platex/config/ /usr/share/texmf/xdvi/ /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-march=athlon-mp -Os -pipe -fomit-frame-pointer"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig ccache distlocks sandbox sfperms strict"
GENTOO_MIRRORS="http://gentoo.osuosl.org/ ftp://distro.ibiblio.org/pub/linux/distributions/gentoo/ ftp://gentoo.chem.wisc.edu/gentoo/ ftp://gentoo.mirrors.pair.com/ ftp://mirrors.blueyonder.co.uk/mirrors/gentoo ftp://ftp.mirrorservice.org/sites/www.ibiblio.org/gentoo/ "
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/bmg-gnome-current /usr/local/bmg-main /usr/local/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="x86 X aalib acl alsa apm audiofile avi berkdb bitmap-fonts bonobo bzip2 cdr crypt cscope cups curl directfb eds emboss encode esd exif expat f77 fam fbcon ffmpeg flac foomaticdb fortran gd gdbm ggi gif glut gnome gphoto2 gpm gstreamer gtk gtk2 gtkhtml guile hal idn imagemagick imap imlib ipv6 jpeg junit lcms ldap libg++ libwww mad mhash mikmod mng motif mozilla mp3 mpeg mysql ncurses nls nptl nptlonly ogg oggvorbis openal opengl oss pam pcre pdflib perl png postgres python quicktime readline real recode ruby sdl slang spell sse ssl svga tcltk tcpd tetex tiff truetype truetype-fonts type1-fonts udev usb vorbis wmf xine xml xml2 xmms xv xvid zlib userland_GNU kernel_linux elibc_glibc"
Unset:  ASFLAGS, CTARGET, LANG, LC_ALL, LDFLAGS, LINGUAS
```

I'll keep checking the forums, but I've not found any similar problems there yet.

----------

## wrc1944

Definitely run revdep-rebuild -p and see if you have any broken links. I'd be willing to bet you do. I also seem to recall reading a while back that the cflag -Os caused problems with compiling a few packages- I can't recall which ones, or if it's even relevant to your problem.
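For readers unsure what a "broken link" looks like here: when a binary depends on a shared library that can no longer be resolved, `ldd` prints that dependency as `=> not found`. The snippet below is only a rough sketch of that one check, written in Python, and it is not how revdep-rebuild itself works. The function name `missing_libraries` and the sample output are made up for illustration.

```python
import re


def missing_libraries(ldd_output: str) -> list[str]:
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved dependencies are printed as "libfoo.so.1 => not found".
        match = re.match(r"\s*(\S+)\s+=>\s+not found", line)
        if match:
            missing.append(match.group(1))
    return missing


# Hypothetical ldd output, invented for this example (not from the thread).
sample = (
    "\tlibpq.so.4 => /usr/lib/libpq.so.4 (0xb7f2a000)\n"
    "\tlibreadline.so.5 => not found\n"
    "\tlibc.so.6 => /lib/libc.so.6 (0xb7dd1000)\n"
)

print(missing_libraries(sample))
```

On a real system the input would come from running `ldd` on the failing binary. An empty result only rules out missing libraries; it says nothing about a library that is present but was built against a different glibc.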

----------

## thiagonunes

Excuse my English, I'm from Brazil.

I had the same problem once. I checked the processor temperature and it was normal, but I tried replacing the processor anyway, and then everything was solved.

----------

## wrc1944

One other thought- memtest86 does not always detect all memory problems. In fact, it really can't be totally relied on for troubleshooting ram to the point where you will know for sure it's completely good ram. If you have another known good stick you can use for a test, I'd do that,  or buy some really good ram and replace, once you've eliminated everything else. 

Have you checked the RAM timings and voltage? Maybe the RAM is overheating for some reason, like airflow problems or too high a voltage. What type of system, BIOS, and RAM do you have? Cheapo RAM can lead to problems.

----------

## milothurston

I've switched the -Os to -O2 and done a revdep-rebuild plus an emerge -e on the affected programs, but this has resulted in no change. In fact, they are definitely segfaulting all the time now. I've not found anything else that's affected.

I'll see what I can do about the memory. I'm not sure of the type except that it is 2GB of ECC DDR RAM. The board is a Tyan Thunder K7 S2462 with 2x Athlon MP on it.

It looks like I will have to do the Debian upgrade at some point. I don't like Debian but it is a standard system at work now, and this machine is left over from before the standardisation, so there is pressure to do the switch anyway. :(

----------

## wrc1944

Just so I'm clear, am I correct in thinking the problem apps compile fine, but while running they randomly segfault? Are these indeed the only apps that do this? Do they always segfault on a specific action?

How did you check the CPU and motherboard temps, in the BIOS? If so, what are they? I'm still not convinced that's not the problem. With all the compiling, Gentoo uses the CPU a lot, and it will run hotter. BTW, that board does take the ECC type memory you mention, so the type is correct. But it would seem like you'd have other random segfaults on other apps if the memory or heat was an issue.

Have you tried taking off a side cover, and blowing a small desktop fan directly into the computer as a test? This usually eliminates airflow problems, and if the segfaults go away, you have a heat problem.

One other thought. Even though you're running an x86 system (so-called "stable"), have you tried updating the programs giving you the trouble with ACCEPT_KEYWORDS="~x86" emerge -p yourprogram, so that portage brings in updated versions? Since you said you updated glibc, maybe the others giving you problems need to be updated, too. It's worth a shot, and you can always go back to a previous version.

More thoughts:

Have you changed gcc versions lately?

Do your problem apps have the needed USE flags set in /etc/make.conf? 

Are you sure your kernel has dual cpu support enabled? Is there any special kernel support the problem apps need? I'm unfamiliar with them except cupsd.

Your gcc and glibc are pretty old. Some people use gcc-3.4.4 and glibc-2.3.5 on an otherwise pure x86 system. I've only run ~x86 systems from the beginning, so I don't really know the ropes on pure x86. 

Have you tried booting to a knoppix (or other) live cd, and running a few apps for a time, and seeing if you still get any random segfaults? If you do, it's virtually certain it's a hardware issue, probably heat.

----------

## milothurston

Thanks for your response, wrc1944. Answers are below:

What I initially thought was random segfaulting is in fact specific programs segfaulting immediately when I try to run them. I have also just installed openoffice-bin-2.0.0, and that segfaults, too.

Temperatures in the BIOS are about 45C, which is typical for this machine. It's not had this sort of trouble before. I am trying to get lm_sensors working so that I can check the temperature under heavier load.

Dust has been blown out quite recently, but I'll give it another try when I next get chance to take it down.

I tried upgrading packages, but with no success. I am going to try a newer version of gcc to see if that helps - I am wary of upgrading glibc again.

The only change was the upgrade to glibc 2.3.5 that broke my system. I've now gone back to the previous version. /etc/make.conf looks OK, and the kernel configuration is fine. I've tried a couple of different kernels (gentoo 2.6.13-r5 and .14-r1), but this makes no difference. When running under knoppix or the gentoo rescue CD I don't see any segfault problems. All I can think of is that there is something from glibc 2.3.5 still tucked away somewhere.

----------

## wrc1944

Segfaulting immediately when you open them makes me think they aren't finding some needed lib, or worse yet, they have specific bugs in those versions, either the ebuild, or the program itself. 

On the other hand, if your AMD CPU temps get much over 45C, in my experience that can start causing weird stuff like this, and I've built quite a few AMD systems over the last 5 years.

Have you tried gcc-3.4.4 and the latest glibc 2.3.5 and binutils?

I'm running out of ideas. If you're finally getting down to the point of wiping and installing Debian, before you do that, maybe it's worth trying updating Gentoo to ~x86 in /etc/make.conf. Then emerge sync, emerge -uD system and emerge -e system again, and then emerge -e world.

BTW, I don't have lm_sensors installed, and gkrellm sensors works great with the kernel modules set in the kernel, and autoloading them at boot. For example, my kernel config I2C section, and modules.autoload.d file:

```
# I2C support
#
CONFIG_I2C=m
CONFIG_I2C_CHARDEV=m
#
# I2C Algorithms
#
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCF=m
CONFIG_I2C_ALGOPCA=m
#
# I2C Hardware Bus support
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_I810 is not set
# CONFIG_I2C_PIIX4 is not set
CONFIG_I2C_ISA=m
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_PARPORT is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_PROSAVAGE is not set
# CONFIG_I2C_SAVAGE4 is not set
# CONFIG_SCx200_ACB is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_VIA is not set
CONFIG_I2C_VIAPRO=m
# CONFIG_I2C_VOODOO3 is not set
# CONFIG_I2C_PCA_ISA is not set
#
# Miscellaneous I2C Chip support
#
# CONFIG_SENSORS_DS1337 is not set
# CONFIG_SENSORS_DS1374 is not set
CONFIG_SENSORS_EEPROM=m
# CONFIG_SENSORS_PCF8574 is not set
# CONFIG_SENSORS_PCA9539 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_SENSORS_RTC8564 is not set
# CONFIG_SENSORS_MAX6875 is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set
```

--------------------------------------------

My /etc/modules.autoload.d/kernel-2.6 sensors modules (yours might be different):

```
i2c-core
i2c-isa
w83781d
w83627hf
i2c-viapro
i2c-dev
eeprom
adm1021
```
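A quick way to confirm that the modules listed above actually loaded at boot is to compare the list against `/proc/modules`. The following is a minimal Python sketch under a few assumptions: the helper name `missing_modules` and the sample `/proc/modules` text are invented, and hyphens and underscores are treated as interchangeable because 2.6 kernels report module names with underscores (`i2c_core`) even when the autoload file uses hyphens.

```python
def missing_modules(wanted, proc_modules_text):
    """Return the wanted module names that do not appear in /proc/modules."""
    # The first whitespace-separated field of each line is the module name.
    loaded = {
        line.split()[0].replace("-", "_")
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    # Normalise hyphens on the wanted side too, so "i2c-core" matches "i2c_core".
    return [name for name in wanted if name.replace("-", "_") not in loaded]


# Module names from the autoload list in this post.
WANTED = ["i2c-core", "i2c-isa", "w83781d", "w83627hf",
          "i2c-viapro", "i2c-dev", "eeprom", "adm1021"]

# Hypothetical /proc/modules content; sizes and addresses are invented.
sample = (
    "w83781d 30128 0 - Live 0xf8a2c000\n"
    "i2c_isa 4864 0 - Live 0xf8a1f000\n"
    "i2c_core 18816 2 w83781d,i2c_isa, Live 0xf8a0d000\n"
)

print(missing_modules(WANTED, sample))
```

On a live system the second argument would be `open("/proc/modules").read()`. A module built into the kernel rather than as a module will not show up there, so a "missing" entry is a prompt to check the kernel config, not proof of a fault.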

----------

## milothurston

Thanks for the sensors info. I've just upgraded gcc, and I'm trying another revdep-rebuild. Hopefully that will work. If not, I'll try binutils and glibc.

This machine has always run at around 45 degrees, but not done this. If I can keep it going for a few months longer then I can get it replaced with a Dell Precision 650 or similar, which would be rather good.

----------

## wrc1944

Forgot: in the newer kernels there is now another section, which needs to be set accordingly. I don't know what chips are on your Tyan board; mine's a Via KX7.

```
#
# Hardware Monitoring support
#
CONFIG_HWMON=m
CONFIG_HWMON_VID=m
CONFIG_SENSORS_ADM1021=m
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_FSCHER is not set
# CONFIG_SENSORS_FSCPOS is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_VIA686A is not set
CONFIG_SENSORS_W83781D=m
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83L785TS is not set
CONFIG_SENSORS_W83627HF=m
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_HWMON_DEBUG_CHIP is not set
```

----------

## thiagonunes

Again, excuse my English, I'm from Brazil.

But...

Try to reinstall your Gentoo.

First: tar cjplf /home/<your-user>/backup.tar.bz2

p: preserve permissions

l: stay in local file system

Then boot a livecd and try to reinstall. If you get random segfaults again, then you do not have a software problem.

I had a machine with Debian and it worked fine. When I tried to install Gentoo I got random segfaults. Then I began to distrust the processor and replaced it. The processor was not overheating. Now it works well with Gentoo.

----------

## milothurston

I've finally got round to doing the following (lots of meetings at work kept me busy):

1. Upgrade to glibc-2.3.5 and gcc-3.4.4

2. emerge -e system and revdep-rebuild

3. Compile and modprobe lm_sensors drivers.

4. Reboot and check hardware temp.

It seems that the idle temperature shown in the BIOS is about 10C higher than last time I checked, i.e. about 56-59C. When the machine is running, lm_sensors reports that it gets 20C hotter than that! If true, this is certainly very dodgy indeed and I'll have to investigate getting this fixed or replacing the machine. Even if the sensors are wrongly calibrated and the 56-59C result is the correct one, this is still too high.
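To keep an eye on readings like these under load, the text printed by `sensors` can be scanned for temperatures above a chosen limit. The Python sketch below assumes temperature lines of the form `label: +NN.N°C (...)`; the exact labels and layout depend on the lm_sensors version and the chip, and the sample text is invented. The 50C default is the rule of thumb quoted earlier in this thread, not an AMD figure.

```python
import re

# Rule-of-thumb ceiling from earlier in the thread (48-50C); an opinion,
# not an AMD specification.
RULE_OF_THUMB_MAX_C = 50.0

# Matches "label: +58.0°C" and captures the label and the first reading only;
# limits shown later in parentheses are ignored.
TEMP_LINE = re.compile(r"^\s*([^:]+):\s*([+-]?\d+(?:\.\d+)?)\s*°?C")


def hot_sensors(sensors_output: str, limit_c: float = RULE_OF_THUMB_MAX_C) -> dict[str, float]:
    """Map sensor label -> reading for every temperature above limit_c."""
    hot = {}
    for line in sensors_output.splitlines():
        match = TEMP_LINE.match(line)
        if match and float(match.group(2)) > limit_c:
            hot[match.group(1).strip()] = float(match.group(2))
    return hot


# Hypothetical sensors output; labels and values are illustrative only,
# chosen to resemble the 56-59C idle range reported above.
sample = (
    "CPU0 Temp:  +58.0°C  (high = +70.0°C, hyst = +65.0°C)\n"
    "CPU1 Temp:  +57.0°C  (high = +70.0°C, hyst = +65.0°C)\n"
    "Sys Temp:   +41.0°C  (high = +60.0°C, hyst = +55.0°C)\n"
    "fan1:      4500 RPM  (min = 3000 RPM)\n"
)

print(hot_sensors(sample))
```

Voltage and fan lines are skipped because they do not end in a Celsius unit. Whether the readings themselves are trustworthy still depends on the chip being configured and calibrated correctly.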

Thanks, everyone, for your comments!

----------

## wrc1944

That's way, way too high for stable usage, and will definitely cause problems like that. Before replacing the machine, I'd check the heatsink/fan contact to the CPU. It may have completely deteriorated, as this stuff does eventually lose its effectiveness. Let me stress, the "machine" is virtually certain to be OK; it's just that the CPU cooling rig is lacking. If you correct this, I'll bet your problems will go away. Get a serious heatsink/fan combo that's overkill for your CPUs, and apply Arctic Silver contact grease properly. That will do it.

Another possibility is that someone upgraded the CPUs in this box, and used the old stock heatsink/fan meant for lower-powered CPUs.

Still another possibility is that if the BIOS has voltage settings you can adjust, it has somehow gotten set to too high a voltage for the CPU. This will cause drastic overheating, as you describe. If the voltage is set for a regular Athlon XP, it might be way too high for the Athlon MP CPUs. Anyway, definitely check the CPU voltage in the BIOS if you can, and find out what the setting should be for your exact CPU model.

My bet is on the heatsink/fan  and/or contact grease being inadequate, for whatever reason. This system isn't overclocked, is it?

----------

## wrc1944

Check this page out:

http://www.answers.com/topic/list-of-amd-athlon-xp-microprocessors

also:

http://www.amdboard.com/amdid.html

Your Athlon MP CPUs operate at either 1.60 V or 1.65 V (the older MPs are 1.75 V). If the BIOS reports the correct voltage for your MP model, the heat problem is the heatsink/fan contact grease, and/or it's not an adequate heatsink/fan model, or your airflow is somehow drastically restricted (or a combination of all of the above).

Let us know what happens.

----------

## milothurston

Thanks. The sensors report:

```

VCore 1:   +1.76 V  (min =  +1.66 V, max =  +1.82 V)

VCore 2:   +1.78 V  (min =  +1.66 V, max =  +1.82 V)

```

So, presumably it's the heatsinks.
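The check being done by eye here (is each VCore reading inside the min/max shown next to it?) can be sketched in Python as follows. It parses the two lines exactly as posted above; the function name `voltage_report` is made up. Note that the limits `sensors` prints are whatever the monitoring chip or its configuration holds, so a reading inside them does not by itself prove the voltage is right for the CPU.

```python
import re

# Matches lines such as:
#   VCore 1:   +1.76 V  (min =  +1.66 V, max =  +1.82 V)
VOLT_LINE = re.compile(
    r"^\s*([^:]+):\s*\+?(\d+\.\d+) V\s+"
    r"\(min =\s*\+?(\d+\.\d+) V, max =\s*\+?(\d+\.\d+) V\)"
)


def voltage_report(sensors_output):
    """Return (label, reading, within_limits) for each voltage line."""
    report = []
    for line in sensors_output.splitlines():
        match = VOLT_LINE.match(line)
        if match:
            label = match.group(1).strip()
            value, low, high = (float(match.group(i)) for i in (2, 3, 4))
            report.append((label, value, low <= value <= high))
    return report


# The two lines reported in the post above.
sample = (
    "VCore 1:   +1.76 V  (min =  +1.66 V, max =  +1.82 V)\n"
    "VCore 2:   +1.78 V  (min =  +1.66 V, max =  +1.82 V)\n"
)

for label, value, ok in voltage_report(sample):
    print(f"{label}: {value:.2f} V, within limits: {ok}")
```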

I've got a tube of Arctic Silver at home, and I will investigate whether I can get some decent CPU coolers paid for from the appropriate funds (not always an easy task).

----------

## wrc1944

If your Athlon MP CPU is a version that runs on 1.60 V or 1.65 V, and your BIOS is set to 1.75 V, that's a big problem, and it needs to be set properly.

I assume what your sensors report means is that the BIOS is set at 1.75 V, which is only correct for Athlon MP "Palomino" model 6 versions.

Athlon MP "Thoroughbred" (Model 8, 130 nm) and Athlon MP "Barton" (Model 10, 130 nm) versions are either 1.60 V or 1.65 V.

Which CPUs are you using? That is the crucial question. If you have KDE, you can look in KInfoCenter under processors, and it should tell you.

----------

## milothurston

Sorry, I should have also said that cpuinfo reports that the processors are family and model 6 - presumably that's OK for 1.75v.
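For anyone repeating this check, the family and model can be pulled out of `/proc/cpuinfo` and compared with the voltages quoted earlier in this thread. The Python sketch below is illustrative only: the voltage table simply restates the thread's figures (they are not verified against AMD datasheets), it only makes sense for family 6 Athlon parts, and the names `THREAD_VCORE_BY_MODEL` and `cpu_models` plus the sample excerpt are made up.

```python
# Nominal core voltages as quoted in this thread (not checked against AMD
# datasheets): Palomino 1.75 V; Thoroughbred and Barton 1.60 V or 1.65 V.
THREAD_VCORE_BY_MODEL = {
    6: ("Palomino", (1.75,)),
    8: ("Thoroughbred", (1.60, 1.65)),
    10: ("Barton", (1.60, 1.65)),
}


def cpu_models(cpuinfo_text):
    """Return one (cpu family, model) pair per processor in /proc/cpuinfo."""
    cpus = []
    family = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key = key.strip()
        if key == "cpu family":
            family = int(value)
        elif key == "model":  # exact match, so "model name" is skipped
            cpus.append((family, int(value)))
    return cpus


# Hypothetical two-processor excerpt in the x86 /proc/cpuinfo layout.
sample = (
    "processor\t: 0\n"
    "cpu family\t: 6\n"
    "model\t\t: 6\n"
    "model name\t: AMD Athlon(tm) MP 2000+\n"
    "\n"
    "processor\t: 1\n"
    "cpu family\t: 6\n"
    "model\t\t: 6\n"
    "model name\t: AMD Athlon(tm) MP 2000+\n"
)

for family, model in cpu_models(sample):
    core, volts = THREAD_VCORE_BY_MODEL.get(model, ("unknown", ()))
    print(f"family {family}, model {model}: {core}, nominal VCore {volts}")
```

On the real machine the input would be `open("/proc/cpuinfo").read()`.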

Do you have any recommendations for good cooling systems? I'm not really familiar with what will fit on an SMP board.

----------

## wrc1944

I don't know much about dual cpu boards, but if you do a google search for:

Tyan thunder K7 S2462 recommended heatsink

You'll get lots of info- I downloaded the manual for your tyan board, and it appears most AMD approved socket A heatsinks should fit OK. I'd suggest looking at some websites discussing this board and see what other users have found acceptable for their heatsinks. However,  I would think the actual model of heatsink/fan that the board came with would be at least adequate, even if it's probably not the best available.

But first,  assuming the heatsink fans (and other case fans) are working OK, I'd really look at the cpu/heatsink contact situation, and if the thermal material or pads are dried out, or missing, etc., reapply some good stuff and that should make a big difference in your temps. It might be as simple as that- I've seen this before, and it does happen, especially on older rigs. Be really careful taking the heatsinks off- it can be tricky and dangerous to the cpu or board if you don't have experience doing it. (or even if you do).  More recent boards have gotten much better about this, but older ones are sometimes tough.

BTW, I looked at a few websites, and there seem to be complaints about the stability of the earlier versions of this board, which Tyan addressed in later revisions. But your problem seems to be heat, so that should be eliminated as a cause first.

----------

## milothurston

Thanks. I've changed heatsinks and CPUs before, so it should be fine, and still have plenty of thermal materials. I'll see what I can come up with.

I wonder if this is an early version of the board (it's a little over 3 years old), as I've had a few problems with it before (usually disk failures or data corruption). The age of this machine means it's no longer under warranty and will probably have to be retired from front-line service shortly anyway, which will give me more time to find a suitable solution to the cooling problem.

----------

