# System unstable. [closed]

## grooveman

I'm having difficulty with my new server.

It seems that it likes to crash a lot when emerging packages.  I don't have anything more than a base system on it right now -- so it is far from overloaded.

It is a dual quad-core AMD Opteron (Barcelona) system with 8GB of registered ECC RAM.

I have used memtest86 on the ram, and it passes.  I set the default config in the bios, and tried a few tweaks myself -- doesn't seem to make a difference.

At first, I thought this was a temperature issue, but now I know for a fact that it is not.  I have been watching the temps like a hawk as it compiles, and neither CPU has gone above 40 degrees centigrade.

I am very much within the stable branch of the portage tree -- I'm not passing any controversial CFLAGS or LDFLAGS, as you can see in my make.conf:

```
CFLAGS="-march=k8 -O2 -pipe"
CXXFLAGS="${CFLAGS}"
CHOST="x86_64-pc-linux-gnu"
MAKEOPTS="-j9"
USE="hvm fbcon sdl vim"
FEATURES="ccache buildpkg collision-protect distlocks"
CCACHE_SIZE="2G"
CCACHE_DIR="/var/tmp/portage/ccache"
GENTOO_MIRRORS="http://mirror.datapipe.net/gentoo ftp://ftp.ndlug.nd.edu/pub/gentoo/ http://adelie.polymtl.ca/"
```

The only "risky" thing I did was to put / on LVM2 RAID 5... but I don't see why this would be a problem...

I added ccache just to take some load off the CPU for heat purposes -- not because this system needs any help  ;)

This is what the system was doing on its last crash:

```
>>> Source compiled.
>>> Test phase [not enabled]: sys-process/procps-3.2.7
>>> Install procps-3.2.7 into /var/tmp/portage/sys-process/procps-3.2.7/image/ category sys-process
install -D --owner 0 --group 0 --mode a=rx uptime /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/uptime
install -D --owner 0 --group 0 --mode a=rx tload /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/tload
install -D --owner 0 --group 0 --mode a=rx free /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/free
install -D --owner 0 --group 0 --mode a=rx w /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/w
install -D --owner 0 --group 0 --mode a=rx top /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/top
install -D --owner 0 --group 0 --mode a=rx vmstat /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/vmstat
install -D --owner 0 --group 0 --mode a=rx watch /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/watch
install -D --owner 0 --group 0 --mode a=rx skill /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/skill
install -D --owner 0 --group 0 --mode a=rx snice /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/snice
install -D --owner 0 --group 0 --mode a=rx kill /var/tmp/portage/sys-process/procps-3.2.7/image//bin/kill
install -D --owner 0 --group 0 --mode a=rx sysctl /var/tmp/portage/sys-process/procps-3.2.7/image//sbin/sysctl
install -D --owner 0 --group 0 --mode a=rx pmap /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/pmap
install -D --owner 0 --group 0 --mode a=rx pgrep /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/pgrep
install -D --owner 0 --group 0 --mode a=rx pkill /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/pkill
install -D --owner 0 --group 0 --mode a=rx slabtop /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/slabtop
install -D --owner 0 --group 0 --mode a=rx pwdx /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/pwdx
install -D --owner 0 --group 0 --mode a=r uptime.1 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man1/uptime.1
install -D --owner 0 --group 0 --mode a=r tload.1 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man1/tload.1
install -D --owner 0 --group 0 --mode a=r free.1 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man1/free.1
install -D --owner 0 --group 0 --mode a=r w.1 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man1/w.1
install -D --owner 0 --group 0 --mode a=r top.1 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man1/top.1
install -D --owner 0 --group 0 --mode a=r watch.1 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man1/watch.1
install -D --owner 0 --group 0 --mode a=r skill.1 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man1/skill.1
install -D --owner 0 --group 0 --mode a=r kill.1 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man1/kill.1
install -D --owner 0 --group 0 --mode a=r snice.1 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man1/snice.1
install -D --owner 0 --group 0 --mode a=r pgrep.1 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man1/pgrep.1
install -D --owner 0 --group 0 --mode a=r pkill.1 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man1/pkill.1
install -D --owner 0 --group 0 --mode a=r pmap.1 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man1/pmap.1
install -D --owner 0 --group 0 --mode a=r sysctl.conf.5 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man5/sysctl.conf.5
install -D --owner 0 --group 0 --mode a=r vmstat.8 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man8/vmstat.8
install -D --owner 0 --group 0 --mode a=r sysctl.8 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man8/sysctl.8
install -D --owner 0 --group 0 --mode a=r slabtop.1 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man1/slabtop.1
install -D --owner 0 --group 0 --mode a=r pwdx.1 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man1/pwdx.1
install -D --owner 0 --group 0 --mode a=rx proc/libproc-3.2.7.so /var/tmp/portage/sys-process/procps-3.2.7/image//lib64/libproc-3.2.7.so
install -D --owner 0 --group 0 --mode a=rx ps/ps /var/tmp/portage/sys-process/procps-3.2.7/image//bin/ps
install -D --owner 0 --group 0 --mode a=r ps/ps.1 /var/tmp/portage/sys-process/procps-3.2.7/image//usr/share/man/man1/ps.1
true
rm -f /var/tmp/portage/sys-process/procps-3.2.7/image//var/catman/cat1/ps.1.gz /var/tmp/portage/sys-process/procps-3.2.7/image//var/man/cat1/ps.1.gz
cd /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/ && ln -sf skill snice
cd /var/tmp/portage/sys-process/procps-3.2.7/image//usr/bin/ && ln -sf pgrep pkill
>>> Completed installing procps-3.2.7 into /var/tmp/portage/sys-process/procps-3.2.7/image/
ecompressdir: bzip2 -9 usr/share/man
strip: x86_64-pc-linux-gnu-strip --strip-unneeded -R .comment
   usr/bin/pwdx
   usr/bin/watch
   usr/bin/vmstat
   usr/bin/w
   usr/bin/top
   usr/bin/pmap
   usr/bin/uptime
   usr/bin/slabtop
   usr/bin/free
   usr/bin/tload
   usr/bin/pgrep
   usr/bin/skill
   lib64/libproc-3.2.7.so
   sbin/sysctl
   bin/kill
   bin/ps
./
./usr/
./usr/share/
./usr/share/man/
./usr/share/man/man1/
./usr/share/man/man1/tload.1.bz2
./usr/share/man/man1/free.1.bz2
./usr/share/man/man1/uptime.1.bz2
./usr/share/man/man1/pgrep.1.bz2
./usr/share/man/man1/pkill.1.bz2
./usr/share/man/man1/slabtop.1.bz2
./usr/share/man/man1/top.1.bz2
./usr/share/man/man1/ps.1.bz2
./usr/share/man/man1/skill.1.bz2
./usr/share/man/man1/w.1.bz2
./usr/share/man/man1/snice.1.bz2
./usr/share/man/man1/kill.1.bz2
./usr/share/man/man1/pwdx.1.bz2
./usr/share/man/man1/watch.1.bz2
./usr/share/man/man1/pmap.1.bz2
./usr/share/man/man8/
./usr/share/man/man8/vmstat.8.bz2
./usr/share/man/man8/sysctl.8.bz2
./usr/share/man/man5/
./usr/share/man/man5/sysctl.conf.5.bz2
./usr/share/doc/
./usr/share/doc/procps-3.2.7/
./usr/share/doc/procps-3.2.7/NEWS.bz2
./usr/share/doc/procps-3.2.7/BUGS.bz2
./usr/share/doc/procps-3.2.7/HACKING.bz2
./usr/share/doc/procps-3.2.7/sysctl.conf.bz2
./usr/share/doc/procps-3.2.7/TODO.bz2
./usr/include/
./usr/include/proc/
./usr/include/proc/pwcache.h
./usr/include/proc/procps.h
./usr/include/proc/slab.h
./usr/include/proc/version.h
./usr/include/proc/whattime.h
./usr/include/proc/wchan.h
./usr/include/proc/devname.h
./usr/include/proc/alloc.h
./usr/include/proc/sig.h
./usr/include/proc/readproc.h
./usr/include/proc/sysinfo.h
./usr/include/proc/escape.h
./usr/bin/
./usr/bin/pkill
./usr/bin/pwdx
./usr/bin/watch
./usr/bin/vmstat
./usr/bin/w
./usr/bin/top
./usr/bin/pmap
./usr/bin/uptime
./usr/bin/snice
./usr/bin/slabtop
./usr/bin/free
./usr/bin/tload
./usr/bin/pgrep
./usr/bin/skill
./lib64/
./lib64/libproc-3.2.7.so
./lib64/libproc.so
./sbin/
./sbin/sysctl
./bin/
./bin/kill
./bin/ps
>>> Done.
* checking 54 files for package collisions
>>> Merging sys-process/procps-3.2.7 to /
```

Though it has crashed in the middle of compilation as well (in other instances).

The CPU temp at the time of the crash was 40 degrees for CPU1 and 39 degrees for CPU2, with comparable system temps.  According to AMD, these processors do not become unstable until they are in the 50-70 degree range.

There is nothing in /var/log/messages that looks suspicious.

I'm locked out via ssh, but when I check the console, I find this on the screen:

```
INIT: Version 2.86 reloading
```

Which is not normal....

When I get to the console, I can sometimes log in, but I cannot do much.  Typically, I cannot even run a ps without the system hanging.  I tried to copy the last few lines of /var/log/messages (to post here), but the file mysteriously disappears.  I can't shut down gracefully; it just hangs, so I have to do a hard power-off.  When I reboot, the new file is gone.

Again, I'm not pushing this system at all, or doing anything risky.  I'm solidly in stable-land.  My only deviation is that Root is on RAID5 (LVM2), which I don't see as significant... I would like to think that LVM2 is more robust than this...

So... that leaves two options, in my opinion: 1) the kernel (bad config or bugged), or 2) the hardware.

The hardware seems good -- I have had it up and running for very long periods (days) with the LiveCD.

To keep this post from getting too bloated, rather than post my .config, I link it here.  I am using gentoo-sources-2.6.22 (current stable).

This is a production machine, so it is very important that I get this right.  (Right = rock-solid).

I appreciate the help.  :)

G

----------

## BitJam

I would first try to determine if it is a hardware problem or a software problem.

If you try re-emerging procps, does it fail in the exact same place?  Identical failures indicate a software problem, while random failures indicate hardware.  You could also try doing the same emerge after booting off a LiveCD and then doing the chroot.  If things emerge fine from the LiveCD, then you've narrowed the problem down to your kernel.

If it is relatively easy to stick yet another drive into your box you could try doing an install on it to test if the RAID is causing your problem.  A re-install is probably a bit safer than a cp -a from your existing system (since your existing system might be borked) and since this is a server, you don't have to re-compile X or a desktop manager.

Like you, I'm suspecting hardware.  After heat, the most common hardware problem is the power supply, especially with multiple fast hard drives.  Gentoo installs certainly exercise your file system, so a power supply problem would explain why you are getting failures with RAID while emerging.  Ironically, if the problem is the power supply plus RAID, then ccache could make things worse, not better.

Another simple way to slow down your emerges is to use:

```
MAKEOPTS="-j1"
```

This might actually put less stress on your file system as well.  But if this makes things better you still want to track down the actual problem.

----------

## grooveman

Well... it has nothing to do with procps -- that was just an example, the last time it happened.  It bombs on any package; there is no rhyme or reason to it.  I recompiled procps and about 16 other packages after reboot, and it happened not to crash.  The crash before that one was GCC.  The crash before that was something else...

I guess I should take it down to one, non-raided drive and see what happens (just for trouble-shooting purposes).

I should have plenty of power.  I talked with tech support at Tyan, and they agreed that 660 watts should be plenty for that box...  I'm only running 3 drives.  But if problems continue, that will be my next step...

I guess I was hoping that someone would review my .config and say: "Hey!  That doesn't belong in there.  Get rid of that, and all will be well!"...       :(

I've already spent a great deal of time on this one...

----------

## BitJam

I still suspect hardware.  It could well be that a Gentoo install stresses the hardware more than Tyan's internal stress tests.  For example, no matter how beefy the actual power supply is, there could be a problem on the motherboard that lets large transients in power draw cause glitches.

If you suspect your kernel config, maybe you could try using the kernel from the LiveCD (I don't know if it supports LVM2 RAID though).

Another longshot you could try is to disable the SMP support in your kernel so only one CPU gets used.
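
If rebuilding the kernel without SMP is a hassle, roughly the same test can usually be done from the boot loader: the `maxcpus=1` kernel parameter brings up only one CPU.  A grub-legacy sketch -- the kernel image name and `root=` value here are placeholders for whatever your grub.conf already uses:

```
# /boot/grub/grub.conf -- append maxcpus=1 to the existing kernel line
title Gentoo (single-CPU test)
root (hd0,0)
kernel /boot/vmlinuz-2.6.22-gentoo root=/dev/sda3 maxcpus=1
```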

It is a real shame that you can't get it to fail reliably.  Consistent failures would make it much easier to track down the problem.

----------

## grooveman

The only thing that keeps me from being a firm believer in the hardware hypothesis is that it doesn't hard-lock... it is only a soft lock.  If it sits in a soft-locked state long enough, it will turn into a hard lock.  Reminds me of systems running programs with bad memory leaks...

I haven't ruled-out hardware yet, mind you, but I still am looking toward the kernel...

I have now remerged the world with the Gentoo LiveCD kernel twice (145 packages), and it hasn't locked yet.  This seems to support the notion that it is the kernel...

However, I have gone through my config pretty rigorously, and I don't see anything out of line.  The Gentoo kernel uses flat memory, without NUMA.  I am going to switch that off and see how it goes.  I will scale my config back to a veritable clone of the Gentoo LiveCD config, then turn things back on one by one.

It would really bug me if NUMA support were making it unstable... as this is the primary benefit of Opteron multiprocessors...  But if that were the case, maybe a bug report would need to be filed...

----------

## grooveman

I am still testing this...

I did a low-level format (a la dd from /dev/zero) on each of my drives.  This took some time, but I found that one drive is a bit slower than the others...

```
sda: 12383.5 seconds at 20.2 MB/sec
sdb: 12361.7 seconds at 20.2 MB/sec
sdc: 13713.4 seconds at 18.2 MB/sec
```

sdc was 22 minutes slower than the others...  Do you think that drive being slower might be the problem?
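
For reference, those throughputs can be recomputed from the elapsed times.  The sketch below assumes the drives hold 262144000000 bytes (~250 GB), which is consistent with the figures above:

```shell
# Recompute MB/sec from bytes written and elapsed seconds, as reported
# by a timed wipe like: time dd if=/dev/zero of=/dev/sdX bs=1M
# (dd against a raw disk is DESTRUCTIVE -- only for drives being erased)
mb_per_sec() {  # $1: bytes written, $2: elapsed seconds
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1048576 }'
}

# Assuming ~250 GB per drive, the posted times work out to:
mb_per_sec 262144000000 12383.5   # sda
mb_per_sec 262144000000 13713.4   # sdc
```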

----------------------

I took RAID out of the question, and I compiled my system with the same kernel config that is used on the boot cd.

First time I rebooted, it failed -- kernel panic.  I had a hunch what it was, and I was right: the SATA and SCSI drivers were modularized.  I booted again to the LiveCD, chrooted, and went to configure/compile the kernel.  It failed.  I ran make clean and make mrproper, then tried again.  It failed.  I had to clear out the whole kernel tree and re-emerge gentoo-sources.  I tried again, and it worked.  That made me a bit nervous...
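
A sanity check like the one below could catch that kind of mistake before rebooting -- grep the new .config for the options the boot disk needs built in.  The CONFIG_ names in the example are assumptions; the exact ones depend on your controller:

```shell
# Check that the named kernel options are built in (=y) rather than
# modular (=m) -- without an initramfs, the kernel cannot mount / using
# a driver that lives on the filesystem it has not mounted yet.
check_builtin() {  # $1: path to .config; remaining args: CONFIG_ names
    local cfg=$1 opt rc=0
    shift
    for opt in "$@"; do
        if grep -q "^${opt}=y$" "$cfg"; then
            echo "$opt: built in"
        else
            echo "$opt: NOT built in"
            rc=1
        fi
    done
    return $rc
}

# e.g. check_builtin /usr/src/linux/.config CONFIG_SCSI CONFIG_BLK_DEV_SD
```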

I exited the chroot, unmounted the drive, and ran fsck -- and it did find errors (even though the drive had been cleanly unmounted last time).

This happened twice.

Now it is up, and I have a base system on one drive, with the LiveCD kernel.  I'm remerging the world (121 packages) three times consecutively.  We are 3/4 of the way into the second round...  We'll see.

----------

## BitJam

 *grooveman wrote:*   

> sdc was 22 minutes slower than the others...  Do you think that that drive being slower might be the problem?

 

It might be a symptom of the problem.  Errors with the drive could be slowing it down. If that drive is new, I would replace it.

 *Quote:*   

> I took RAID out of the question, and I compiled my system with the same kernel config that is used on the boot cd.

 

Was this a fresh install or did you copy the system from the RAID?

 *Quote:*   

> I exited the chroot, unmounted the drive and did an fsck, and did find errors (even though it was cleanly unmounted from last time).
> 
> This happened twice.

 

Big ouch.  Everything is pointing to disk problems.   Did you check the system log after any of this?  There may have been some bread crumbs left there.

It's puzzling that you ran into these problems after starting over with a single drive.  But puzzling is good because it tends to narrow down the possible problems.  If you can get it to consistently fail, that is almost as good as getting it all to work because it makes it easier to track down the failing component.

----------

## grooveman

 *BitJam wrote:*   

> Quote:
> 
>  *Grooveman wrote:*   I took RAID out of the question, and I compiled my system with the same kernel config that is used on the boot cd. 
> 
> Was this a fresh install or did you copy the system from the RAID? 

 

A fresh install.

 *BitJam wrote:*   

> 
> 
>  *Grooveman wrote:*   I exited the chroot, unmounted the drive and did an fsck, and did find errors (even though it was cleanly unmounted from last time).
> 
> This happened twice. 
> ...

 

The problem is that there is no error output in /var/log/messages, and nothing on the screen in terms of hard drive errors.  The low-level format has not indicated any problems either -- which seems to imply the disks are physically good...

----------

I recompiled the world last night 4 times with no problems using the gentoo livecd kernel.

I have just recompiled the kernel using my config (from when I was running RAID -- but I am still on a single drive now).  I have started three "emerge -et world" runs... we will see what happens.  Unlike the Gentoo config, mine actually tries to utilize NUMA.  But everything is set to the most stable settings I know of.

We'll see what happens...

----------

## grooveman

OKAY!  I finally got an error.

Second time through emerge -et world with my (NUMA) kernel, and I got the crash.

BUT -- this time I was able to salvage some error output.  There is a definite segfault, and instances where it shows init trying to restart.  The strange thing, however, is that some of these problems were manifesting before I switched kernels.  I don't know what this means... Anyone have any ideas?

From /var/log/messages:

```
Nov 28 01:33:27 mybox sshd[29177]: Accepted keyboard-interactive/pam for root from 72.245.233.173 port 45372 ssh2
Nov 28 01:33:27 mybox sshd(pam_unix)[29184]: session opened for user root by root(uid=0)
Nov 28 01:40:01 mybox cron[22600]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 01:41:45 mybox init: Trying to re-exec init
Nov 28 01:50:01 mybox cron[20792]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 01:52:08 mybox conftest[21977]: segfault at 0000000000502000 rip 00002b3eed65d7df rsp 00007fffbd6d8590 error 6
Nov 28 02:00:01 mybox cron[30272]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 02:00:01 mybox cron[30274]: (root) CMD (rm -f /var/spool/cron/lastrun/cron.hourly)
Nov 28 02:10:01 mybox cron[29435]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
```

```
Nov 28 04:20:01 mybox cron[30492]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 04:22:18 mybox conftest[654]: segfault at 0000000000603000 rip 00002adf4591602b rsp 00007fff65418718 error 6
Nov 28 04:30:01 mybox cron[15455]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 04:31:56 mybox init: Trying to re-exec init
Nov 28 04:40:01 mybox cron[19298]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 04:50:01 mybox cron[3544]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 05:00:01 mybox cron[26861]: (root)
```

```
Nov 28 06:20:01 mybox cron[31432]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 06:26:56 mybox init: Trying to re-exec init
Nov 28 06:30:01 mybox cron[1940]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 06:37:42 mybox conftest[9483]: segfault at 0000000000603000 rip 00002b2050c5c02b rsp 00007fff5a2d95e8 error 6
Nov 28 06:40:01 mybox cron[9072]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 06:47:12 mybox init: Trying to re-exec init
Nov 28 06:50:01 mybox cron[22939]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 07:00:01 mybox cron[17312]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 07:00:01 mybox cron[17314]: (root) CMD (rm -f /var/spool/cron/lastrun/cron.hourly)
Nov 28 07:10:01 mybox cron[13853]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
```

```
Nov 28 08:40:01 mybox cron[27618]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 08:41:02 mybox init: Trying to re-exec init
Nov 28 08:49:15 mybox conftest[17919]: segfault at 0000000000603000 rip 00002aff400f802b rsp 00007fff6ae3e148 error 6
Nov 28 08:50:01 mybox cron[31080]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 08:58:45 mybox init: Trying to re-exec init
Nov 28 09:00:01 mybox cron[18385]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 09:00:01 mybox cron[18387]: (root) CMD (rm -f /var/spool/cron/lastrun/cron.hourly)
Nov 28 09:10:01 mybox cron[23947]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 09:20:01 mybox cron[26074]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
```

```
Nov 28 10:52:56 mybox init: Trying to re-exec init
Nov 28 10:56:01 mybox cron[4555]: (*system*) RELOAD (/etc/crontab)
Nov 28 11:00:01 mybox cron[27258]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 11:00:01 mybox cron[27260]: (root) CMD (rm -f /var/spool/cron/lastrun/cron.hourly)
Nov 28 11:01:13 mybox conftest[26344]: segfault at 0000000000603000 rip 00002b7dff79602b rsp 00007fffab7a1aa8 error 6
Nov 28 11:10:01 mybox cron[24851]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 28 11:10:47 mybox init: Trying to re-exec init
Nov 28 11:20:01 mybox cron[24553]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
```

```
Nov 29 00:23:14 mybox login[6362]: ROOT LOGIN  on 'tty1'
Nov 29 00:26:54 mybox conftest[31284]: segfault at 0000000000603000 rip 00002b733108502b rsp 00007fff79eb2d78 error 6
Nov 29 00:28:16 mybox sshd[25139]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=h-72-245-233-173.sfldmidn.covad.net  user=root
Nov 29 00:28:18 mybox sshd[25118]: error: PAM: Authentication failure for root from h-72-245-233-173.sfldmidn.covad.net
Nov 29 00:28:19 mybox sshd[25118]: Accepted keyboard-interactive/pam for root from 72.245.233.173 port 39674 ssh2
Nov 29 00:28:19 mybox sshd[26447]: pam_unix(sshd:session): session opened for user root by root(uid=0)
Nov 29 00:30:01 mybox cron[12116]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 29 00:36:34 mybox init: Trying to re-exec init
```

```
Trying to re-exec init
Nov 29 02:37:01 mybox cron[5221]: (*system*) RELOAD (/etc/crontab)
Nov 29 02:40:01 mybox cron[8216]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 29 02:42:37 mybox conftest[15652]: segfault at 0000000000603000 rip 00002aea2cae002b rsp 00007fff7e455318 error 6
Nov 29 02:50:01 mybox cron[26757]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Nov 29 02:52:13 mybox init: Trying to re-exec init
Nov 29 03:16:56 mybox syslog-ng[5498]: syslog-ng version 1.6.11 starting
Nov 29 03:16:56 mybox syslog-ng[5498]: Changing permissions on special file /dev/tty12
```

I'm dyin' here...

Don't know if this is significant either:

```
Nov 29 03:16:56 mybox pnp: 00:08: iomem range 0xfea80000-0xfeabffff has been reserved
Nov 29 03:16:56 mybox pnp: 00:08: iomem range 0xfee01000-0xfeefffff has been reserved
Nov 29 03:16:56 mybox pnp: 00:0a: ioport range 0xca0-0xcaf has been reserved
Nov 29 03:16:56 mybox pnp: 00:0a: iomem range 0xfec00000-0xfec00fff could not be reserved
Nov 29 03:16:56 mybox pnp: 00:0a: iomem range 0xfee00000-0xfee00fff could not be reserved
Nov 29 03:16:56 mybox pnp: 00:0d: ioport range 0xa00-0xa7f has been reserved
Nov 29 03:16:56 mybox pnp: 00:0e: iomem range 0xe0000000-0xefffffff could not be reserved
Nov 29 03:16:56 mybox pnp: 00:0f: iomem range 0x0-0x9ffff could not be reserved
Nov 29 03:16:56 mybox pnp: 00:0f: iomem range 0xc0000-0xcffff has been reserved
Nov 29 03:16:56 mybox pnp: 00:0f: iomem range 0xe0000-0xfffff could not be reserved
Nov 29 03:16:56 mybox i2c-adapter i2c-1: unable to read EDID block.
Nov 29 03:16:56 mybox i2c-adapter i2c-1: unable to read EDID block.
Nov 29 03:16:56 mybox i2c-adapter i2c-1: unable to read EDID block.
Nov 29 03:16:56 mybox i2c-adapter i2c-3: unable to read EDID block.
Nov 29 03:16:56 mybox i2c-adapter i2c-3: unable to read EDID block.
Nov 29 03:16:56 mybox i2c-adapter i2c-3: unable to read EDID block.
```


Thanks.

G

----------

## BitJam

Google("trying to re-exec init") says that other people have had this error message in their logs.  One person tracked it down to a cron job that did prelinking.  There seems to be some consistency between the times cron starts and the times you get these messages, so look at your cron jobs (in /etc/cron.*).

Perhaps you could change daily and weekly jobs to hourly or faster in an attempt to make the errors happen sooner.  The faster you can make it fail, the faster you will be able to track down the source of the problem.  I would suggest having one shell loop run the cron jobs over and over while another loop is emerging world.
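
A sketch of such a loop -- bounded rather than `while true` so it stops at the first failure, with `/usr/sbin/run-crons` being the standard Gentoo path:

```shell
# Hammer a command repeatedly in one shell, stopping at the first
# failure, while `emerge -e world` runs in another shell.
run_n() {  # $1: iterations; remaining args: command to repeat
    local n=$1 i
    shift
    for ((i = 1; i <= n; i++)); do
        "$@" || { echo "failed on iteration $i"; return 1; }
    done
    echo "completed $n iterations"
}

# e.g. run_n 1000 /usr/sbin/run-crons
```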

----------

## grooveman

Yes, I saw that...

I don't believe that is the case here -- I don't have anything installed in cron except what comes with baselayout and vixie-cron itself.

To rule it out, however, I have shut off the cron daemon...  I will keep compiling, and we will see what happens...

----------

## BitJam

If you reduce the frequency of failures before you have isolated the cause then you are making your situation worse, not better.

I was pretty sure that you hadn't changed any of the cron jobs when I suggested you run them continuously while emerging, so I didn't think they were the cause of your problem.  But it does appear that running your cron jobs may trigger the problem, especially if they are run while you are emerging.

It's still not clear if your problem is caused by hardware or software.   One thing is clear: doing normal things (such as building the kernel) can trigger your problem.  If you can afford the restocking fee you could order a 2nd identical computer and see if it has the same problems.  That could confirm a hardware problem (I don't think it would rule out a hardware problem since it is possible that there is a problem with your platform that is only seen under Linux/Gentoo).

Otherwise, I suggest you try to increase the frequency of failures since the time it will take to find the problem will be proportional to the MTBF.

----------

## grooveman

I see what you are saying... I misunderstood.  In any event, I think it is important to reduce confounding variables here and see if it is still replicable.  I need to prove this beyond a shadow of a doubt, narrow it down to the component, then make a phone call to the hardware vendor(s).

Unfortunately, we are beyond any restocking fees at this point...  But if I can narrow the problem and prove it is the hardware, I'm still well within the warranty.

In any case, with cron off, I still had the problem.  I cleared all the crud from the logs, and have the last few lines from when the system crashed.

```
Nov 29 06:00:14 mybox init: Trying to re-exec init
Nov 29 06:08:24 mybox conftest[3734]: segfault at 0000000000603000 rip 00002b41fd21a02b rsp 00007fffadd1dad8 error 6
Nov 29 06:17:59 mybox init: Trying to re-exec init
```

I cannot help but notice that each time it has a segfault, it involves conftest.  I feel that this is an important clue, but I'm in over my head on this... I'm no kernel god or C programmer.

Why conftest?  Why is the segfault always the same?  It looks like conftest is part of autoconf...  Is that part of binutils or gcc?  Maybe I should try a different version?

I feel that at this point heat, disk drives, CPUs, and RAM have been ruled out...  Where do I go from here?

----------

## BitJam

There is no conftest on my system. It appears that a step in auto-configuration compiles conftest.c and eventually runs it.  For example,  this page mentions: 

```
configure:2386: checking for suffix of executables
configure:2388: gcc -o conftest    conftest.c  >&5
configure:2391: $? = 0
configure:2413: result:
```

I suggest you try to track down which package you are getting the failure on and what test it was running when it failed.  Check your emerge logs or have it email you progress reports.  Your best option might be to inspect /var/tmp/portage/*/* after a failure.  You may find the remains of failed emerges there, maybe even the errant conftest.
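
A quick sweep for those leftovers might look like the sketch below; the default path is the stock PORTAGE_TMPDIR, so adjust it if yours differs:

```shell
# List leftover conftest files from failed configure runs under the
# portage build area, newest first so the latest failure is on top.
find_conftest() {  # $1: build dir (default /var/tmp/portage)
    find "${1:-/var/tmp/portage}" -type f -name 'conftest*' \
        -printf '%T@ %p\n' 2>/dev/null | sort -rn | cut -d' ' -f2-
}
```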

Edit:  It's still not clear if this is hardware or software.  Failing at the same spot indicates software is involved but since it doesn't consistently fail, it seems it may be hardware or interrupt related as well.  It seems certain things can sometimes trigger the failure even if they are not the cause.

----------

## grooveman

It's not the same package -- it differs every time.  Yes, conftest is part of autoconf; looks like sys-devel/autoconf...

I agree, it still smells like hardware...  I have opened a ticket with Tyan, so, I will jump through their hoops and see what they say (that should kill a few more days...)

Thanks, bit.

----------

## BitJam

While you are waiting for hoops to jump through:

The code that generates the segfault error message is in /usr/src/linux/arch/x86_64/mm/fault.c 

Google(segfault error codes) says the error code you get indicates a userspace write (big whoop).
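
The error code in those log lines is a bit field, so "error 6" can be decoded directly.  A sketch using the x86_64 page-fault bits (bit 0: protection violation vs. page not present, bit 1: write, bit 2: user mode):

```shell
# Decode the page-fault error code from a "segfault at ... error N" line.
decode_pf() {  # $1: error code from the log
    local c=$1 out
    (( c & 4 )) && out="user-mode" || out="kernel-mode"
    (( c & 2 )) && out="$out write" || out="$out read"
    (( c & 1 )) && out="$out, protection violation" || out="$out, page not present"
    echo "$out"
}

decode_pf 6   # the code in all of the segfault lines above
```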

Interestingly there is this comment right above the segfault error message code:

```
      /* Work around K8 erratum #100 K8 in compat mode
         occasionally jumps to illegal addresses >4GB.  We
         catch this here in the page fault handler because
         these addresses are not reachable. Just detect this
         case and return.  Any code segment in LDT is
         compatibility mode. */
```

This suggests another thing you can try: reduce your RAM to 4G or less.
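
For what it's worth, pulling DIMMs shouldn't be strictly necessary for that test: the `mem=` kernel parameter caps how much RAM the kernel will use.  A grub-legacy sketch -- the kernel image name and `root=` value are placeholders:

```
# /boot/grub/grub.conf -- cap usable RAM at 4G for the erratum test
kernel /boot/vmlinuz-2.6.22-gentoo root=/dev/sda3 mem=4G
```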

If you can get your hands on one of the conftest executables that is failing, we might be able to use gdb to figure out which instruction it is failing at since the address in the error message is constant.

----------

## BitJam

Regardless of the results of the reduced RAM test, I think your next step should be to post a carefully crafted message to the Linux Kernel Mailing List.  Those guys are usually keen on making sure Linux is tweaked to run on all of the latest and greatest hardware.  I think your best case scenario involves a kernel patch that fixes the problem.

The best help from your hardware vendor would be for them to try to replicate your problem on similar hardware they have in-house.  If they're not willing to do this, and if the LKML brings you no joy, then your third-best option is what I suggested earlier: ordering a 2nd similar system and trying to replicate the problem on it.  If the 2nd system has the same problem, then return the 2nd system (perhaps with the slower disk drive).  The downside, of course, is that this is a lot of work for you, and you will have to pay for the shipping both ways and the restocking fee.

----------

## grooveman

Good advice, bit.

As an update, it appears that it is gcc-4.1.2 that is consistently coming up with this segfault.

I have also opened a ticket with Tyan.  They wanted me to test one proc at a time.  I have done this with CPU-a, and I am finishing up with CPU-b.  I am running "emerge -et world" six consecutive times to try to crash it.  It has proven stable with one processor plugged in -- either one -- but it has proven unstable when BOTH are plugged in.  The segfault has a high probability of crashing the system if both CPUs are plugged in, but the system seems tolerant if only one is in (the segfault still happens, but the system shrugs it off).

I have informed the appropriate people that maintain autoconf.  They seem unconcerned with my plight so far.  We'll see if I get any more out of them.

I will go to the kernel maintainers, as you suggest, if the problem persists and the hardware is confirmed to be good.  Like you also said, I am hoping that the good people at Tyan will be able to replicate this (I hope they are even willing), though I expect they will want to send me a new board first...  I am thinking that they did not stress-test the new BIOS version well enough with these CPUs.  They probably installed Windows 2003 Server and pronounced it done...

----------

## BitJam

I didn't investigate to see if it is possible, but I wonder if your problem is related to this problem with quad-core Opterons.

----------

## grooveman

Geez... that's just great...

I bet that is it.  I await my new mobo, but I am very guarded regarding the outcome...

If this winds up being the problem, AMD just lost an extremely loyal customer..

----------

## BitJam

The Slashdot discussion mentions a binary-only Red Hat kernel patch that fixes this problem with only a 1% decrease in efficiency.  You might be able to purchase Red Hat just to get their kernel with this patch (if you prefer Gentoo over Red Hat).  Maybe they'll even let you "borrow" a kernel to see if it fixes this problem, or maybe your hardware vendor will spring for it.

----------

## eagle_cz

greetings guys,

I have run into a similar problem.  It totally porked my installation twice.  But later on it started to work, and I ended up with just one segfault during emerge world -e.

Absolutely the same scenario.

I also noticed that the less optimization I use (-mtune instead of -march), the fewer segfaults I get.

The fewer CPUs I have, the fewer segfaults I get.

Did somebody manage to find some patches related to this issue?

I have two exactly identical servers here, and both have exactly the same problems.  :Smile:
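For anyone trying the same mitigation, the -mtune variant eagle_cz mentions would look like this in make.conf (a sketch: unlike -march=k8, -mtune=k8 only tunes instruction scheduling for K8 and keeps the generic x86-64 instruction set; the MAKEOPTS reduction is a separate assumption following the "fewer CPUs, fewer segfaults" observation):

```
# make.conf -- conservative CFLAGS: tune for K8 without enabling
# K8-specific instructions (-march=k8 would do both)
CFLAGS="-mtune=k8 -O2 -pipe"
CXXFLAGS="${CFLAGS}"
# fewer parallel jobs may also reduce exposure to the multi-CPU segfaults
MAKEOPTS="-j2"
```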

----------

## grooveman

There were two patches posted on the AMD support forum.

They helped increase the stability considerably -- but the patches are not compatible with xensource.  If you are planning on using xensource, then you are SOL... I'm still banging my head against this wall..

If you are using gentoo-sources, the patches should be a very welcome thing for you.

Check their support forums.

----------

## grooveman

These chips were, in fact, afflicted with the errata.  The patch did help, but didn't work with the xensource patches, and therefore is useless to me.

I wound up buying two more dual-core 2.8GHz chips instead...

If you have the opportunity to buy Barcelona chips... I STRONGLY recommend you do NOT.  This cost me months and months of time, not to mention about $700 down the tubes...

AMD gave me two "replacement" chips (remember, there are no properly working barcelona chips out there as of this date), that were socket F FX-70.  Of course, they only gave these to me after quite a struggle, and stated that they would be shipped, and the matter would be closed.

Upon receiving the chips (FX-70s), I found they do not work with my mobo, and to make matters worse, there is only one mobo in the ENTIRE WORLD that works with these chips right now -- and it is very expensive, and not a standard ATX form factor...

Remember, I bought these chips last October... I lost 5 months because of these bozos -- playing with their patches and tweaking my kernel for days on end, all the while losing sleep and losing money on my business.  Stay away from the Barcelonas, and I would be very wary of the Phenom chips as well, as they had errata problems too...

----------

