# Weird HW problems [solved]

## viy

I have a P4 2.4 GHz Prescott with two Samsung SP0812C SATA disks:

```

livecd root # lspci
0000:00:00.0 Host bridge: Intel Corp. 82865G/PE/P DRAM Controller/Host-Hub Interface (rev 02)
0000:00:02.0 VGA compatible controller: Intel Corp. 82865G Integrated Graphics Device (rev 02)
0000:00:1e.0 PCI bridge: Intel Corp. 82801 PCI Bridge (rev c2)
0000:00:1f.0 ISA bridge: Intel Corp. 82801EB/ER (ICH5/ICH5R) LPC Bridge (rev 02)
0000:00:1f.1 IDE interface: Intel Corp. 82801EB/ER (ICH5/ICH5R) Ultra ATA 100 Storage Controller (rev 02)
0000:00:1f.2 IDE interface: Intel Corp. 82801EB (ICH5) Serial ATA 150 Storage Controller (rev 02)
0000:00:1f.3 SMBus: Intel Corp. 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
0000:01:08.0 Ethernet controller: Intel Corp. 82562EZ 10/100 Ethernet Controller (rev 02)
```

A week ago I installed a 2.6.8-r3 based system on it, and everything was OK. I have software RAID1 with LVM2 on top of it (root is on /dev/md0).

After we migrated our old server to this machine, we got three kernel panics in one day, so we had to fall back to the old server. You can find one of those panic messages here (it's a Russian forum).

Here is the "emerge --info" output:

```

livecd scripts # emerge --info
Portage 2.0.51-r2 (default-x86-2004.2, gcc-3.3.3, glibc-2.3.3.20040420-r0, 2.4.26-gentoo-r6 i686)
=================================================================
System uname: 2.4.26-gentoo-r6 i686 Intel(R) Pentium(R) 4 CPU 2.40GHz
Gentoo Base System version 1.4.16
Autoconf: 
Automake: 
Binutils: sys-devel/binutils-2.14.90.0.8-r1
Headers:  sys-kernel/linux-headers-2.4.21-r1
Libtools: 
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CFLAGS="-O3 -mcpu=pentium4 -march=pentium4 -mmmx -msse -msse2 -fomit-frame-pointer -pipe"
CHOST="i686-pc-linux-gnu"
COMPILER=""
CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3/share/config /usr/share/config /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-O3 -mcpu=pentium4 -march=pentium4 -mmmx -msse -msse2 -fomit-frame-pointer -pipe"
DISTDIR="/tmp/distfiles"
FEATURES="autoaddcvs ccache distlocks sandbox"
GENTOO_MIRRORS="http://ftp.linux.ee/pub/gentoo/distfiles/ http://ftp.du.se/pub/os/gentoo http://vlaai.snt.ipv6.utwente.nl/pub/os/linux/gentoo/"
MAKEOPTS="-j2"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY=""
SYNC="rsync://rsync.de.gentoo.org/gentoo-portage"
USE="acpi berkdb bzlib chroot crypt cups curl curlwrappers encode f77 foomaticdb gdbm gif gpm innodb ithreads jpeg kde libg++ libwww mad mikmod mmx motif mpeg mysql ncurses pam png postgresql readline samba sdl spell sse ssl threads truetype ucs2 unicode x86 xml2 zlib"
```

I guess the cause is a hardware failure, but which part: memory, CPU, or motherboard? I ran memtest86 for about 3 hours and the RAM seems to be OK. Sensors shows 25 C for both the CPU and the motherboard, and the values never change at all:

```
dumb root # sensors
it87-isa-0290
Adapter: ISA adapter
VCore 1:   +1.58 V  (min =  +0.00 V, max =  +4.08 V)
VCore 2:   +2.56 V  (min =  +0.00 V, max =  +4.08 V)
+3.3V:     +6.62 V  (min =  +0.00 V, max =  +8.16 V)
+5V:       +4.27 V  (min =  +0.00 V, max =  +6.85 V)
+12V:     +12.16 V  (min =  +0.00 V, max = +16.32 V)
-12V:     -15.95 V  (min = -27.36 V, max =  +3.93 V)
-5V:       -8.17 V  (min = -13.64 V, max =  +4.03 V)
Stdby:     +4.19 V  (min =  +0.00 V, max =  +6.85 V)
VBat:      +4.08 V
fan1:     2556 RPM  (min =    0 RPM, div = 8)
fan2:        0 RPM  (min =    0 RPM, div = 8)
fan3:        0 RPM  (min =    0 RPM, div = 8)
M/B Temp:    +25 C  (low  =  +127 C, high =  +127 C)   sensor = thermistor
CPU Temp:    +25 C  (low  =  +127 C, high =  +127 C)   sensor = thermistor
Temp3:       +25 C  (low  =  +127 C, high =  +127 C)   sensor = diode
```

There is something strange about the "Temp3" sensor: if I run "emerge -Dpuv world" on a second console, it jumps 4 to 6 degrees in 3 seconds! :Shocked:

I couldn't find the cause myself, so I decided to reinstall the whole system from scratch, using a 2.4 kernel. Everything was OK until bootstrap, where I got the following error:
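The voltage lines in the sensors output above can be sanity-checked against the nominal rail values. The following is only a minimal Python sketch: the function name `suspicious_rails` and the 10% tolerance are assumptions made for illustration, not part of lm_sensors. A +3.3V rail that really sat at 6.62 V would be unlikely to leave the machine running at all, so readings this far off may well point to wrong scaling in the sensors configuration rather than a real power fault.

```python
import re

# Nominal rail voltages; the 10% tolerance below is an illustrative assumption.
NOMINAL = {"+3.3V": 3.3, "+5V": 5.0, "+12V": 12.0, "-12V": -12.0, "-5V": -5.0}

def suspicious_rails(sensors_text, tolerance=0.10):
    """Return {label: reading} for rails off nominal by more than `tolerance`."""
    flagged = {}
    for line in sensors_text.splitlines():
        # Matches lines such as "+5V:       +4.27 V  (min = ...)".
        match = re.match(r"\s*([+-][\d.]+V):\s+([+-][\d.]+) V", line)
        if not match:
            continue
        label, reading = match.group(1), float(match.group(2))
        nominal = NOMINAL.get(label)
        if nominal is not None and abs(reading - nominal) > tolerance * abs(nominal):
            flagged[label] = reading
    return flagged

# Voltage lines copied from the sensors output posted above.
SAMPLE = """\
+3.3V:     +6.62 V  (min =  +0.00 V, max =  +8.16 V)
+5V:       +4.27 V  (min =  +0.00 V, max =  +6.85 V)
+12V:     +12.16 V  (min =  +0.00 V, max = +16.32 V)
-12V:     -15.95 V  (min = -27.36 V, max =  +3.93 V)
-5V:       -8.17 V  (min = -13.64 V, max =  +4.03 V)
"""

print(suspicious_rails(SAMPLE))
```

With the readings posted here, every rail except +12V falls outside the assumed tolerance, which suggests the numbers should not be trusted as they stand.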

```
cd ../obj_s;  gcc -DHAVE_CONFIG_H -I../ncurses -I. -I. -I../include  -D_GNU_SOURCE -DNDEBUG -O3 -mcpu=pentium4 -march=pentium4 -mmmx -msse -msse2 -fomit-frame-pointer -pipe -fPIC -fPIC -c ../ncurses/./widechar/lib_slk_wset.c
In file included from /usr/include/bits/errno.h:25,
                 from /usr/include/errno.h:36,
                 from ../ncurses/curses.priv.h:94,
                 from ../ncurses/widechar/lib_slk_wset.c:37:
/usr/include/linux/errno.h:-19019: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://bugs.gentoo.org/> for instructions.
cd ../obj_s;  gcc -DHAVE_CONFIG_H -I../ncurses -I. -I. -I../include  -D_GNU_SOURCE -DNDEBUG -O3 -mcpu=pentium4 -march=pentium4 -mmmx -msse -msse2 -fomit-frame-pointer -pipe -fPIC -fPIC -c ../ncurses/./widechar/lib_unget_wch.c
cd ../obj_s;  gcc -DHAVE_CONFIG_H -I../ncurses -I. -I. -I../include  -D_GNU_SOURCE -DNDEBUG -O3 -mcpu=pentium4 -march=pentium4 -mmmx -msse -msse2 -fomit-frame-pointer -pipe -fPIC -fPIC -c ../ncurses/./widechar/lib_vid_attr.c
The bug is not reproducible, so it is likely a hardware or OS problem.
make[1]: *** [../obj_s/lib_slk_wset.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory `/var/tmp/portage/ncurses-5.4-r5/work/ncurses-5.4/ncurses'
make: *** [all] Error 2
!!! ERROR: sys-libs/ncurses-5.4-r5 failed.
!!! Function src_compile, Line 79, Exitcode 2
!!! make failed
!!! If you need support, post the topmost build error, NOT this status message.
```

I'd like to know what this is. Should I swap this "monster" for a better one?

Thanks in advance.

Last edited by viy on Mon Nov 08, 2004 9:49 am; edited 1 time in total

----------

## lunarg

It is most likely a real hardware problem, and usually the memory is at fault. I suggest reading this:

http://bitwizard.nl/sig11/

It covers sig11 (segmentation fault) errors during compilation and what you can check to find out where the problem really is.

One more thing though: memtest86 does not always find problems with memory. You can run it for hours and not find any problems, while replacing the memory (as a precaution) solves everything. It's best to try it.

----------

## viy

Very nice FAQ, thanks a lot!

Well, I've swapped my RAM and nothing changed. Here is where I am now:

*) to be a good netizen, I decided not to "emerge sync", but rather to copy ${PORTDIR} from another machine;

*) made a tbz2 file and sftp'ed it to the weird server, which is running the livecd;

*) during tar -jxvf it complains that the archive is broken (in a different place each time I run tar).

On the originating machine, tar -jvtf is OK.

So what is it, disk problems? I have 2 x SATA running in RAID1.

I've heard that the SATA drivers in Linux aren't very good yet. Is that true?
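An archive that breaks in a different place on every run suggests that reads on that machine are not repeatable. One way to check this without tar is to hash the same file several times and compare the digests. This is a minimal Python sketch; the function names and the number of rounds are assumptions chosen for illustration.

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def reads_are_stable(path, rounds=5):
    """Hash the same file `rounds` times; differing digests mean unstable reads."""
    digests = {file_digest(path) for _ in range(rounds)}
    return len(digests) == 1
```

Comparing `file_digest` of the copy with the digest computed on the originating machine would also show whether the file was already damaged in transfer. Note that repeated reads may be served from the page cache, so differing digests here tend to implicate RAM, CPU, or chipset at least as much as the disks.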

----------

## lunarg

 *viy wrote:*   

> Very nice FAQ, thanks a lot!

 

You're welcome.  :Smile: 

 *viy wrote:*   

> So what is it, disk problems? I have 2 x SATA running in RAID1.
> 
> I've heard that the SATA drivers in Linux aren't very good yet. Is that true?

 

At work, we have configured several systems with SATA in Linux software RAID1 and never had any problems (note that they run Debian, not Gentoo; some of them run 2.4, others 2.6). Of course, it could very well be a disk-related problem, although if that were the case, you would usually see errors in syslog (/var/log/messages) or in 'dmesg'.

It might be a combined problem: e.g. incompatible RAM timings on the motherboard (CL2 vs CL2.5 vs CL3), bad sockets (check whether all contacts are absolutely clean), wrong settings in the BIOS (that happened to us too), ...

There is a whole bunch of components in a PC, and any of them can go bad...
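Scanning the log for disk-related messages can be sketched in Python as below. The patterns are illustrative assumptions only: real kernel messages differ between drivers and kernel versions, so the list would need adjusting to what the machine actually prints.

```python
import re

# Illustrative patterns (assumptions); adapt them to the messages your kernel emits.
DISK_ERROR_PATTERNS = [
    r"I/O error",
    r"\bata\d+: .*(error|timeout)",
    r"\braid1: .*(fail|error)",
]

def disk_error_lines(log_lines, patterns=DISK_ERROR_PATTERNS):
    """Return the lines that match any of the given disk-error patterns."""
    compiled = [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    return [line for line in log_lines if any(rx.search(line) for rx in compiled)]

# Possible usage, assuming the log lives at /var/log/messages:
#     with open("/var/log/messages") as log:
#         for hit in disk_error_lines(log):
#             print(hit, end="")
```

An empty result does not prove the disks are fine; it only means none of the assumed patterns appeared.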

----------

## viy

Some fresh news.

I was playing with the frequency controls in the BIOS. There is a warning that overclocking SRC may cause SATA devices to stop working. That is what I finally ran into :Wink:

I've restored the fail-safe defaults, and so far everything works better than it used to. I've also replugged the power and IDE cables on my disks.

More info will follow tomorrow, when I boot into the freshly installed system and try to set up the databases.

Thanks for the help, lunarg.

----------

## lunarg

Glad to hear it's only that (and not faulty hardware).  :Smile: 

Let me know how it goes.

----------

## viy

:Evil or Very Mad: The same again:

```
  CC      fs/nls/nls_base.o
  CC      fs/nls/nls_cp437.o
  CC      fs/nls/nls_iso8859-1.o
include/linux/cpumask.h: In function `num_booting_cpus':
include/linux/cpumask.h:189: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://bugs.gentoo.org/> for instructions.
The bug is not reproducible, so it is likely a hardware or OS problem.
make[2]: *** [fs/nls/nls_iso8859-1.o] Error 1
make[1]: *** [fs/nls] Error 2
make: *** [fs] Error 2
```

This time it happened during the kernel compile; bootstrap.sh and emerge system went smoothly.

Ehh, I don't know what to do now... :Crying or Very sad:

----------

## wnelson

Two things. First, try compiling without a pre-empt enabled kernel, i.e. boot the system with a kernel that has pre-emption disabled. I ran into this problem with kernel pre-emption enabled. Second, did you upgrade the BIOS of the system? If so, go back to the previous BIOS. Still, I believe it is a problem with pre-emption being enabled.

----------

## lunarg

Good to know. I have a similar (but less frequent) problem with one of our Linux desktop clients too.

Another possible culprit is SMP support. I read somewhere that certain kernel options don't work well together with SMP. If turning off pre-empt doesn't help, try that next.

----------

## viy

 *wnelson wrote:*   

> try compiling without a pre-empt enabled kernel

 

No luck. I even turned off the IDE1 bus (primary and secondary, master and slave), so now I have only IDE2 and IDE3, both for the SATA disks.

Regarding SMP: it's a Prescott, and it supports Hyperthreading, so I have:

```
[*] Symmetric multi-processing support
(2)   Maximum number of CPUs (2-255)
[*]   SMT (Hyperthreading) scheduler support
```

in my "Processor type and features".

I'm going to try switching off SMT first (I've never used it before), and then SMP altogether.
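When toggling options like these between test builds, it helps to confirm what the kernel .config actually contains before compiling. Below is a minimal Python sketch; `enabled_options` is a hypothetical helper, and the sample text is an illustrative fragment, not the poster's real config. The symbol names CONFIG_SMP, CONFIG_SCHED_SMT and CONFIG_PREEMPT are the ones 2.6 kernels use for the options discussed in this thread.

```python
def enabled_options(config_text, names):
    """Map each option name to its value in a kernel .config, or None if not set."""
    values = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key] = value
    return {name: values.get(name) for name in names}

WATCHED = ["CONFIG_SMP", "CONFIG_SCHED_SMT", "CONFIG_PREEMPT"]

# Illustrative fragment only (an assumption, not taken from the thread).
SAMPLE_CONFIG = """\
CONFIG_SMP=y
CONFIG_NR_CPUS=2
CONFIG_SCHED_SMT=y
# CONFIG_PREEMPT is not set
"""

print(enabled_options(SAMPLE_CONFIG, WATCHED))
```

Reading the real file instead of the sample is a matter of passing `open("/usr/src/linux/.config").read()`, assuming the kernel sources live at that path.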

----------

## wnelson

Which motherboard are you using with the Prescott chip? And which version of the BIOS?

----------

## viy

Gigabyte GA-8IG1000MK, AGP 8X / Dual Channel DDR, Intel 865G chipset

```
Award Modular BIOS v6.00PG
Intel 865G AGPSet BIOS for 8IG1000MK FE
12/22/2003-i865G-6A79AG0CC-00
```

And here you may take a look at my kernel config.

----------

## wnelson

I went to the Gigabyte site and there is a newer BIOS. I suggest upgrading the BIOS incrementally up to the most current version. I also recommend reading up on how to upgrade the BIOS beforehand, including how to recover from a failed upgrade. I think this is where your problems are coming from.

http://www.giga-byte.com/Motherboard/Support/BIOS/BIOS_GA-8IG1000MK.htm

----------

## wnelson

First disable the HPET timer...

----------

## viy

 *wnelson wrote:*   

> First disable the HPET timer...

 

It seems that the system works better without it; at least I haven't had any segfaults yet.

Could you explain when this option is appropriate, then? Thanks.

----------

## viy

 *viy wrote:*   

> It seems that the system works better without it; at least I haven't had any segfaults yet.

 

Not true: the faults only became a bit rarer.

Now I've upgraded the BIOS to version FH using the link provided by wnelson, and I'm in the fourth hour of running a test script (the script can be found via the link provided by lunarg, somewhere in the middle of the page).
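The idea behind such a test is to repeat a heavy build many times and stop at the first failure, since an intermittent compiler segfault rarely shows up on a single run. The following Python sketch is not the script from the linked page; `run_until_failure` and its parameters are assumptions made for illustration.

```python
import subprocess

def run_until_failure(command, max_rounds, cwd=None):
    """Run `command` up to `max_rounds` times.

    Returns the 1-based round in which the command first exited non-zero,
    or None if every round succeeded.
    """
    for round_number in range(1, max_rounds + 1):
        result = subprocess.run(
            command,
            cwd=cwd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            return round_number
    return None

# Possible usage, assuming kernel sources in /usr/src/linux (a hypothetical path):
#     run_until_failure(["sh", "-c", "make clean && make bzImage"],
#                       max_rounds=20, cwd="/usr/src/linux")
```

Cleaning before each round matters, because otherwise make finds everything up to date and the later rounds exercise nothing.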

All seems pretty stable:

```
dumb root # uptime
 11:15:46 up  3:22,  2 users,  load average: 1.45, 1.75, 1.57
```

No segfaults or any other problems.

Thank you both, guys, I'm really glad the problem is over!

----------

