# Gentoo server random crashes

## gui92

Hello

I run a Gentoo box as a busy web server (dual Xeon 2.6 HT, 6GB RAM, 72GB RAID 5 SCSI Adaptec 2000S).

Kernel 2.6.17-r6 with the latest glibc and gcc 4.1.1.

I'm facing random, severe crashes.

Sometimes, after a few hours or a few days, the services stop responding: no HTTP, no FTP, no SSH (stuck at the password prompt). Only ping continues to respond.

I managed to open a top and a dstat console before a crash happened. During the failure the two processes kept responding and showed this:

    top - 14:13:12 up  1:47,  0 users,  load average: 920.84, 902.29, 399.07
    Tasks: 278 total,   1 running, 268 sleeping,   0 stopped,   9 zombie
    Cpu(s):  0.1%us,  0.1%sy,  0.0%ni, 24.8%id, 74.9%wa,  0.0%hi,  0.1%si,  0.0%st
    Mem:   6232460k total,  2428772k used,  3803688k free,    51772k buffers
    Swap:  4008208k total,        0k used,  4008208k free,   718260k cached
      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    10168 root      16   0  2244 1236  836 R    0  0.0   0:08.96 top
        1 root      16   0  1480  520  452 S    0  0.0   0:01.82 init
        2 root      RT   0     0    0    0 S    0  0.0   0:00.06 migration/0
        3 root      34  19     0    0    0 S    0  0.0   0:00.02 ksoftirqd/0
        4 root      RT   0     0    0    0 S    0  0.0   0:00.10 migration/1
        5 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/1
        6 root      RT   0     0    0    0 S    0  0.0   0:00.10 migration/2
        7 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/2
        8 root      RT   0     0    0    0 S    0  0.0   0:00.04 migration/3
        9 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/3
       10 root      10  -5     0    0    0 S    0  0.0   0:00.00 events/0
       11 root      10  -5     0    0    0 S    0  0.0   0:00.00 events/1
       12 root      10  -5     0    0    0 S    0  0.0   0:00.00 events/2
       13 root      10  -5     0    0    0 S    0  0.0   0:00.00 events/3
       14 root      10  -5     0    0    0 S    0  0.0   0:00.03 khelper
       15 root      10  -5     0    0    0 S    0  0.0   0:00.00 kthread
       20 root      10  -5     0    0    0 S    0  0.0   0:00.11 kblockd/0
       21 root      10  -5     0    0    0 S    0  0.0   0:00.03 kblockd/1
       22 root      10  -5     0    0    0 S    0  0.0   0:00.02 kblockd/2
       23 root      10  -5     0    0    0 S    0  0.0   0:00.04 kblockd/3
       24 root      10  -5     0    0    0 S    0  0.0   0:00.00 kseriod
       27 root      10  -5     0    0    0 S    0  0.0   0:00.00 khubd

And

    ---procs--- ------memory-usage----- ---paging-- -disk/total ---system-- ----total-cpu-usage----
    run blk new|_used _buff _cach _free|__in_ _out_|_read write|_int_ _csw_|usr sys idl wai hiq siq
      0   5   0|1618M   51M  701M 3717M|   0     0 |   0     0 | 317    27 |  0   0  25  75   0   0
      0   5   0|1618M   51M  701M 3717M|   0     0 |   0     0 | 308    13 |  0   0  25  75   0   0
      0   5   0|1618M   51M  701M 3717M|   0     0 |   0     0 | 319    21 |  0   0  25  75   0   0
      0   5   6|1618M   51M  701M 3716M|   0     0 |   0     0 | 386   112 |  1   0  24  75   0   0
      0   5   0|1618M   51M  701M 3716M|   0     0 |   0     0 | 316    21 |  0   0  25  75   0   0
      0   5   0|1618M   51M  701M 3716M|   0     0 |   0     0 | 316    15 |  0   0  25  75   0   0
      0   5   2|1618M   51M  701M 3716M|   0     0 |   0     0 | 334    39 |  0   0  25  75   0   0
      0   5   0|1618M   51M  701M 3716M|   0     0 |   0     0 | 325    25 |  0   0  25  75   0   0
      0   5   0|1618M   51M  701M 3716M|   0     0 |   0     0 | 319    33 |  0   0  25  75   0   0
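
The top output above shows a load average over 900 with only one running task and about 75% I/O wait. On Linux the load average counts tasks in uninterruptible sleep (state `D`, typically waiting on disk I/O) as well as runnable ones, so listing the `D`-state processes during a hang may show what is stuck. Here is a minimal sketch, assuming a Linux `/proc` filesystem; the function names `parse_stat` and `blocked_tasks` are made up for this example, and it only looks at each process's main thread:

```python
import os

def parse_stat(stat_line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the LAST ')' instead of on spaces.
    """
    head, _, tail = stat_line.rpartition(")")
    pid, _, comm = head.partition(" (")
    state = tail.split()[0]
    return int(pid), comm, state

def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for every process in uninterruptible sleep (state D)."""
    blocked = []
    if not os.path.isdir(proc_root):
        return blocked  # not a Linux system, nothing to scan
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

Calling `blocked_tasks()` from a console that was opened before the hang (like the top and dstat sessions here) and printing the result would give the list of stuck processes at that moment.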

After a hard reboot, I can see that the Apache log shows random errors like:

    [Mon Sep 04 18:02:24 2006] [notice] child pid 2724 exit signal Segmentation fault (11)
    *** glibc detected *** /usr/sbin/apache2: double free or corruption (out): 0xa7861a98 ***
    [Tue Sep 05 06:38:04 2006] [notice] child pid 30660 exit signal Segmentation fault (11)
    [Tue Sep 05 10:45:45 2006] [notice] child pid 3364 exit signal Segmentation fault (11)
    [Tue Sep 05 11:59:00 2006] [notice] child pid 3916 exit signal Bus error (7)

Do you have any idea what is happening?

I can understand an apache2 failure caused by overload, but why does the whole server crash?

Please excuse my poor English.

Thanks for your help.

----------

## Kruegi

Could be a hardware error.

First, run a complete disk check (fsck) and a memory check (-> http://www.memtest.org).

Thomas

----------

## Janne Pikkarainen

I think there are two options (based on Apache errors): either you compiled the system with some über wicked CFLAGS or there is a hardware problem. I suspect the latter.

----------

## gui92

 *Janne Pikkarainen wrote:*   

> I think there are two options (based on Apache errors): either you compiled the system with some über wicked CFLAGS or there is a hardware problem. I suspect the latter.

 

I think the CFLAGS are very standard :

CFLAGS="-O2 -march=pentium4 -pipe"

CHOST="i686-pc-linux-gnu"

I ran all the hardware tests, without errors...

----------

## Janne Pikkarainen

OK. So it starts to sound like a hardware error. Over the years I've seen all kinds of odd errors that shouldn't exist in a perfect world: for example, one brand-new server was installed with a CPU that was originally meant for a server operating at a different front-side bus speed than ours (400 MHz vs 533 MHz or so).

As a result the server booted, and even ran its hardware tests OK. But problems started during the Gentoo installation (a nice stress test, by the way :D): over a couple of tries the symptoms varied from random compilation errors to total hangs. Ever since we replaced the CPU, the server has been trouble-free.

Apache shouldn't throw messages like "Bus error" if the hardware is OK. I suspect the CPU or the memory.
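
To see how often each kind of crash shows up, the child exit messages can be tallied by signal. On x86 Linux, signal 11 is SIGSEGV and signal 7 is SIGBUS. This is only a sketch: the regular expression is written against the log lines posted earlier in this thread, the function name `count_exit_signals` is made up, and other Apache versions may word the message differently.

```python
import re
from collections import Counter

# Matches lines such as:
# [Tue Sep 05 11:59:00 2006] [notice] child pid 3916 exit signal Bus error (7)
EXIT_RE = re.compile(r"child pid (\d+) exit signal (.+?) \((\d+)\)")

def count_exit_signals(lines):
    """Count Apache child exits per (signal name, signal number)."""
    counts = Counter()
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            counts[(match.group(2), int(match.group(3)))] += 1
    return counts

if __name__ == "__main__":
    # Sample lines copied from the first post; a real run would read the error log.
    sample = [
        "[Mon Sep 04 18:02:24 2006] [notice] child pid 2724 exit signal Segmentation fault (11)",
        "*** glibc detected *** /usr/sbin/apache2: double free or corruption (out): 0xa7861a98 ***",
        "[Tue Sep 05 06:38:04 2006] [notice] child pid 30660 exit signal Segmentation fault (11)",
        "[Tue Sep 05 10:45:45 2006] [notice] child pid 3364 exit signal Segmentation fault (11)",
        "[Tue Sep 05 11:59:00 2006] [notice] child pid 3916 exit signal Bus error (7)",
    ]
    for (name, number), n in count_exit_signals(sample).most_common():
        print(f"{n:3d}  {name} (signal {number})")
```

A tally like this does not identify the faulty part, but a mix of segmentation faults and bus errors in unrelated children would fit the hardware suspicion better than a single repeatable software bug.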

----------

## gui92

 *Janne Pikkarainen wrote:*   

> OK. So it starts to sound like a hardware error. Over the years I've seen all kinds of odd errors that shouldn't exist in a perfect world: for example, one brand-new server was installed with a CPU that was originally meant for a server operating at a different front-side bus speed than ours (400 MHz vs 533 MHz or so).
> 
> As a result the server booted, and even ran its hardware tests OK. But problems started during the Gentoo installation (a nice stress test, by the way): over a couple of tries the symptoms varied from random compilation errors to total hangs. Ever since we replaced the CPU, the server has been trouble-free.
> 
> Apache shouldn't throw messages like "Bus error" if the hardware is OK. I suspect the CPU or the memory.

 

Ok, thanks.

This kind of hardware failure will not be easy to detect and solve :-/

I will try lighttpd to see what happens with it.

----------

## Ast0r

You aren't, by chance, overclocking that CPU are you?

Also, running memtest86 would be a good idea.

----------

## gui92

 *Ast0r wrote:*   

> You aren't, by chance, overclocking that CPU are you?
> 
> Also, running memtest86 would be a good idea.

 

No, this is a three-year-old stock production server.

It was quite stable during its first year, and then we began to have monthly, then weekly, random crashes.

I thought it was due to overloading.

But since last week we have been facing hourly crashes, and now, without load (only named and postfix; I stopped apache and mysql), the server crashes after only half an hour.

I know the server's chipset (Intel) has memory limitations (it only accepts a few types of memory), but why this sudden instability after many months of (relative) stability?

----------

## r4d1x

Sounds like hardware :/ . The best part is, you get to go on a hunt to find out what's going bad!  *cheers*

----------

## Joel D.

I have a similar problem.

My server crashes after 10-12 days....

Nothing in the log, and the hardware is new. I changed the memory to be sure, but it still crashes...

HELP me please...

I have an AMD Sempron 2800+ (64-bit)

Motherboard: Asus K8S-MX

512 MB Corsair memory

----------

## Cinquero

Can you determine when exactly the instabilities started to occur? If it is software-related, try to determine what relevant software you have changed just before that started (kernel? gcc? toolchain upgrade?).

Have you run memtest86? Most common problems are probably memory timing/error problems...

If there is a problem with the CPU/memory, do a stress test with one of the scripts listed at:

https://stier.dynu.com/~moinmoin/MarksWiki/LinuxKernel/KernelTests

Check the power supply.
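
One way to read the stress-test suggestion: a deterministic job, repeated many times on sound hardware, should give the same result every time, so any mismatch points at CPU, memory, or power. The sketch below is a crude Python illustration of that idea only. It is not one of the scripts from the linked page, it loads the machine far less than a kernel build, and it is no replacement for memtest86. The names `workload` and `run_check` are invented for this example.

```python
import hashlib

def workload(size_mb=16, seed=b"stress"):
    """Fill a buffer with deterministic data and return its SHA-256 digest."""
    block = hashlib.sha256(seed).digest()       # 32 bytes
    target = size_mb * 1024 * 1024
    buf = bytearray()
    while len(buf) < target:
        block = hashlib.sha256(block).digest()  # next deterministic block
        buf += block * 1024                     # append 32 KiB per step
    return hashlib.sha256(bytes(buf[:target])).hexdigest()

def run_check(rounds=10, size_mb=16):
    """Repeat the workload; return the rounds whose digest differs from round 0."""
    reference = workload(size_mb)
    return [i for i in range(1, rounds) if workload(size_mb) != reference]
```

For example, `run_check(rounds=100, size_mb=64)` should return an empty list on a healthy machine; a non-empty list (or a crash during the run) would be worth following up with a proper memory and CPU test.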

----------

## Joel D.

 *Cinquero wrote:*   

> Can you determine when exactly the instabilities started to occur? If it is software-related, try to determine what relevant software you have changed just before that started (kernel? gcc? toolchain upgrade?).
> 
> Have you run memtest86? Most common problems are probably memory timing/error problems...
> 
> If there is a problem with the CPU/memory, do a stress test with one of the scripts listed at:
> ...

 

Hi,

I changed the power supply and the memory, and I ran memtest86; all of those tests were fine. The computer is new... the instabilities started at the beginning of the PC's life. I installed Gentoo 2005.1 and had some problems with the SATA drive. I finally found the driver, but it was always crashing.... The server is running apache2/pure-ftpd/ssh/mysql.

I'm from Quebec City in Canada... excuse my poor English.

Thanks a lot

Joel

----------

## Cinquero

 *Joel D. wrote:*   

> ...
> 
> Joel

 

Run the kernel build stress test.

Is the crash time related to the room temperature? Or to the server load?
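
To answer that kind of question after the fact, it helps to have a small log of load and temperature that survives a hard reset. Here is a minimal sketch, with loud assumptions: the sensor path `/sys/class/thermal/thermal_zone0/temp` exists only on newer kernels and may be absent on this machine (lm_sensors would be the alternative), and the function names and log format are invented for this example.

```python
import os
import time

def format_sample(now, load, temp_c=None):
    """Build one log line: timestamp, 1/5/15-minute load, optional temperature."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(now))
    temp = "n/a" if temp_c is None else f"{temp_c:.1f}"
    return f"{stamp} load={load[0]:.2f},{load[1]:.2f},{load[2]:.2f} temp_c={temp}\n"

def read_temp_c(path="/sys/class/thermal/thermal_zone0/temp"):
    """Return the temperature in degrees C, or None if the sensor file is absent.

    ASSUMPTION: the path above is a common sysfs location, not something
    confirmed for the hardware discussed in this thread.
    """
    try:
        with open(path) as f:
            return int(f.read().strip()) / 1000.0  # sysfs reports millidegrees
    except (OSError, ValueError):
        return None

def log_once(log_path):
    """Append one sample and force it to disk so it survives a hard reset."""
    line = format_sample(time.time(), os.getloadavg(), read_temp_c())
    with open(log_path, "a") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())
    return line
```

Calling `log_once("/var/log/crashwatch.log")` (a hypothetical path) every ten seconds from a loop would leave the last readings before each crash on disk. If the disk itself is what hangs, the final write may never land, which is a clue in its own right.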

----------

## Joel D.

 *Cinquero wrote:*   

>  *Joel D. wrote:*   ...
> 
> Joel 
> 
> Run the kernel build stress test.
> ...

 

I will run the kernel build stress test. I added a fan for the temperature, and the server load is not high when it crashes.

----------

## Cinquero

Hmmm... are you running an X server? If yes, switch to the VESA driver and/or prevent it from starting at all. There are some notoriously unstable graphics chips around...

Which gcc version do you use?

Do the fans on the graphics card still work?

----------

## Joel D.

 *Cinquero wrote:*   

> Hmmm... are you running an X server? If yes, switch to VESA driver and/or prevent it from starting at all. There are some notoriously instable graphics chips around...

 

No, the X server is not running. 

I read some posts about similar problems, and people are talking about DMA and SATA driver problems. When I'm copying a big file through the FTP server on the LAN, the server sometimes crashes... in the "top" command, the "wa" value in the CPU section goes to 100%...

Maybe this could help you: I have kernel 2.6.16-r9.

The SATA driver is the one from the kernel (SCSI low-level driver, SiS 964/180 SATA).

Thanks

----------

## Cinquero

Well, then try

hdparm -d0 /dev/sda 

or so to disable DMA access. It won't give you an insane speed, but you will be able to check if it is related to DMA disk transfers.

You could also try bugzilla.kernel.org to see if the bug has been fixed in more recent kernel versions.
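
To compare runs with and without DMA, it helps to put a number on the "wa" figure instead of watching top. A minimal sketch, assuming a Linux `/proc/stat` whose aggregate `cpu` line has the usual field order (user, nice, system, idle, iowait, irq, softirq, steal); the function names are made up for this example:

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]

def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas[:8])   # user nice system idle iowait irq softirq steal
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total   # index 4 is iowait

def read_cpu_sample(path="/proc/stat"):
    """Read the current aggregate CPU counters (first line of /proc/stat)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

Taking one `read_cpu_sample()` before a large file copy and one a few seconds into it, then passing both to `iowait_percent`, gives a figure that can be compared between the two configurations.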

----------

## Joel D.

 *Cinquero wrote:*   

> Well, then try
> 
> hdparm -d0 /dev/sda 
> 
> or so to disable DMA access. It won't give you an insane speed, but you will be able to check if it is related to DMA disk transfers.
> ...

 

hdparm -d0 /dev/sda gives me:

 *Quote:*   

> jdhosts ~ # hdparm -d0 /dev/sda
> 
> /dev/sda:
> 
>  setting using_dma to 0 (off)
> ...

 

Is it OK, or do I need to wait and see if it still crashes?

thank

----------

## Cinquero

 *Joel D. wrote:*   

> ...
> 
> thank

 

hmmm... ok, for /dev/sd* you need to use sdparm, but I don't know how to enable/disable DMA for SATA devices... maybe in the BIOS?

----------

## Joel D.

 *Cinquero wrote:*   

>  *Joel D. wrote:*   ...
> 
> thank 
> 
> use
> ...

 

I just tried to transfer an 800 MB file with the FTP server. I started "top" over SSH, the "wa" in the CPU section went to 89%, and then the server crashed...

hdparm /dev/sda gives me:

 *Quote:*   

> 
> 
> /dev/sda:
> 
>  IO_support   =  0 (default 16-bit)
> ...

 

I don't see any warnings or errors in /var/log/messages or dmesg....

----------

## Joel D.

 *Cinquero wrote:*   

>  *Joel D. wrote:*   ...
> 
> thank 
> 
> hmmm... ok, for /dev/sd* you need to use sdparm, but I don't know how to enable/disable DMA for SATA devices... maybe in the BIOS?

 

Ok I will look in the bios

----------

## Joel D.

 *Joel D. wrote:*   

>  *Cinquero wrote:*    *Joel D. wrote:*   ...
> 
> thank 
> 
> hmmm... ok, for /dev/sd* you need to use sdparm, but I don't know how to enable/disable DMA for SATA devices... maybe in the BIOS? 
> ...

 

Nothing in the BIOS about SATA and DMA. I installed sdparm and I'm now looking at it.

I really think that the problem is access to the hard drive... I really don't know what I need to do to fix it.

I tried again to copy an 800 MB file via FTP to the server, and again the "wa" went to 100% after the load average got very high, and then everything crashed....

----------

## gui92

 *r4d1x wrote:*   

> sounds like hardware :/ .  The best part is, you get to go on a hunt to find out whats going bad!  *cheers*

 

As for my problem, it was due to the Adaptec RAID ZCR card. A chipset clip broke and hit the card, destroying a small chip.

It's SOLVED for me, thanks all.

----------

## Cinquero

 *Joel D. wrote:*   

> ....

 

You could disable DMA in libata.... as far as I have read elsewhere. But that sure ain't easy if you don't know C. Or try "ide=nodma" first.

Check if "local APIC" in the kernel config is disabled. I remember that that option caused problems on some systems.
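
A quick way to check an option like this is to look it up in the kernel configuration. The sketch below assumes the config is available either as `/proc/config.gz` (only present when the kernel was built with `CONFIG_IKCONFIG_PROC`) or as `/usr/src/linux/.config` (which may not match the running kernel). The function names are invented for this example.

```python
import gzip
import os

def config_value(config_text, option):
    """Return the option's value ('y', 'm', or a literal), or None if not set."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return None
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None

def read_kernel_config(candidates=("/proc/config.gz", "/usr/src/linux/.config")):
    """Return the text of the first kernel config file found, or None."""
    for path in candidates:
        if not os.path.exists(path):
            continue
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as f:
            return f.read()
    return None
```

With the config text in hand, `config_value(text, "CONFIG_X86_LOCAL_APIC")` returns `"y"` when the local APIC is enabled and `None` when the option is not set.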

----------

## Cinquero

You always copy data from network. Did you try copying data locally?

----------

## Joel D.

 *Quote:*   

> CONFIG_X86_LOCAL_APIC=y
> 
> 

 

Do I need to disable it?

Thanks a lot

----------

## Joel D.

 *Cinquero wrote:*   

> You always copy data from network. Did you try copying data locally?

 

Both locally and over the network (FTP), it crashes....

----------

## Cinquero

 *Joel D. wrote:*   

> I need to disable it ?

 

Disable it and check if the system is more stable then.

possibly unrelated (but who knows?): https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=186852

----------

## Joel D.

I will change the motherboard to get a better SATA controller... the K8S-MX has too many problems.

I was looking at the Asus K8N-VM with an Nvidia chipset... it looks good... what do you think, does Nvidia work well with Gentoo?

Thanks

----------

## Cinquero

Probably. But I personally have more trust in Intel chipsets...

----------

## Joel D.

I changed the board, and now it works perfectly.

Transfers, both locally and over the network, are very fast and the computer looks stable.

Thanks a lot Cinquero

----------

## Cinquero

 *Joel D. wrote:*   

> I changed of board, now it work #1.

 

Yeah, SiS chipsets are probably among the worst. I keep saying that we need specific _RECOMMENDATIONS_ based on personal experience and not just stupid compatibility databases...

----------

