# kernel panic on emerge [URGENT]

## hanj

Hello All

For the last few days I've been experiencing this problem during emerges. When I emerge a package, it intermittently freezes around the md5 check and unpack. On the console I get this message:

```
...

[<c016600f>]

[<c0127a29>]

Code:  Bad EIP value.

 <0>Kernel panic - not syncing: Fatal exception in interrupt
```

At first I thought it was a recent change in the kernel I made. I added SMP support in there. Yesterday, I built a new kernel and removed that feature set. Today it happened again. I just rolled back to an older kernel this morning.. and have not seen any problems as of yet.

My question.. could this be a kernel version or config problem.. or am I looking at a hardware problem?

The kernel config has been the same for about 2 years with the exception of SMP which I just added and removed. I feel like the config should be 'good'.

Here is my relevant info.

The kernel version that I was running was..

```
2.6.16-hardened-r11
```

I just rolled back to

```
2.6.16-hardened-r10
```

```
emerge --info

Portage 2.1-r2 (default-linux/x86/2006.0, gcc-3.4.6, glibc-2.3.6-r4, 2.6.16-hardened-r10 i686)

=================================================================

System uname: 2.6.16-hardened-r10 i686 Intel(R) Pentium(R) 4 CPU 2.80GHz

Gentoo Base System version 1.12.4

app-admin/eselect-compiler: [Not Present]

dev-lang/python:     2.3.5-r2, 2.4.3-r1

dev-python/pycrypto: 2.0.1-r5

dev-util/ccache:     [Not Present]

dev-util/confcache:  [Not Present]

sys-apps/sandbox:    1.2.17

sys-devel/autoconf:  2.13, 2.59-r7

sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2

sys-devel/binutils:  2.16.1-r3

sys-devel/gcc-config: 1.3.13-r3

sys-devel/libtool:   1.4.3-r4, 1.5.22

virtual/os-headers:  2.6.11-r2

ACCEPT_KEYWORDS="x86"

AUTOCLEAN="yes"

CBUILD="i686-pc-linux-gnu"

CFLAGS="-O2 -march=pentium4 -funroll-loops -fprefetch-loop-arrays -pipe"

CHOST="i686-pc-linux-gnu"

CONFIG_PROTECT="/etc /var/bind"

CONFIG_PROTECT_MASK="/etc/env.d /etc/gconf /etc/revdep-rebuild /etc/terminfo"

CXXFLAGS="-O2 -march=pentium4 -funroll-loops -fprefetch-loop-arrays -pipe"

DISTDIR="/usr/portage/distfiles"

FEATURES="autoconfig distlocks metadata-transfer sandbox sfperms strict"

GENTOO_MIRRORS="http://distfiles.gentoo.org http://distro.ibiblio.org/pub/linux/distributions/gentoo"

MAKEOPTS="-j2"

PKGDIR="/usr/portage/packages"

PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude='/distfiles' --exclude='/local' --exclude='/packages'"

PORTAGE_TMPDIR="/var/tmp"

PORTDIR="/usr/portage"

PORTDIR_OVERLAY="/usr/local/portage"

SYNC="rsync://rsync.gentoo.org/gentoo-portage"

USE="x86 alsa apache2 berkdb bitmap-fonts cli crypt dlloader dri eds emboss esd foomaticdb fortran gdbm gif gstreamer imlib isdnlog jpeg libg++ libwww mp3 ncurses nptl ogg pam pcre pdflib perl php png pppd python qt3 qt4 readline reflection sasl session snortsam spell spl ssl tcpd truetype-fonts type1-fonts udev vorbis xml xorg zlib elibc_glibc input_devices_keyboard input_devices_mouse input_devices_evdev kernel_linux userland_GNU"

Unset:  CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, LINGUAS, PORTAGE_RSYNC_EXTRA_OPTS
```

```
cat /proc/meminfo

MemTotal:       767020 kB

MemFree:        672240 kB

Buffers:          9320 kB

Cached:          45536 kB

SwapCached:          0 kB

Active:          50824 kB

Inactive:        28756 kB

HighTotal:           0 kB

HighFree:            0 kB

LowTotal:       767020 kB

LowFree:        672240 kB

SwapTotal:      979956 kB

SwapFree:       979956 kB

Dirty:               0 kB

Writeback:           0 kB

Mapped:          39844 kB

Slab:            10444 kB

CommitLimit:   1363464 kB

Committed_AS:   210700 kB

PageTables:        936 kB

VmallocTotal:   253944 kB

VmallocUsed:      2100 kB

VmallocChunk:   251844 kB
```

```
cat /proc/cpuinfo

processor       : 0

vendor_id       : GenuineIntel

cpu family      : 15

model           : 4

model name      : Intel(R) Pentium(R) 4 CPU 2.80GHz

stepping        : 1

cpu MHz         : 2793.507

cache size      : 1024 KB

fdiv_bug        : no

hlt_bug         : no

f00f_bug        : no

coma_bug        : no

fpu             : yes

fpu_exception   : yes

cpuid level     : 3

wp              : yes

flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc pni monitor ds_cpl cid xtpr

bogomips        : 5597.60
```

[update].. it just happened again with 2.6.16-hardened-r10!!! I know I've been running this kernel for a while in the past without any problems.

This is as far as I get on emerge...

```
emerge -v baselayout

Calculating dependencies... done!

>>> Emerging (1 of 1) sys-apps/baselayout-1.12.4-r6 to /

>>> checking ebuild checksums ;-)

>>> checking auxfile checksums ;-)

>>> checking miscfile checksums ;-)

>>> checking baselayout-1.12.4.tar.bz2 ;-)
```

Let me know if I can supply any additional information

Thanks!

hanji

----------

## NeddySeagoon

hanj,

Intermittent problems like this are normally indicative of a hardware problem.

The unpack is very CPU and RAM intensive, so the CPU temperature rises.

Check your cooling - use a stiff brush to clean your fan/heatsink assembly.

Run lm-sensors to keep an eye on temperatures (there are other ways).

Run memtest86 from the liveCD.

If that doesn't help, it's time to begin removing things.

Operate with one stick of RAM, test each one individually this way.
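
If lm-sensors does work on your board, something like this rough sketch can print readings while an emerge runs in another terminal. `log_temps` and `SENSORS_CMD` are names made up for illustration; only the `sensors` command itself comes from lm-sensors.

```shell
#!/bin/sh
# log_temps COUNT INTERVAL (hypothetical helper)
# Prints a timestamp followed by the sensor readings, COUNT times,
# sleeping INTERVAL seconds between readings.
# SENSORS_CMD is an assumed override for boards where plain `sensors`
# needs extra arguments; it defaults to lm-sensors' `sensors`.
log_temps() {
    count=${1:-10}
    interval=${2:-5}
    cmd=${SENSORS_CMD:-sensors}
    n=0
    while [ "$n" -lt "$count" ]; do
        date '+%H:%M:%S'
        # Stop early if the sensor command itself fails.
        $cmd || return 1
        n=$((n + 1))
        if [ "$n" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    return 0
}

# Usage: 120 readings, 5 seconds apart, while the emerge runs elsewhere.
#   log_temps 120 5
```

Watch the output over SSH or on another console, so the last reading before a freeze is still on screen.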

----------

## hanj

So there's no easy way to narrow it down to a CPU or RAM problem? Could this be the drive as well? This is one of my development servers, so I really can't have it down for hours doing memtest, etc... but if I have to.. I have to.

I'll try to get lm_sensors on there.. I don't think the board supports it very well. It's a Dell PowerEdge SC420.

Any tips to help narrow down the problem? I looked at the fan.. it looks good and clean.

Thanks for the reply

hanji

----------

## hanj

 *NeddySeagoon wrote:*   

> Intermittent problems like this are normally indicative of a hardware problem.
> 
> The unpack is very CPU and RAM intensive, so the CPU temperature rises.

 

It's odd that I can compile a kernel though.. seems like that should be torkin' on the CPU/RAM harder than an unpack. I would assume that it's pushing RAM.. right? Any quick way to push on RAM?

Thanks again.. I'm just thinking out loud.

hanji

----------

## NeddySeagoon

hanj,

Since you can compile a kernel, it's probably not the CPU. Kernel compiles do not push RAM; they work through lots of small files. An unpack does push RAM, but it uses the RAM for data rather than program code, and your crash comes from program code. Executing data normally leads to different errors.

I doubt it's a drive or a data cable. Reading corrupt data from the disk surface would be detected by the drive's CRC checks.

Program code that was written to disk incorrectly would fail repeatably - it would be wrong every time it was loaded into RAM.

You can run memtest86 without using the liveCD, but it's not very useful. It can get swapped out and moved around in physical RAM. Also, it can't test all of RAM because some things can't be swapped or moved.

Reduce the box to one stick of memory - see if it still happens. Try each stick in turn, on its own.

Pulling a memory stick and seeing the problem go away does not imply the stick you pulled is faulty.

Disturbing the contacts between the RAM and motherboard socket can fix it too.

----------

## hanj

 *NeddySeagoon wrote:*   

> You can run memtest86 without using the liveCD but it's not very useful. It can get swapped out and moved around in physical RAM, also, it can't test all of RAM because some things can't be swapped/moved.
> 
> Reduce the box to one stick of memory - see if it still happens. Try each stick in turn, on its own.
> 
> Pulling a memory stick and seeing the problem go away does not imply the stick you pulled is faulty.
> ...

 

Thanks again. Seems like we're pointing to RAM initially then. I think I'll try some tar'ing/untar'ing to see if I can reproduce this. Then jump to removing RAM.. if still nothing, I'll crank on memtest tonight with all the sticks in there.
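
Roughly what I have in mind for the tar/untar test, as a sketch only. `stress_unpack` is a name I made up, and it assumes GNU tar (which detects the compression on extraction) and `md5sum` from coreutils. It unpacks the same tarball over and over and compares checksums of the unpacked files. The input never changes, so a mismatch or a failed unpack on some run would point at RAM or CPU rather than at the file.

```shell
#!/bin/sh
# stress_unpack TARBALL [RUNS] (hypothetical helper, not part of portage)
# Unpacks TARBALL RUNS times into a scratch directory and checks that
# every run produces byte-identical files.
stress_unpack() {
    tarball=$1
    runs=${2:-10}
    workdir=$(mktemp -d) || return 2
    reference=""
    i=1
    while [ "$i" -le "$runs" ]; do
        rm -rf "$workdir/out"
        mkdir "$workdir/out"
        # A failed unpack is itself a result worth noting.
        if ! tar -xf "$tarball" -C "$workdir/out"; then
            echo "run $i: unpack failed"
            rm -rf "$workdir"
            return 1
        fi
        # Checksum every unpacked file, sort for a stable order,
        # then reduce the whole list to a single hash.
        sum=$(cd "$workdir/out" && find . -type f -exec md5sum {} + | sort | md5sum | cut -d' ' -f1)
        if [ -z "$reference" ]; then
            reference=$sum
        elif [ "$sum" != "$reference" ]; then
            echo "run $i: checksum mismatch ($sum != $reference)"
            rm -rf "$workdir"
            return 1
        fi
        i=$((i + 1))
    done
    rm -rf "$workdir"
    echo "all $runs runs matched"
}

# Usage (path assumed from DISTDIR in my emerge --info):
#   stress_unpack /usr/portage/distfiles/baselayout-1.12.4.tar.bz2 50
```

If the box panics partway through, that at least reproduces the problem without portage in the picture.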

You mentioned CRC checks.. is there any way I can check for CRC errors? hdparm? I'm not too familiar with the process of testing whether the drive is okay.

Thanks for the help.. it's much appreciated.

hanji

----------

## NeddySeagoon

hanj,

The CRC checks are all internal to the drive - it's more complex than just CRCs because the drive is able to recover so many bad bits in a sector too.

If you have an IDE drive, the easiest way to see if it's ok is to ask it with smartmontools.

SATA drives support SMART too, but libata in the kernel doesn't yet. There is a patch, but I don't think it's in the vanilla kernel yet.

All modern drives hide bad sectors from the OS by 'on the fly' remapping to spares, so you should never see a bad sector until the drive is end of life. You can dd the entire drive to /dev/null to read the surface. It will stop on errors.
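
As a rough sketch of both checks: `read_surface` is a made-up wrapper around the `dd` read, and `/dev/hda` in the usage comments is only an assumed device name. The `smartctl` options shown (`-H` for the health verdict, `-a` for everything) are standard smartmontools ones.

```shell
#!/bin/sh
# read_surface DEVICE (hypothetical helper)
# Reads DEVICE from start to end and throws the data away.
# dd stops at the first read error, so a clean finish means every
# block was readable. This only reads; it never writes to DEVICE.
read_surface() {
    if dd if="$1" of=/dev/null bs=1M 2>/dev/null; then
        echo "$1: read to the end without errors"
    else
        echo "$1: read failed (I/O error or no permission), check dmesg"
        return 1
    fi
}

# Usage, as root. /dev/hda is an assumption; substitute your own drive.
#   smartctl -H /dev/hda     # overall SMART health verdict
#   smartctl -a /dev/hda     # full attribute table and error log
#   read_surface /dev/hda    # slow: reads the whole drive
```

The full read takes a long time on a big drive, so run it when the box is otherwise idle.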

I think it's either memory, PSU (as in the PSU box) or the Vcore PSU on the motherboard.

----------

## troymc

You can try enabling Machine Check Exception support for your processor in your kernel. The hardware tracks its own errors in a special log. This will catch CPU, memory and other mainboard-related errors.

Once you have support enabled in the kernel, you will see messages in /var/log/messages stating something like "a machine check exception was logged" if your hardware detects an error. You can install app-admin/mcelog to view these errors.
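
A minimal sketch of both checks, in case it helps. `mce_configured` and `mce_events` are names I made up. `CONFIG_X86_MCE` is the kernel option, but the config path in the usage comments is an assumption about your setup, and the exact wording of a logged error varies by kernel, so the search just looks for any line mentioning a machine check and drops the boot-time banner lines.

```shell
#!/bin/sh
# mce_configured CONFIGFILE (hypothetical helper)
# Succeeds if Machine Check Exception support is built into the kernel.
mce_configured() {
    grep -q '^CONFIG_X86_MCE=y' "$1"
}

# mce_events LOGFILE (hypothetical helper)
# Prints log lines that mention a machine check, skipping the banners
# that only say the feature is supported or enabled.
# Returns non-zero when nothing is left, like grep does.
mce_events() {
    grep -i 'machine check' "$1" |
        grep -v -e 'architecture supported' -e 'reporting enabled'
}

# Usage (paths are assumptions):
#   mce_configured /usr/src/linux/.config && echo "MCE support is built in"
#   mce_events /var/log/messages || echo "no machine check errors logged"
```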

troymc

----------

## NeddySeagoon

troymc,

That's worth a try. However, many hardware errors prevent the log being written, so no errors in the log does not mean the hardware is ok.

----------

## hanj

Hello

I received a different error today.. just wanted to post it in case it provides an additional clue...

```
Unable to handle kernel NULL pointer dereference at virtual address 00000078

 printing eip:

c0127f54

*pdg =    0

*pmd =   0

Recursive die() failure, output supressed

 <0>Kernel panic - not synching: Fatal exception in interrupt
```

This time it happened at the configure stage when it was checking for compile options. I blew out the computer real good the other day and things weren't giving me a problem until this morning. I'll pop out a stick of RAM today, and start that test. I want to do one thing at a time.

 *Quote:*   

> You can try enabling Machine Check Exception support for your processor in your kernel. The hardware tracks its own errors in a special log. This will catch CPU, memory and other mainboard-related errors.

 

I'll do this too.

Thanks everyone.

hanji

----------

## hanj

 *troymc wrote:*   

> You can try enabling Machine Check Exception support for your processor in your kernel. The hardware tracks its own errors in a special log. This will catch CPU, memory and other mainboard-related errors.
> 
> Once you have support enabled in the kernel, you will see messages in /var/log/messages stating something like "a machine check exception was logged" if your hardware detects an error. You can install app-admin/mcelog to view these errors.
> 
> troymc

 

hmm. I just checked my kernel config, and I already had that built in. I grep'd the logs.. and I just see it 'enabled' on the CPU, but no errors.

```
Intel machine check architecture supported.

Intel machine check reporting enabled on CPU#0
```

Bummer.

hanji

----------

## hanj

Hello

After 36 hours and 114 passes with no errors, I had to shut off memtest. Does this mean that memory is okay.. or do I have to completely finish the test? I can't believe it's taking this long.

I'm going to try reseating the RAM next.

Thanks!

hanji

----------

## drescherjm

 *Quote:*   

> or do I have to completely finish the test?

 

It is a continuous test so it ends when you feel that you have waited long enough...

----------

## mope

Did you ever pinpoint the problem?

I'm getting the same thing on my presario 1710nx.

I'm on 2.6.18-r1 kernel.

I'll check memtest this morning and report back this afternoon, but maybe it's due to heat and the stock heatsink/fan?

----------

## lynnlinux

Hi all, after I installed the base Gentoo system (2006.0), I began to emerge KDE.

During this long build, an error happened when building kdelibs.

The error message is the one in the title:

"Kernel panic - not syncing : Fatal exception in interrupt"

Thank you

----------

## didymos

Was there no more to the message?  Sounds like it might be a disk error, but without more info, I couldn't say.  Have you tried to build kdelibs again?

----------

## nixnut

merged above two posts here.

----------

