# Random complete lockups when compiling: kernel panic.

## Mamour

Hi folks,

Gentoo has been quite an amazing experience for me as a newbie Linux user, and I've been nothing but impressed with it :Smile: . However, since the beginning, it seems a curse has been cast upon my computer, and I have the not-so-delectable honor of getting random lockups whenever it's under heavy stress (read: while compiling huge packages, such as wine, mozilla, or xorg-x11).

For quite a while, I was completely at a loss as to what caused these nasty crashes, since I would usually be under X at the time, so I could only resign myself to shutting down the machine manually. Today, I decided to try compiling one of my lucky charm packages with X closed, using nothing but the console. And, lo and behold, I was able to hear the last gasp of my kernel before it finally gave in, yet again.

I have no idea what to do with all the stuff it spat out, so I'm hoping someone here will be able to enlighten me, or even tell me how to get rid of this issue :Smile: .

```
Oops: 0002 [#1]
PREEMPT
Modules linked in: sd_mod nvidia
CPU: 0
EIP:  0060:[<c01072bf>]          Tainted:   P      VLI
EFLAGS: 00010002   (2.6.11.6)
EIP is at timer_interrupt+0x3f/0x140
eax: 00000080  ebx: d0960000  ecx: 00000001  edx: 00000001
esi: d0961fc4  edi: 00000001  ebp: 00000000  esp: d0961f60
ds: 007b  es: 007b  ss: 0068
Process cc1plus (pid: 25452, threadinfo=d0960000  task=d17f10c0)
Stack:  00000001  c048a760  00000000  d0961fc4  c0132c60  00000000  00000000  d0961fc4
        00000000  00000000  d0960000  c051fa00  d0960000  c0132d72  00000000  c011ac3d
        c0563da8  c048a760  d0961fc4  0000005d  b44b670c  00000000  bfffc068  c0104ad9
Call Trace:
  [<c0132c60>] handle_IRQ_event+0x30/0x70
  [<c0132d72>] __do_IRQ+0xd2/0x160
  [<c011ac3d>] __do_softirq+0x7d/0x90
  [<c0104ad9>] do_IRQ+0x19/0x30
  [<c010309e>] common_interrupt+0x1a/0x20
Code: ff 05 48 40 56 c0 a1 a8 8d 48 c0 ff 50 04 8b 3d 90 a6 55 c0 85 ff 74 1a ff 43 14 b8 0c 00 00 00 e6 20 e4 20 8b 43 08 ff 4b 14 a8 <08> 0f 85 e5 00 00 00 89 34 24 e8 e2 7c 01 00 31 c0 f6 46 32 02
<0> Kernel panic - not syncing: Fatal exception in interrupt
```

There, that's all I could gather, since I couldn't switch to the actual virtual console where the compilation took place. I hope this helps anyway, and that someone'll be able to lend me a hand.  :Smile: 

Any help or tips much appreciated!

Cheers!

EDIT: I forgot to add that my RAM seems to be doing all right, no errors turned up after a few passes with Memtest86+, so that's probably ruled out.

----------

## aschueler

Just a noob myself, but if I may add: I have had similar problems caused by bad memory, so I wouldn't rule it out quite yet. It may take more than just several passes of Memtest to find it. Also, your system and/or hard drives could be overheating, since it seems to occur at heavy loads.

----------

## BitJam

Such crashes are typical on overclocked systems.  If you are overclocking, you need to stop doing it while emerging.   If you have any doubts, search the forums here for "overclocking".

It certainly sounds like a hardware problem. If it always crashes in exactly the same place in exactly the same way with exactly the same error message, then it is almost certainly a software problem. If things are different each time, then it is likely a hardware problem.

If you are not overclocking, is it possible that your system is overheating?  I think this is the 2nd most likely cause of such crashes (overclocking is the first).

----------

## Mamour

aschueler, I'll try running additional tests on my RAM today.

BitJam, I'm not overclocking my system at all. I'm running on an Athlon XP 1800+, which (according to lm_sensors) runs at about 65°C/149°F under high loads, and 35°C/95°F when idle (using athcool). Would that be considered overheating?

I can't tell whether the crashes have always occurred in the same place, as I've only seen the kernel messages once. They aren't logged either, since I guess the system gets killed before metalog can kick in. I'll try reproducing the errors, to see if anything else comes up.

Thanks for your interest!  :Very Happy: 

----------

## aschueler

Did a quick Google search for optimal temps for the Athlon 1800, and found, interestingly, that it does not contain a temperature sensor on the chip itself. The max operating temp is in the 90°C range, but some people have reported suspected problems close to the temperature range you are reporting. If this is a system you built yourself, you shouldn't have difficulty adding a $10 fan in the front of the case, and it may be all you need. My setup has an intake fan, two exhaust fans, and two great big Zalman fans: one over the CPU and one over the video card. It's surprisingly quiet when you buy bigger fans.

A quick way to test the temperature idea is to open the case and let it run open while you put it under a heavy load.  Also, you may want to check for dust etc on the cooling fans you already have and blow those off.

I am now officially out of idears  :Smile: 

----------

## Mamour

My case always runs open, so there's nothing to worry about :Wink: .

OK, here's the rundown for today's tests. I had Memtest run for about 5 hours; I think it completed 7 passes without finding any errors. At this point, I can only guess that it's highly improbable that the RAM's the culprit.

Afterwards, I tried compiling Mozilla again, and I was quite surprised to see that not only did it crash, but the crash looked virtually the same as the one I posted above (the program was probably running elsewhere in memory, which caused some stack and register contents to be different). The crash occurred at the same place (EIP at timer_interrupt+0x3f/0x140).
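
For anyone who would rather not eyeball two dumps side by side, here is a rough Python sketch of that comparison. It is not from the thread: the function name `fault_location` is made up for illustration, and the regex assumes the 2.6-era i386 oops layout shown in the first post. Since the second dump was never posted in full, only its reported "EIP is at" line is used.

```python
import re

def fault_location(oops_text):
    """Return the 'symbol+offset/size' from the 'EIP is at' line, or None."""
    match = re.search(r"^EIP is at (\S+)", oops_text, re.MULTILINE)
    return match.group(1) if match else None

# Excerpt of the first dump posted in this thread.
first_crash = """\
EIP:  0060:[<c01072bf>]          Tainted:   P      VLI
EFLAGS: 00010002   (2.6.11.6)
EIP is at timer_interrupt+0x3f/0x140
"""

# Only the line reported for the second crash; the full dump was not posted.
second_crash = "EIP is at timer_interrupt+0x3f/0x140\n"

# Registers and stack contents may differ between runs; the faulting
# symbol and offset are what identify "the same place".
same_place = fault_location(first_crash) == fault_location(second_crash)
print("same faulting location:", same_place)
```

A matching location across runs is only a hint that the cause is repeatable; on its own it does not prove a software bug.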

Is there any chance this is a kernel bug of sorts? It's too much like yesterday's crash, so I guess it's not _that_ random after all! Could that rule out the faulty hardware theory?

I'll keep you posted if I find anything else.

----------

## BitJam

Yes, this is beginning to look like a kernel bug.  I did a google-groups search for "EIP timer_interrupt" (no quotes) and it pointed to some interesting threads that may give you some hints on how to track down and fix the problem.

Certainly one thing to try is to boot off the LiveCD, play the chroot trick and then try doing the same emerge using the kernel from the LiveCD. If the problem goes away then the problem is almost certainly with your kernel.

----------

## Mamour

Okay, I tried something new, namely being in the same virtual console as the compilation process itself, and I got a slightly different message this time around (I've skipped the stack contents):

```
Oops: 0002 [#1]
PREEMPT
Modules linked in: sd_mod nvidia
CPU: 0
EIP:  0060:[<4911eca0>]          Tainted:   P      VLI
EFLAGS: 00010002   (2.6.11.6)
EIP is at 0x4911eca0
eax: 00000001  ebx: dbfc6000  ecx: 00000000  edx: 00000000
esi: dbfc7fc4  edi: 00000001  ebp: 00000000  esp: dbfc7f5c
ds: 007b  es: 007b  ss: 0068
Process cc1plus (...)
Stack: (...)
Call Trace:
  [<c01072e9>] timer_interrupt+0x69/0x140
  [<c0132c60>] handle_IRQ_event+0x30/0x70
  [<c0132d72>] __do_IRQ+0xd2/0x160
  [<c011ac3d>] __do_softirq+0x7d/0x90
  [<c0104ad9>] do_IRQ+0x19/0x30
  [<c010309e>] common_interrupt+0x1a/0x20
Code: Bad EIP value
<0> Kernel panic - not syncing: Fatal exception in interrupt
```

Oddly enough, the EIP value seems to have been somehow corrupted this time, and timer_interrupt now appears in the call trace. Not that I know what difference that makes, but anyway, something did change. It looks like this problem can't be solved... I'll just wait patiently for the next kernel, and hope all my problems go away. Or was that booze?
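
One way to see why this EIP looks wrong, sketched in Python. This is an illustration rather than anything from the thread, and it assumes the default 3G/1G address split on 32-bit x86, where kernel code lives at or above `0xC0000000`; a kernel built with a different split would need a different constant. The helper name `is_kernel_address` is made up.

```python
# Assumption: default 3G/1G split on 32-bit x86 (kernel mapped from 0xC0000000).
PAGE_OFFSET = 0xC0000000

def is_kernel_address(addr_hex):
    """Return True if the hex address falls in the assumed kernel range."""
    return int(addr_hex, 16) >= PAGE_OFFSET

# EIP values from the first dump and from this one.
for eip in ("c01072bf", "4911eca0"):
    if is_kernel_address(eip):
        where = "kernel range"
    else:
        where = "below PAGE_OFFSET (not kernel code)"
    print(eip, "->", where)
```

Under that assumption, the first EIP sits in kernel text while `4911eca0` does not, which would fit the "Bad EIP value" message: the CPU was running in kernel mode but had jumped to an address where no kernel code lives.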

BitJam: Unfortunately, I don't have a LiveCD on hand to be able to run your suggested test, so I guess I'll just wait for 2.6.12 to be released, and see if it makes any difference.

----------

## Moloch

It looks like a possible issue with preempt and nvidia. I would recommend recompiling your kernel with preempt off. Then of course recompile the nvidia drivers. In general, preempt always seems to cause problems.
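
To double-check whether a given kernel config actually has preemption switched on before and after rebuilding, something like the following Python sketch could help. It is an illustration only: the helper name `preempt_enabled` is made up, and `/usr/src/linux/.config` is an assumed location for the config of the kernel in question, so adjust the path as needed.

```python
from pathlib import Path

def preempt_enabled(config_text):
    """Return True if the kernel .config text contains CONFIG_PREEMPT=y.

    An exact line match is used so that related options such as
    CONFIG_PREEMPT_BKL are not mistaken for CONFIG_PREEMPT itself.
    """
    return any(line.strip() == "CONFIG_PREEMPT=y"
               for line in config_text.splitlines())

# Disabled options show up in .config as a "# ... is not set" comment.
print(preempt_enabled("CONFIG_PREEMPT=y\n"))
print(preempt_enabled("# CONFIG_PREEMPT is not set\n"))

# Assumed path; only read if it is actually present on this machine.
config_path = Path("/usr/src/linux/.config")
if config_path.is_file():
    print("preempt enabled:", preempt_enabled(config_path.read_text()))
```

The `PREEMPT` line near the top of the oops dumps above is another sign that the crashing kernel was built with preemption enabled.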

----------

## Mamour

Moloch, I compiled a 2.6.11.7 kernel without preemption, and Mozilla just compiled successfully. Looks like your solution works like a charm!  :Very Happy: 

I guess I should also try compiling it with preemption but without the nvidia module loaded, but I'm happy it just works, so I'll leave that for another day. Do you think I should report this issue to the kernel newsgroup or something?

Thanks a lot!  :Wink: 

----------

