# Random kernel panics

## Heissi

I have recently been having problems with random kernel panics on my server/router.

Some background information:

It is an EPIA PD with a VIA C3 CPU.

I only installed a few daemons on it; there is nothing special.

At first, the server just froze and I noticed that the CPU fan wasn't spinning anymore. So I replaced the fan and the hardware seemed to be ok, but then these kernel panics started at random times (10-200 minutes after boot).

The BIOS battery was also low on voltage (BIOS settings and system clock had been reset), so I replaced it, but the kernel panics were still there.

Then I tested the CPU with cpuburn and a few emerge runs, but... nope... The panic happened after the emerge process had finished (no CPU load at that point).

The RAM seems to be ok too (memtest86+).

Finally I replaced the hard disk (I cloned the system), but that didn't resolve anything either.

I don't think the kernel is broken, because the system ran for 90 days without any problems.

What should I do now? 

I'm really inexperienced with kernel panics.

Is there a way to trace the problem back to its source (maybe the mainboard)?

I don't know which information about the system is relevant, so if you need anything, just ask.

Thanks.

----------

## pathfinder

Try recompiling the kernel from your existing config file.

Was your config file changed lately?

Back it up, then cd /usr/src/linux and make menuconfig.

You can't boot your computer, is that right?

Maybe it is due to the clock. I think I remember that in the handbook, where you compile your kernel for the first time, there was a warning saying that you ought to be sure the date is correct before proceeding. Maybe the fact that your date was wrong made a huge mess.

I would definitely set the correct date now that the battery has been changed, and then recompile your kernel as it is now, just to see what happens.

----------

## Heissi

I upgraded from hardened-sources-2.6.23 to hardened-sources-2.6.23-r7.

While I was testing something I got this message:

```
invalid opcode: 0000 [#1]
Modules linked in: thermal button processor
CPU:    0
EIP:    0060:[<c04e0739>]    Not tainted VLI
EFLAGS: 00010002   (2.6.23-hardened-r7 #1)
EIP is at elv_rb_add+0x1/0x51
eax: ddf9bd64   ebx: ddf9bd4c   ecx: ddf9bd4c   edx: d231be84
esi: ddf8f9c0   edi: d231be84   ebp: 00000000   esp: c6887b3c
ds: 007b   es: 007b   fs: 0000  gs: 0033  ss: 0068
Process mconf (pid: 29677, ti=c6886000 task=c4bf8ab0 task.ti=c6886000)
Stack: ddf8f9c0 c04e8772 d231be84 ddf9bd4c ddf8f9c0 c04e9808 d231be84 ddf92ad0
       00000008 c04e0b22 ddf92b30 0005ffbe 00000086 d231be84 ddf92ad0 00000008
       00000000 c04e3c8e 00000000 00000000 d231be84 c14512a0 c14512a0 00000008
Call Trace:
 [<c04e8772>] cfq_add_rq_rb+0x3c/0x74
 [<c04e9808>] cfq_insert_request+0x1c/0x3a
 [<c04e0b22>] elv_insert+0xa4/0x141
 [<c04e3c8e>] __make_request+0x28c/0x2b6
 [<c04e3eb0>] generic_make_request+0x17e/0x1ab
 [<c046af47>] bio_add_page+0x31/0x37
 [<c046dad8>] mpage_end_io_read+0x0/0x5e
 [<c04e3f82>] submit_bio+0xa5/0xac
 [<c046dad8>] mpage_end_io_read+0x0/0x5e
 [<c046dbaf>] mpage_bio_submit+0x19/0x1d
 [<c046e104>] mpage_readpages+0x10f/0x11c
 [<c04852d4>] ext3_get_block+0x0/0xbe
 [<c05f82fd>] io_schedule+0xe/0x16
 [<c05f8421>] __wait_on_bit+0x4a/0x51
 [<c05f8496>] out_of_line_wait_on_bit+0x6e/0x76
 [<c0467d3f>] sync_buffer+0x0/0x2e
 [<c0439ffd>] buffered_rmqueue+0xbf/0xd7
 [<c043bc31>] read_pages+0x28/0xd3
 [<c04852d4>] ext3_get_block+0x0/0xbe
 [<c043a1b6>] __alloc_pages+0x51/0x2a4
 [<c043bde5>] __do_page_cache_readahead+0x109/0x123
 [<c043bef4>] ra_submit+0x20/0x25
 [<c043c054>] page_cache_sync_readahead+0x2a/0x2f
 [<c043703d>] do_generic_mapping_read+0xda/0x3ff
 [<c04375c7>] generic_file_aio_read+0x11f/0x14a
 [<c0437362>] file_read_actor+0x0/0xda
 [<c044ea5b>] do_sync_read+0xbe/0xfb
 [<c0423a42>] autoremove_wake_function+0x0/0x33
 [<c04106f4>] do_page_fault+0x2a7/0x5c7
 [<c044e282>] nameidata_to_filp+0x23/0x32
 [<c044eb21>] vfs_read+0x89/0x104
 [<c044edde>] sys_read+0x41/0x67
 [<c0403c9d>] sysenter_past_esp+0x66/0x99
 [<c0403cb6>] sysenter_past_esp+0x7f/0x99
 =======================
Code: 48 04 c7 42 3c 00 00 00 00 c7 43 04 00 00 00 00 eb 0e 8b 42 24 03 42 1c 39 f0 75 04 89 d0 eb 06 89 f8 eb a9 31 c0 5b 5e 5f c3 56 <89> c1 89 c6 53 31 db 83 38 00 74 22 8b 19 8d 4b bc 8b 41 1c 39
```

Then I had to reboot, because the system was screwed up (like 10 defunct processes).

Looks like a memory or CPU error, doesn't it?

But I trust memtest86+, and the CPU heatsink wasn't really hot (why isn't there a sensor on the CPU?), so I removed the thermal paste and put on some new one - just to be sure.

Unfortunately, recompiling the kernel didn't solve the problem.
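When several panics have been captured, it can help to pull the function names out of each trace and compare them. Below is a minimal Python sketch that does this for trace lines in the format shown above; the function name `parse_call_trace` and the regular expression are illustrative assumptions, not part of any kernel tooling.

```python
import re

# Matches trace lines such as " [<c04e8772>] cfq_add_rq_rb+0x3c/0x74".
# The pattern is an assumption based on the 32-bit oops format shown in this thread.
FRAME_RE = re.compile(r"\[<([0-9a-f]{8})>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def parse_call_trace(oops_text):
    """Return (address, symbol, offset, size) tuples in the order they appear."""
    frames = []
    for line in oops_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            addr, symbol, offset, size = match.groups()
            frames.append((int(addr, 16), symbol, int(offset, 16), int(size, 16)))
    return frames


# A few lines copied from the oops above. The "EIP is at" line has no
# bracketed address, so it is skipped by the pattern.
sample = """\
EIP is at elv_rb_add+0x1/0x51
 [<c04e8772>] cfq_add_rq_rb+0x3c/0x74
 [<c04e9808>] cfq_insert_request+0x1c/0x3a
 [<c04e0b22>] elv_insert+0xa4/0x141
"""

for addr, symbol, offset, size in parse_call_trace(sample):
    print(f"{addr:#010x}  {symbol} (+{offset:#x} of {size:#x})")
```

In this trace the top frames (`cfq_*`, `elv_*`) belong to the CFQ I/O scheduler and the block layer. If later panics land in unrelated code each time, that would point more towards hardware than towards one specific kernel bug, though symbol names alone cannot prove either.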

----------

## pathfinder

Well, have you tried another distro? Or Windows?

Just to find out whether it is a hardware problem or software related?

Does cat /proc/cpuinfo give you anything?

Try cat /proc/whatever as well, just to get some extra info.

dmesg might also say something, and so might /var/log/messages.

I can't really tell you anything else right now.
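Going through a long log by hand is tedious, so here is a rough Python sketch that filters log lines for typical kernel error markers. The marker list is an assumption: "invalid opcode" is taken from the oops earlier in this thread, the others are common kernel message prefixes, and the helper names are made up for illustration.

```python
import re

# Assumed marker strings; extend the list to match what your kernel prints.
MARKERS = re.compile(r"invalid opcode|Kernel panic|Oops|\bBUG:|Call Trace")


def find_kernel_errors(lines):
    """Return (line_number, text) for every line that contains a marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if MARKERS.search(line)
    ]


def scan_log(path="/var/log/messages"):
    """Scan a log file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as log:
        return find_kernel_errors(log)


# Small in-memory demonstration with made-up log lines.
demo = [
    "Jan  1 12:00:00 router kernel: eth0: link up\n",
    "Jan  1 12:41:07 router kernel: invalid opcode: 0000 [#1]\n",
    "Jan  1 12:41:07 router kernel: Call Trace:\n",
]
for number, text in find_kernel_errors(demo):
    print(f"{number}: {text}")
```

On the real machine you would call `scan_log()` as root. Keep in mind that a panic often freezes the system before anything reaches the disk, so an empty result does not mean nothing happened.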

----------

## Heissi

*pathfinder wrote:*

> Well, have you tried another distro? Or Windows?
>
> Just to find out whether it is a hardware problem or software related?
>
> Does cat /proc/cpuinfo give you anything?
> ...

I tried to install Windows (I have installed it before, so it has to work) and I got a bluescreen. Some interrupt error (IRQL_NOT_LESS_...).

The kernel panic message was similar to this (interrupt exception).

So I can't do anything but buy a new Mini-ITX mainboard, right?

----------

## pathfinder

Well, that looks like a real hardware failure... :S

I can't really tell you which part, though.

Is your mobo still under warranty? That could be useful here...

----------

## gundelgauk

Yes, sounds like faulty hardware. Since you already ruled out RAM and hard drive, it could be the CPU or mainboard. You said yourself that the first time your system froze was when the CPU fan died. Maybe the processor took some damage when that happened.

Apart from that: memtest showing no errors cannot guarantee that your RAM is 100% OK. If it does show errors, your RAM is faulty. But it doesn't work the other way round. It might be that your RAM only produces errors when a very specific pattern is written to (or read from) a very specific address. And if memtest does not test exactly this pattern, no error will show up but you still have faulty RAM.
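To make this concrete, here is a toy Python simulation. It is only a sketch under invented assumptions: the `FlakyRAM` class, the faulty address and the faulty pattern are all made up, and real testers such as memtest86+ run far more patterns than the four used here. The point is just that a write/read-back test only catches a fault if it happens to hit the triggering combination.

```python
class FlakyRAM:
    """Simulated memory with one cell that corrupts one specific bit pattern."""

    def __init__(self, size, bad_address, bad_pattern):
        self.cells = [0] * size
        self.bad_address = bad_address
        self.bad_pattern = bad_pattern

    def write(self, address, value):
        # Hypothetical fault: only this address/pattern pair gets a flipped bit.
        if address == self.bad_address and value == self.bad_pattern:
            value ^= 0x01
        self.cells[address] = value

    def read(self, address):
        return self.cells[address]


def run_test(ram, size, patterns):
    """Write each pattern to every address, read it back, collect mismatches."""
    failures = []
    for pattern in patterns:
        for address in range(size):
            ram.write(address, pattern)
            if ram.read(address) != pattern:
                failures.append((address, pattern))
    return failures


SIZE = 1024
ram = FlakyRAM(SIZE, bad_address=0x2A7, bad_pattern=0xB6)

common_patterns = [0x00, 0xFF, 0xAA, 0x55]
print(run_test(ram, SIZE, common_patterns))           # prints []
print(run_test(ram, SIZE, common_patterns + [0xB6]))  # prints [(679, 182)]
```

The first run reports a clean pass although the simulated module is broken. Only the second run, which includes the triggering pattern, exposes the bad cell.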

----------

