# kernel 2.6.18 and 2.6.19 = unstable rubbish [SOLVED sortof]

## PantsMan

What is with the kernel these days? It's hopeless.

I run a pretty standard system: Intel P4, Intel motherboard, 1 GB RAM, nvidia 6600GT, xorg-server-1.1.1-r3, nvidia-drivers-1.0.9746 (9723 and 9731 behave the same), gcc-4.1.1 etc.

I've had my Gentoo system running sweet for 3 years. So I'm no n00b to Gentoo. 

But now all I get is hard lock after hard lock, with 2.6.18-r2, 2.6.18-r6, 2.6.19-r2 etc. They have happened in a variety of situations, even when I'm not doing much.

But lockups are much more frequent in this scenario:

I have mplayer running on my tv head, 

and p2p download coming in, and then,

on my main screen I switch virtual desktops in any window manager (e17, fluxbox, kwin, it doesn't matter).

I'm getting fully sick of it. I never used to have these problems before I upgraded from 2.6.12, xorg 6.8 etc. And now, gentoo-sources kernels prior to 2.6.18 have been removed from portage. I know portage is already bloated enough and we don't want too much old cruft left in, but for something as important as the kernel, wouldn't it be nice to have a few more options rather than just 2.6.18 and 2.6.19? I s'pose I'm expected to go use vanilla-sources or something... it's ridiculous.

People will probably also say that it's an nvidia bug, but that is rubbish. I've also had these lockups with the open source nv driver (running 1 head, as nv won't drive my tv head). And I've disabled every fancy nvidia option in xorg.conf, including render acceleration.

So, some sort of kernel bug is being exposed here... and it doesn't seem to be getting fixed... and it shouldn't be up to me to waste my time fixing it. If every Linux user has to waste their life debugging kernel panics, Gentoo, and Linux in general, is in trouble...

PS: please don't feel the need to reply to this post unless you want to agree with me. I know no-one will give a crap unless they have this problem. Please allow me to vent my spleen in peace.

Last edited by PantsMan on Sun Jan 07, 2007 5:54 pm; edited 1 time in total

----------

## pvangarde

It's a mystery, I can only successfully run ndiswrapper for my wireless on .17.

You've probably checked this, but does dmesg report anything weird?

----------

## PaulBredbury

 *PantsMan wrote:*   

> I've had my Gentoo system running sweet for 3 years.

 

Then revert back to what worked, and please don't moan about being unwilling to use your own local overlay to achieve that. It's all available online.

----------

## madisonicus

1) Slow down a bit.... You're obviously very frustrated.  No one here is trying to break your comp.  Also, Linux kernels aren't important enough to raise your blood pressure.

2) No need to upgrade if things are working for you.  You can always grab the 2.6.12 sources from kernel.org and use them as long as you want.  Being culled from portage is no reason to stop using something (see xmms).

3) Spleen-venting here: https://forums.gentoo.org/viewforum-f-7.html

3) A lot has changed between 2.6.12 and 2.6.19, if you'd like help going through your hardware and the new kernel options then all you have to do is ask.

4) Have you double-checked heat and memory problems?  Older systems often develop problems with clogged vents and defunct RAM chips.

-m

----------

## erik258

i bet you don't even need an overlay, just set up /etc/portage/package.* to use a certain kernel version.  
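A minimal sketch of what that could look like, assuming you want to stay on gentoo-sources 2.6.17-r8 (the version is only an example, and the helper name `mask_newer_than` is made up for illustration). Masking everything newer in /etc/portage/package.mask should keep emerge from pulling in a later kernel.

```shell
# Hypothetical helper: append a ">category/package-version" atom to a mask
# file so Portage ignores anything newer than the given version.
mask_newer_than() {
    pkg="$1"      # e.g. sys-kernel/gentoo-sources
    ver="$2"      # e.g. 2.6.17-r8 (example only, use your known-good version)
    file="$3"     # normally /etc/portage/package.mask
    atom=">${pkg}-${ver}"
    # Only add the line if it is not already present.
    grep -qxF -- "$atom" "$file" 2>/dev/null || printf '%s\n' "$atom" >> "$file"
}

# Usage (as root):
#   mask_newer_than sys-kernel/gentoo-sources 2.6.17-r8 /etc/portage/package.mask
#   emerge -av '=sys-kernel/gentoo-sources-2.6.17-r8'
```

This only helps while the wanted version is still in the tree; once it is gone you are back to an overlay or kernel.org.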

for the record, nvidia+Xorg 7.1.1 + amd64 works fine for me with 2.6.18-r[3-6].  I would wager you're running a dual-core box and if so, yes, you're right, the kernel's at fault and you ought to downgrade.

----------

## PantsMan

 *erik258 wrote:*   

>  I would wager you're running a dual-core box and if so, yes, you're right, the kernel's at fault and you outght to downgrade.

 

Well, I'm not quite running a dual core, but I am running a P4 with hyperthreading on, which appears as two CPUs!

Erik, would you please provide some more information on what kernel problem is causing these nightmares so I can switch off hyperthreading or downgrade to the appropriate kernel, track the problem with the SMP or whatever until it is resolved, and then upgrade? It would be greatly appreciated.

PS: I do apologise to all for the tone of my post. I admit I have a tendency to shoot my mouth off without thinking, sorry. I'll take Madisonicus' advice too, and vent my spleen to the chat forum next time  :Smile: 

Thanks for the responses.

----------

## PantsMan

 *pvangarde wrote:*   

> You've probably checked this, but does dmesg report anything weird?

 

Unfortunately not. When it locks it locks suddenly, nothing goes into the logs, I can't switch to console, and I cannot even ssh into the box, so the bug is causing a full kernel panic and the system is completely gone. All I can do is the ctrl-alt-prtscrn rseiub sequence to sync disks etc. and reboot.
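For anyone who wants to check that the SysRq keys will actually respond before the next lockup, here is a small sketch. It assumes the kernel was built with Magic SysRq support (CONFIG_MAGIC_SYSRQ); the function name is made up, and the path argument exists only so it can be pointed at a test file.

```shell
# Report whether Magic SysRq is switched on. Defaults to the real /proc entry.
sysrq_state() {
    file="${1:-/proc/sys/kernel/sysrq}"
    val=$(cat "$file" 2>/dev/null) || { echo "unavailable"; return 1; }
    if [ "$val" = "0" ]; then
        echo "disabled"
    else
        # Any non-zero value means at least some SysRq functions are allowed.
        echo "enabled ($val)"
    fi
}

# To switch it on for the running kernel (as root):
#   echo 1 > /proc/sys/kernel/sysrq
# The commonly quoted key order for an emergency reboot is R E I S U B:
#   unRaw, tErminate, kIll, Sync, Unmount (remount read-only), reBoot
```

If the keys still respond during a lockup, the kernel is at least still servicing keyboard interrupts, which is a useful data point for a bug report.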

----------

## erik258

there's been a few posts on the forums as to random X lockups on SMP boxes, mostly i've seen dual-core amd64 boxes doing it.  I'm not sure whether it's their mistake or the kernel's, but i will say that I had the same problems on an old dual-p2 box i had running X for a while.  The random lockups went away when I stopped running X on it and converted it to a webserver ; )

from http://www.cyberciti.biz/howto/question/static/linux-kernel-parameters.php :

        nosmp           [SMP] Tells an SMP kernel to act as a UP kernel.

try passing that option when booting the kernel; it will probably interfere with hyperthreading, but see if it works better.  if not, you can happily ignore my post and this ugly problem.
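A rough sketch of how to confirm the option took effect after rebooting. The function names are made up for illustration, and the kernel file name and root device in the example GRUB line are assumptions, so substitute your own.

```shell
# Check whether a given word appears on a kernel command line.
# With no second argument it reads the running kernel's /proc/cmdline.
has_boot_param() {
    param="$1"
    cmdline="${2:-$(cat /proc/cmdline 2>/dev/null)}"
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *)            return 1 ;;
    esac
}

# Count the logical CPUs the running kernel brought up.
cpu_count() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Example grub.conf kernel line (file name and root device are assumptions):
#   kernel /boot/kernel-2.6.18-gentoo-r6 root=/dev/hda3 nosmp
# After booting with it:
#   has_boot_param nosmp && echo "nosmp active, CPUs online: $(cpu_count)"
```

With nosmp active on a hyperthreaded P4, the CPU count should drop from 2 to 1.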

----------

## PantsMan

 *madisonicus wrote:*   

> 
> 
> 2) No need to upgrade if things are working for you.  You can always grab the 2.6.12 sources from kernel.org and use them as long as you want.  Being culled from portage is no reason to stop using something (see xmms).
> 
> 4) Have you double-checked heat and memory problems?  Older systems often develop problems with clogged vents and defunct RAM chips.
> ...

 

Yep, I've cleaned the fans on the 6600GT and CPU, and I've got my case off and a fan blowing on the PC, so I'm pretty confident it's not heat. I'm a smoker, and tar in the air causes fan blades to clog up badly, so cleaning fans was one of the first things I did. As for RAM or other hardware problems (e.g. aging capacitors), I just don't think so. The problems happened immediately on upgrade to the new kernel/xorg.

The reason I posted here is that I have Googled all over the internet and not found anyone with similar problems. There are some people having problems, but not with nice safe x86 (32 bit) on good hardware. I'm not using x86-64 or amd64 or any bleeding edge or rare hardware. And 2.6.18 isn't the bleeding edge of kernels now anyway. It's up to -r6, and we now have 2.6.19-r2 and 2.6.20 (in mm-sources).

I am aware that I can create my own portage overlay, but what I mean is... I shouldn't have to. If portage only has 2.6.18 and 2.6.19 (gentoo-sources), then that's all I should need, because at least one of those 2 options should be pretty much rock solid on standard hardware. Instead, they're both completely flaky, and yet no-one seems to post any problems with them!

Also, take for example, what nvidia says at: http://www.nvnews.net/vbulletin/showthread.php?t=58498

That thread is not much use to me, as the kernel bugs discussed there are old, and nvidia advise upgrading to newer kernels to avoid them, not revert to old ones...

If there IS a problem with these kernels and dual-core systems like Erik suggested, surely this is pretty major! Dual-core is the current standard... and it does not make sense to tell people with a brand new dual-core processor to switch off SMP and waste 1 core, or to use an old kernel in order to avoid some kernel bug. It's just unfeasible, because the old kernels won't have support for the new motherboard's whizz-bang new gizmo SATA controller or audio chip, or whatever. What hope will there be for a Linux kernel that can't run SMP when quad-core processors are the norm (and that isn't far away)?

I'm just extremely disappointed. It looks to me like stability is being thrown out the window with 2.6 series kernels.

Linux is now the new Windows, but worse, because the kernel is now so buggy it won't even give you a nice blue screen of death with some information on it. It just completely locks without providing ANY information as to why (unless you go to unusual lengths, like having console messages and kernel printk's logged out the serial port to another PC).

If I may revive the monolithic kernel vs microkernel debate, I think this is proof that we might as well stop wasting time on rubbish monolithic kernels like Linux. It's time to throw Linux in the bin and switch to a microkernel OS like Mac OS X.

Oops, I've started my spleen-venting again, sorry. I'll shut up now.  :Sad: 

----------

## PantsMan

 *erik258 wrote:*   

> there's been a few posts on the forums as to random X lockups on SMP boxes, mostly i've seen dual-core amd64 boxes doing it.  I'm not sure whether it's their mistake or the kernel's, but i will say that I had the same problems on an old dual-p2 box i had running X for a while.  The random lockups went away when I stopped running X on it and converted it to a webserver ; )
> 
> from http://www.cyberciti.biz/howto/question/static/linux-kernel-parameters.php :
> 
>         nosmp           [SMP] Tells an SMP kernel to act as a UP kernel.
> ...

 

LOL - yeah lots of problems go away when you don't load X or nvidia modules  :Smile:  But that doesn't (necessarily) mean it's the fault of X, or nvidia... it's the bloody kernel.

OK, yep, I will try disabling hyperthreading in the BIOS and using nosmp to switch the kernel to UP. Or go back to an old kernel.

That won't be much of a solution for a dual-core box though, as it would mean 1 core is wasted. Or 3 out of 4 cores wasted on a quad-core processor, when they become common...  I wish someone would be so kind as to just find the bug and fix it, instead of pretending it's not there...

----------

## timeBandit

A suggestion, for the continued health of your spleen  :Smile: : In future, when updating kernel sources, never ever unmerge your current kernel source tree until you've configured, built and tested the new one to your satisfaction. Then it won't matter if you have to roll back and it's gone from Portage. Plus, since Portage doesn't actually build the kernel, even if it leaves the tree it has to get seriously out of date (e.g., your profile becomes deprecated) to become a problem.
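The same idea applies to the built image in /boot: give each kernel its own file name and never overwrite the one you know works. A minimal sketch, assuming an x86 build and a hypothetical helper name; the paths in the comments are examples only.

```shell
# Hypothetical helper: copy a freshly built kernel image into /boot under a
# versioned name, refusing to overwrite an existing file, so older kernels
# stay bootable from the boot loader menu.
install_kernel_copy() {
    image="$1"             # e.g. /usr/src/linux/arch/i386/boot/bzImage
    version="$2"           # e.g. 2.6.19-gentoo-r2
    bootdir="${3:-/boot}"
    target="$bootdir/kernel-$version"
    if [ -e "$target" ]; then
        echo "refusing to overwrite $target" >&2
        return 1
    fi
    cp "$image" "$target" && echo "$target"
}

# Usage (as root, with /boot mounted):
#   install_kernel_copy /usr/src/linux/arch/i386/boot/bzImage 2.6.19-gentoo-r2
# Then add a new boot loader entry for the new file and keep the old entry.
```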

Please forgive if I've misunderstood you--I couldn't quite tell whether you were upset on general version management principles, or specifically that you needed to re-fetch a kernel that was retired "too soon."

----------

## erik258

 *Quote:*   

> Instead, theyre both completely flaky, and yet no-one seems to post any problems with them! 

 

Your assumption is that they're both completely flaky even though no-one seems to post any problems with them... maybe no-one has problems with them?  I certainly don't.  I don't have a p4 with hyperthreading either.  

 *Quote:*   

> 
> 
> its the bloody kernel.

 

Another assumption.  X is just as complicated as the kernel.  And compared with X, the kernel makes up a small part of system activities.  Not that it couldn't be the kernel.

----------

## PantsMan

 *erik258 wrote:*   

> 
> 
> Your assumption is that theyre both completely flaky even though no-one seems to post any problems with them... maybe no-one has problems with them?  I certainly don't.  I don't have a p4 with hyperthreading either.  
> 
>  *Quote:*   
> ...

 

Well, I hope my post can bring some more people who are having problems with these kernels out of the woodwork. I can't be the only one.

Also, again: when this happens, I cannot even ssh into the box anymore. No Xvnc either. Nothing. No pulse. It's dead.

So the reason I say it's the bloody kernel is, if it's just an X crash, there should be no way for it to take the whole system including the kernel with it. If it does, there must be a kernel bug involved, not (just) an X bug. Of course, it could be the nvidia kernel module which actually contains the bug... but it also seems to happen with the standard kernel nv module (although it is more difficult to reproduce). There must be something more than just an nvidia bug here.

So... I don't know, I'd love to help find this bug and fix it, but I just can't afford to waste time on this. I was hoping someone would tell me it is a known bug, so I can just fall back to an old kernel and wait for it to be fixed. Oh well, perhaps I can narrow things down a bit more and provide enough information that someone will be able to see the answer quickly.

I need to do some further investigation obviously, before I waste any more of anyone else's time.

thanks again for the suggestions ppl.

----------

## erik258

best of luck!

----------

## wuzzerd

I had problems with these until I got my .config right.   :Very Happy: 

----------

## baigsabeeh

I just started having the problem.  I moved down to X86 from X64 due to conveniences that X86 has for the time being.  I used the Conrad install guide and now every kernel I've tried is basically dead, but then again, I did try them from the custom-kernels overlay.  I think I'm going to just go and do a Gentoo install and then go back to the Gentoo kernel as that is what worked, and no more Reiser4 for me.

----------

## ly-cilph

I've been having a similar problem. I just bought a second usb hard drive cradle and ide drive to fit in it, after my first one working so well. After my main hard drive dying from what I think is running xfs on an ide partition I switched to reiserfs on my / partition and on the new hard drive cradle. 

I've got about 200G on it at the moment and it seems fine, but now when I try to put more files on there (500M or so each) the system hangs (define hanging as no data being transferred, a watch on ls -l on another terminal window stops updating, opening a new xterm and trying to ls the directory hangs, but switching to another desktop works until the firefox window won't update). 

top tells me that my cpu and ram usage are normal.

Copying the same file from one reiser to another reiser partition on my main drive works quickly and perfectly. 

I'm wondering if it's something to do with the usb-to-ide driver or whether there's some problem with reiser and large filesizes (most files on the drive being 100M or more). Of course, it could be a dodgy drive.

Any ideas?

Oh, and I'm running an athlon xp 2600 with 1G ram which is three years old.

Edit: Just tried copying a file from ide drive to usb drive with xfs partition, this hangs as well. Oh well, it looks like it might be a usb kernel issue after all.

----------

## joaander

I don't know if https://forums.gentoo.org/viewtopic-t-364637-highlight-.html bug is still in the latest versions of the kernel, but the last time I took out the setting kernel.randomize_va_space=0 my system crashed within minutes.
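For reference, a small sketch of how to look at and change that setting. The function name is made up, and the optional path argument is only there so it can be tested against a plain file.

```shell
# Print the current setting; 0 means address space randomization is off.
va_randomization() {
    cat "${1:-/proc/sys/kernel/randomize_va_space}"
}

# To change it for the running kernel (as root):
#   sysctl -w kernel.randomize_va_space=0
# To keep it across reboots, add this line to /etc/sysctl.conf:
#   kernel.randomize_va_space = 0
```

Turning randomization off is a workaround rather than a fix, so it is worth re-testing with the default after each kernel upgrade.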

----------

## net-0

I'm not sure if it's related, but it sounds like the issue I'm having. I just did a fresh install with mostly up-to-date packages... and I'm getting lockups....

I have an Athlon 3800 X2, nForce 4, and nvidia 7900GS... and it seems to lock up randomly....

I have been looking through the forums and, funnily enough, found this thread when I searched for lockups   :Shocked: 

For giggles here is my emerge info

```
Portage 2.1.2_rc4-r3 (default-linux/x86/no-nptl, gcc-4.1.1, glibc-2.5-r0, 2.6.18-gentoo-r6 i686)
=================================================================
System uname: 2.6.18-gentoo-r6 i686 AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
Gentoo Base System version 1.12.8
Last Sync: Wed, 03 Jan 2007 19:30:01 +0000
dev-lang/python:     2.4.4
dev-python/pycrypto: 2.0.1-r5
sys-apps/sandbox:    1.2.18.1
sys-devel/autoconf:  2.13, 2.61
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10
sys-devel/binutils:  2.17
sys-devel/gcc-config: 1.3.14
sys-devel/libtool:   1.5.22
virtual/os-headers:  2.6.19
ACCEPT_KEYWORDS="x86 ~x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-march=athlon64 -O2 -pipe -fomit-frame-pointer -msse3"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/share/X11/xkb /usr/share/config"
CONFIG_PROTECT_MASK="/etc/env.d /etc/gconf /etc/revdep-rebuild /etc/terminfo"
CXXFLAGS="-march=athlon64 -O2 -pipe -fomit-frame-pointer -msse3"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig distlocks metadata-transfer sandbox sfperms strict"
GENTOO_MIRRORS="http://distfiles.gentoo.org http://distro.ibiblio.org/pub/linux/distributions/gentoo"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="3dnow 3dnowext X alsa apm arts automount berkdb bitmap-fonts cdr cdrom chroot cli cracklib crypt cups directfb dlloader dri dvbplayer dvd dvdr dvdread eds emboss encode fbsplash ffmpeg firefox foomaticdb fortran gaim gdbm gdm gif gimp gpm gstreamer gtk gtk2 iconv imlib ipv6 isdnlog jpeg kde libg++ libwww mad mikmod motif mp3 mpeg mplayer ncurses nls ntfs nvidia ogg opengl oss pam pcre perl png pppd python qt3 qt4 quicktime readline reflection sdl session spell spl ssl tcpd truetype truetype-fonts type1-fonts vim vncviewer vorbis webmin-minimal x86 xml xorg xscreensaver xterm xv xvid zlib" ALSA_CARDS="intel8x0" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mulaw multi null plug rate route share shm softvol" ELIBC="glibc" INPUT_DEVICES="keyboard mouse" KERNEL="linux" USERLAND="GNU" VIDEO_CARDS="nv nvidia vesa"
Unset:  CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, LINGUAS, PORTAGE_RSYNC_EXTRA_OPTS, PORTDIR_OVERLAY
```

I'm using kernel 2.6.18, which I heard is the problem, and upgrading to 2.6.19 can apparently fix the lockups. So I have been trying to get 2.6.19 to work, but for some reason when I try to boot it, it doesn't pick up my root partition.

Anyways I'll keep my eye on this thread... and good luck.

Also, can someone post a solid configuration? I haven't really used Gentoo in a while...

----------

## wuzzerd

Quoting myself, lol...

 *wuzzerd wrote:*   

> I had problems with these until I got my .config right.  

 

I take that back. 

About a year and a half ago I thrashed through freezing problems and random lines appearing on the screen and found the solution was to disable CONFIG_AGP in my .config.  This is no longer an option in menuconfig.  When I edit .config by hand and comment it out or set it to N it automagically gets reset to Y when I run make.
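One possible explanation (an assumption, nothing in this thread confirms it) is that some other enabled option forces AGP on, in which case the build's config pass will keep rewriting a hand edit. A quick sketch for checking what the build actually ended up with; the function name is made up.

```shell
# Print how an option is set in a kernel .config: y, m, or a literal value.
# Prints "n" when the option is commented out ("# CONFIG_FOO is not set")
# or does not appear at all.
config_value() {
    opt="$1"
    file="${2:-.config}"
    val=$(sed -n "s/^${opt}=//p" "$file" | head -n 1)
    echo "${val:-n}"
}

# Usage, from the kernel source directory after running make:
#   config_value CONFIG_AGP
```

If it keeps coming back as y, searching the option's help in menuconfig (press / and type AGP) should show which other options select it.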

Looking inside the box the AGP slot is still empty. 

Well, I've minimized AGP memory in the BIOS and given my onboard VGA a ton of RAM and there have been no freezes, although programs like links2 -g [graphics mode] leave little artifacts on the screen which can be erased by moving a window over them. KDE apps like konqueror and konsole don't mess things up.

Food for thought for those of you more learned in the ins and outs of graphics drivers.

----------

## ly-cilph

I've rebooted into the 2.6.19-r2 kernel, configured as close to my 2.6.18-r3 as I can, and copying seems to work OK now (no lockups over three trials).

Hopefully I didn't just luck out and it stops at 4 tries.

----------

## net-0

I don't understand it... I can't get my kernel to load sata... so I can't mount my / partition.

I looked through a walkthrough for this but no luck...

http://gentoo-wiki.com/HARDWARE_SATA

When I run genkernel --menuconfig all and go to build SATA into the kernel, I don't see the option to do so... it's supposed to be in the SCSI section but it doesn't exist...

I did this

```
Device Drivers  --->
  SCSI device support  ---> 
    <*> SCSI device support
    <*>   SCSI disk support
    <*>   SCSI generic support <<not needed!>>
```

however I can't do this 

```
Device Drivers --->
  SCSI device support --->
    SCSI low-level drivers --->
     [*] Serial ATA (SATA) support
     < >   ServerWorks Frodo / Apple K2 SATA support (EXPERIMENTAL)
     < >   Intel PIIX/ICH SATA support
     < >   NVIDIA SATA support
     < >   Promise SATA TX2/TX4 support
     < >   Promise SATA SX4 support
     < >   Silicon Image SATA support
     < >   SiS 964/180 SATA support
     < >   VIA SATA support
     < >   VITESSE VSC-7174 SATA support
```

Because under SCSI low-level drivers this option doesn't exist for me: [*] Serial ATA (SATA) support

Any ideas?

----------

## net-0

I don't see any of those options

```
[*] Serial ATA (SATA) support 
     < >   ServerWorks Frodo / Apple K2 SATA support (EXPERIMENTAL) 
     < >   Intel PIIX/ICH SATA support 
     < >   NVIDIA SATA support 
     < >   Promise SATA TX2/TX4 support 
     < >   Promise SATA SX4 support 
     < >   Silicon Image SATA support 
     < >   SiS 964/180 SATA support 
     < >   VIA SATA support 
     < >   VITESSE VSC-7174 SATA support 
```

I must have forgotten to add something... any clues? Is there an option I need in my make.conf for SATA?

----------

## aidanjt

SATA support in .18/.19 has been moved to Device Drivers->Serial ATA (prod) and Parallel ATA (experimental) drivers.
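Once the right menu is found, a quick way to confirm the driver really made it into the config is to grep the resulting .config. This is only a sketch: the exact option names differ between kernel versions (older trees use a CONFIG_SCSI_SATA prefix as far as I know, newer ones CONFIG_ATA plus CONFIG_SATA_*), so the pattern is deliberately loose and the function name is made up.

```shell
# List SATA/libata-related options that are built in (=y) or modular (=m).
sata_options() {
    grep -E '^CONFIG_(ATA|SATA_[A-Z0-9_]+|SCSI_SATA[A-Z0-9_]*)=[ym]' "${1:-.config}"
}

# Usage, from the kernel source directory:
#   sata_options
# For a root partition on SATA the controller driver generally needs to be
# =y (built in), unless an initramfs loads the module before root is mounted.
```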

----------

## net-0

haha, I'm so oblivious to the obvious. It was right under SCSI and I didn't notice.... thanks for the info.

----------

## aidanjt

I was scratching my head myself when I went into the SCSI menu :/

----------

## didymos

 *Quote:*   

> wouldnt it be nice to have a few more options rather than just 2.6.18 and 2.6.19 ?????? I spose I'm expected to go use vanilla-sources or something... its ridiculous.

 

Huh?  What are you talking about?

```
eix gentoo-sources:
[I] sys-kernel/gentoo-sources
     Available versions:
        (2.4.32-r7)     !2.4.32-r7
        (2.6.15-r1)     2.6.15-r1
        (2.6.16-r13)    2.6.16-r13
        (2.6.17-r8)     2.6.17-r8
        (2.6.17-r9)     (~)2.6.17-r9
        (2.6.18)        (~)2.6.18
        (2.6.18-r1)     (~)2.6.18-r1
        (2.6.18-r2)     2.6.18-r2
        (2.6.18-r3)     2.6.18-r3
        (2.6.18-r4)     2.6.18-r4
        (2.6.18-r5)     2.6.18-r5
        (2.6.18-r6)     2.6.18-r6
        (2.6.19)        (~)2.6.19
        (2.6.19-r1)     (~)2.6.19-r1
        (2.6.19-r2)     (~)2.6.19-r2
     Installed versions:  2.6.19-r2(2.6.19-r2)(06:28:13 PM 12/13/2006)(-build symlink)
     Homepage:            http://dev.gentoo.org/~dsd/genpatches
     Description:         Full sources including the Gentoo patchset for the 2.6 kernel tree
```

Yeah, no 2.6.12, but certainly more than just .18 and .19.

----------

## PantsMan

 *wuzzerd wrote:*   

> 
> 
> About a year and a half ago I thrashed through freezing problems and random lines appearing on the screen and found the solution was to disable CONFIG_AGP in my .config.  This is no longer an option in [*]menuconfig.  When I edit .config by hand and comment it out or set it to N it automagically gets reset to Y when I run make.
> 
> 

 

I also thought it might have something to do with AGP, so I had a little look into that. But it looks OK. The only thing that puzzles me a little bit is:

cat /proc/driver/nvidia/agp/status

Status:          Enabled

Driver:          NVIDIA

AGP Rate:        8x

Fast Writes:     Disabled

SBA:             Enabled

I'm using agpgart, not nvagp - but the above says Driver: NVIDIA.

I would've thought the proc entry for AGP status should list agpgart as the driver, not NVIDIA... but it says NVIDIA.
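A rough sketch for pulling just that field out, so it can be compared before and after a config change. The function name is made up. As far as I recall from NVIDIA's README (treat this as an assumption and check the README for your driver version), the `NvAGP` option in xorg.conf picks the AGP backend, with 1 meaning NVIDIA's internal AGP support and 2 meaning AGPGART, so `Driver: NVIDIA` would suggest the internal one is active.

```shell
# Print the value of the "Driver:" line from the nvidia AGP status file.
agp_driver() {
    sed -n 's/^Driver:[[:space:]]*//p' "${1:-/proc/driver/nvidia/agp/status}"
}

# Usage:
#   agp_driver                 # prints the backend name, e.g. NVIDIA
#   lsmod | grep agpgart       # is the agpgart module loaded at all?
```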

In any case, I have tested vanilla-sources 2.6.16.14, and I am not getting any more of these crashes, so 2.6.16 looks good. I'll try 2.6.17 now.

----------

## PantsMan

 *didymos wrote:*   

> 
> 
> ```
> 
> [I] sys-kernel/gentoo-sources
> ...

 

Hmm yeah. I'm not sure where I got the idea gentoo-sources only had .18 and .19 now.

I must've made a mistake there, oops. Sorry. Thanks for pointing this out!

This does at least prove that it would be stupid if gentoo sources DID only have .18 and .19  :Wink: 

----------

## PantsMan

 *erik258 wrote:*   

> 
> 
>         nosmp           [SMP] Tells an SMP kernel to act as a UP kernel.
> 
> try passing that option when booting the kernel; it will probably interfere with hyperthreading, but see if it works better.  if not, you can happily ignore my post and this ugly problem.

 

Unfortunately that option was not good. The PC would not boot: it'd get just past identifying my IDE controller, and then sit there giving "hda: lost interrupt" messages. It did this with hyperthreading both enabled and disabled in the BIOS; it made no difference. Weird.

----------

## erik258

wow - didn't expect that -- but i guess the latter option was best for you from the beginning ; )

----------

## wuzzerd

Went back to 2.6.16-gentoo-r13.  No hangs yet, less trash on screen.  

One of my favorite things about gentoo is building kernels.  I never was successful at it before, now it's almost second nature.

----------

## PantsMan

As stated earlier, vanilla sources 2.6.16.14 looks good. 

But I was easily able to cause a crash on vanilla-sources 2.6.17.14.

So it looks like the bug was introduced between 2.6.16 and 2.6.17

To narrow it down as much as possible I'll now test 2.6.16.36, which is the last of the 2.6.16 vanilla sources.

And if that is good, I'll try 2.6.17.9 which is the earliest of the 2.6.17 series still in portage.

----------

## PantsMan

doh!

No, 2.6.16.36 is bad, and I retested 2.6.16.14 and that is bad too.

Running an emerge at the same time as mplayer on the tv head, and then switching desktops, makes these crashes even easier to reproduce.

So the bug has been around since before 2.6.16.14.

I'm starting to think this is an old kernel bug which has been around for ages, but which is only being triggered now by the new xorg/nvidia/gcc.

----------

## PantsMan

I've just gone back to linux-2.6.15-gentoo-r1 and the problem is still there.

And after having a look at the old kernels lying around in my /boot/ it turns out the kernel I was running before which was stable for 6 months was... linux-2.6.15-gentoo-r1  not 2.6.12 like I thought.

So it must be something to do with the upgrade I did to modular Xorg (from 6.8 to 7.1), new nvidia drivers, or new gcc (4.1.1).

Downgrading the nvidia drivers is going to be a bit dodgy, due to the ABI change, so I may have to downgrade Xorg to 6.8 and revert to the old nvidia drivers... just to be sure it's xorg/nvidia, and not gcc.

I'm not downgrading gcc and recompiling my whole system/world again...

What a pain in the a$$.

----------

## wuzzerd

 *wuzzerd wrote:*   

> Went back to 2.6.16-gentoo-r13.  No hangs yet, less trash on screen.  
> 
> 

 

Took out the frame buffer stuff.  It may be time to check the xorg sis driver.   :Razz: 

----------

## josh0980

PantsMan:

For what it's worth, this is just my anecdotal experience, but it adds to your theory:

I'm running an Acer laptop, Celeron M 1.5 GHz, Intel motherboard, i915GM video. The last stable kernel I've been using, and my current one, is 2.6.17-r2 of no-sources. Since then I've tried 2.6.18 and 2.6.19-r6 of morph-sources, 2.6.18-rc2-r1 of no-sources, and (a supposedly stable) 2.6.18-r4 of gentoo-sources. I keep going back to .17.

Any .18 or .19 sources I've tried give me random (once every few days or so) lockups; hard lockups, though not so hard that sysrq-alt-B won't reboot for me. But hard enough, occasionally, that sysrq-alt-S and -U will neither sync nor unmount my drives, leading to unclean filesystems; very, very bad, in my terms.

Last night (with the .19-morph6 patchset), it locked so hard that I figured my root filesystem was done for: it didn't just go through the replaying of the journal when it mounted, but it stalled before mounting for about 12 seconds (and it has never taken more than one second to do this)! Needless to say, I am wary of post-.17 kernels (though I'm giving 2.6.20-rc2-r1 of mm-sources a go).

(And, unlike a few of the above posters, these problems arose without adding any hardware or changing any kernel configurations whatsoever).

Just my two cents. (And, in response to some of the above posters, I acknowledge that using non-standard sources is risky. Yet I think many are under the presumption that the "stable" gentoo-sources are meant to be stabler than this. Of course, maybe I should try vanilla-sources, I concede.)

Anyhow, good luck in your battle.

----------

## erik258

i have heard that you can't just cp .config, that you have to make oldconfig instead.  Anyone else?

----------

## PantsMan

 *erik258 wrote:*   

> i have heard that you can't just cp .config, that you have to make oldconfig instead.  Anyone else?

 

I always do make oldconfig when compiling a new kernel for the first time with a config I've taken from another kernel. I wouldn't expect it to work if I didn't, what with kernel codemonkeys changing the names of config options and moving them around willy-nilly.
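To see why a plain cp is not enough, it can help to list which options from the old config no longer exist by name in the new one after make oldconfig has run. A minimal sketch; the function name is made up and the paths in the comments are examples only.

```shell
# List option names that are set in the old .config but appear nowhere in the
# new one (neither as CONFIG_FOO=... nor as "# CONFIG_FOO is not set").
# Those are the options that were renamed or removed between versions.
missing_options() {
    old="$1"
    new="$2"
    sed -n 's/^\(CONFIG_[A-Za-z0-9_]*\)=.*/\1/p' "$old" | while read -r opt; do
        grep -q -e "^${opt}=" -e "^# ${opt} is not set" "$new" || echo "$opt"
    done
}

# Typical sequence when moving a config to a new tree (paths are examples):
#   cp /usr/src/linux-2.6.18-gentoo-r6/.config /usr/src/linux-2.6.19-gentoo-r2/.config
#   cd /usr/src/linux-2.6.19-gentoo-r2 && make oldconfig
#   missing_options /usr/src/linux-2.6.18-gentoo-r6/.config .config
```

Anything it prints is worth looking up by hand in menuconfig, since make oldconfig silently drops names it no longer knows.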

So anyway. I tried using the older 8762 drivers with my xorg 7.1. I had to edit the ebuild to do it, and get Xorg to ignore the old ABI. Still the same crashes. So now I have reverted to Xorg 6.8.2.

It's an absolute debacle, this whole situation... you guys have no idea of the paaiiiiiiin.... none of the nvidia driver ebuilds are set to work with the old xorg now. So I have to use nvidia's script directly and just hope things go where they're supposed to. What's more, I can't even do a simple thing like emerging a window manager, because the ebuilds for those, too, are set to use the new Xorg filesystem layout. It looks like all my KDE needs to be rebuilt too, as KDE apps are complaining about a missing libXdmcp.so.6 etc.

Argh. I wonder who came up with the idea of switching the location of Xorg, AND modularising it, and breaking the ABI, all in one go... what genius.... let's change so much that it's bound to cause problems, and then make it an absolute nightmare to go back to a stable config...

----------

## PantsMan

And, of course, I can't remerge KDE without huge hassles because it has multiple dependencies on various bits of modular Xorg...

I think I can safely say that thanks to the way Xorg has been broken up, and the way many ebuilds have dependencies on various parts of it, it is now not even worth trying to revert to the old Xorg. It's just too much hassle.

So there we go.

I now have a broken system, and no prospect of fixing it without wasting my time fixing kernel/nvidia bugs.

Thanks very much Gentoo/Xorg/Nvidia.

----------

## PantsMan

OK, I've cheered up a bit. And decided I am not to be defeated so easily.

I'm using fluxbox and nvidia drivers 8762 at the moment with Xorg 6.8.2.

But what I need is a window manager which has a pager, so I can switch virtual desktops quickly and see if I can replicate the crashes I was getting before.

So I have added 20 pesky modular Xorg bits to package.provided and now I am able to remerge KDE. Well, I'm able to start the emerge; who knows whether it will finish successfully.
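For anyone wanting to try the same trick, a rough sketch follows. Entries in package.provided need a full category/package-version; the package names and versions in the usage comment are placeholders, not the actual 20 entries, and the helper name is made up.

```shell
# Hypothetical helper: add "category/package-version" entries to a
# package.provided file so Portage treats them as already installed.
mark_provided() {
    file="$1"
    shift
    for entry in "$@"; do
        # Skip entries that are already listed.
        grep -qxF -- "$entry" "$file" 2>/dev/null || printf '%s\n' "$entry" >> "$file"
    done
}

# Usage (as root); take the real names and versions from the emerge output:
#   mkdir -p /etc/portage/profile
#   mark_provided /etc/portage/profile/package.provided \
#       x11-libs/libX11-1.0.3 x11-libs/libXext-1.0.1
```

Bear in mind this tells Portage a package is installed when it is not, so anything that genuinely needs those libraries can still fail at build or run time.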

All I want is Kpager really. Actually, e16 would do the job of providing a pager, and better wm. But kwin and konqueror would be nice. And I just can't live without kmail for very long...

----------

## PantsMan

OK, so now I've got KDE recompiled, and I'm using kpager to switch virtual desktops rapidly while mplayer is going and an emerge is happening, and... no crashes. I will test it a bit more before I'm 100% positive, but it seems rock solid again now.

So, I can conclude:

xorg 6.8.2 = good

xorg 7.1    = much bad kernel panic joojoo with all linux kernels >= 2.6.15

For anyone else running xorg 7.1 - who wishes to test whether they have this bug, the steps to reproduce are:

  1) You need an nvidia card with svideo out, as well as vga

  2) play video with mplayer on tv out

  3) load computer with something like an emerge, or glxgears on vga head also worked for me.

  4) use kpager, or enlightenment pager or something to switch virtual desktops rapidly on vga head.

  5) the crashes occur even with compositing off and all other fancy nvidia options like render acceleration off. Incidentally, I am not using Xinerama or Twinview, I have my tv and vga heads set up as completely independent X screens.

Sometimes it will lock up within just a few desktop switches. Usually it takes 10-30 seconds of switching before it locks. Sometimes it manages to survive 30 seconds, but if so, click on some windows, rest your mouse finger for 20 seconds, and try again. It shouldn't be too long before it locks up, if you have this problem.

I've been able to reproduce the bug easily using:

Nvidia 6600GT 128 MB RAM, 

kernels 2.6.15 through to 2.6.19 (vanilla sources and gentoo sources), 

gcc 4.1.1, 

Xorg 7.1, 

nvidia-drivers 9723? through to 9746, and also with 8762, though I know 8762 has the old ABI so shouldn't necessarily be expected to be stable with Xorg 7.1

(and mplayer 1.0_pre8-r1)

Bug also occurs with nv driver, on switching desktops - although it is much less frequent and I'm not sure how to trigger it reliably - as nv driver cannot drive tv head and vga simultaneously.

Bug seems at this stage to not be reproducible at all with Xorg 6.8.2 (and nvidia driver 8762).
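For anyone who would rather script step 4 than click a pager by hand, here is a rough sketch. It assumes the `wmctrl` tool is installed and that the window manager honours `wmctrl -s N` (switch to desktop N); neither is part of the setup described above. The `SWITCH_CMD` override, the desktop count and the delay are made-up names and values for illustration only.

```shell
# Cycle through virtual desktops to mimic rapid pager clicking.
# Assumption: wmctrl is installed and the WM supports "wmctrl -s N".
# SWITCH_CMD is a hypothetical override so another switcher can be swapped in.
switch_desktops() {
    desktops="${1:-4}"    # number of virtual desktops to cycle through
    rounds="${2:-50}"     # how many full cycles to run
    delay="${3:-0.2}"     # pause between switches (fractional values need GNU sleep)
    switch_cmd="${SWITCH_CMD:-wmctrl -s}"

    r=0
    while [ "$r" -lt "$rounds" ]; do
        d=0
        while [ "$d" -lt "$desktops" ]; do
            # Unquoted on purpose so "wmctrl -s" splits into command and option.
            $switch_cmd "$d"
            sleep "$delay"
            d=$((d + 1))
        done
        r=$((r + 1))
    done
}
```

Something like `DISPLAY=:0.0 switch_desktops 4 100 0.1` from a terminal on the vga head would then do the switching while mplayer and the load from steps 2 and 3 are running. Whether scripted switching triggers the lockup as readily as a pager is untested.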

----------

## PantsMan

Bleagh, turns out the bug occurs almost instantly when I run glxgears and then switch virtual desktops, even in Xorg 6.8.2, without even having mplayer on the tv head. All it takes is a few switches, and the system locks up. Interestingly, glxgears continues happily rotating away...

With Xorg 6.8.2 the system is definitely much more resistant to crashing when I switch virtual desktops on the vga head with mplayer playing on the tv head. I haven't been able to make it crash that way yet. But switching desktops when glxgears is running on the vga head causes a crash very easily.

This is ridiculous...

Where is my Windows XP cd?

Perhaps it has something to do with this...

cat /proc/driver/nvidia/agp/status

Status:          Enabled

Driver:          NVIDIA

AGP Rate:        8x

Fast Writes:     Disabled

SBA:             Enabled

I've got the agpgart module loaded, and lsmod reports that it is used by the nvidia module. But proc is reporting that the system is using NvAGP... bizarre... surely it should say:

Driver:          AGPGART

X config has NvAGP 3 so it should be using agpgart if possible, and the agpgart module is loaded...
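A minimal sketch for checking this after each kernel or module change. It only parses the `Driver:` line of the status file shown above; the function name `agp_driver` and the optional file argument are invented here for illustration.

```shell
# Print which AGP backend the nvidia module reports (e.g. NVIDIA or AGPGART).
# The default path is the proc file quoted above; a different file can be
# passed as the first argument, which is handy for trying the parsing out.
agp_driver() {
    status_file="${1:-/proc/driver/nvidia/agp/status}"
    # The line of interest looks like "Driver:          NVIDIA".
    awk '$1 == "Driver:" { print $2 }' "$status_file"
}
```

Running `agp_driver` with no arguments on a box with the nvidia module loaded prints just the driver name, which makes it easy to compare before and after rebuilding the kernel.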

----------

## NoError

 *PantsMan wrote:*   

> All i can do is the ctrl-alt-prtscrn rseiub to sync disks etc and reboot.

 

Then I think it is not a hard lock, otherwise you can not even do that to sync the disks and reboot.

Maybe it's a good idea to build a debug kernel with the kernel debug options turned on. I think then with a next lockup you get more valuable information. So you can check if it is really the kernel or something else.

----------

## PantsMan

 *NoError wrote:*   

>  *PantsMan wrote:*   All i can do is the ctrl-alt-prtscrn rseiub to sync disks etc and reboot. 
> 
> Then I think it is not a hard lock, otherwise you can not even do that to sync the disks and reboot.
> 
> Maybe it's a good idea to build a debug kernel with the kernel debug options turned on. I think then with a next lockup you get more valuable information. So you can check if it is really the kernel or something else.

 

Well, whatever it is, it's pretty severe. When I can't kill the X server, or switch to a vt/console, AND it kills ssh (and Xvnc) (or somehow makes them inaccessible) so I can't even ssh into the box, then it's pretty bad.

So yeah, I spose I've got no option now but to compile some extra debugging info into the kernel and get a console visible so I can see what the kernel spits out when it croaks. Now that I can trigger the crash using glxgears on the vga head, I should be able to get a reasonable amount of console text displayed on my tv head.

Despite my petulant moments of frustration, I'm pretty determined to track this bug down  :Smile: 

----------

## timeBandit

 *PantsMan wrote:*   

>  *NoError wrote:*    *PantsMan wrote:*   All i can do is the ctrl-alt-prtscrn rseiub to sync disks etc and reboot. 
> 
> Then I think it is not a hard lock, otherwise you can not even do that to sync the disks and reboot.
> 
> Well, whatever it is, it's pretty severe. When I can't kill the X server, or switch to a vt/console, AND it kills ssh (and Xvnc) (or somehow makes them inaccessible) so I can't even ssh into the box, then it's pretty bad.

 

If the kernel responds to Alt+SysRq+R/S/E/I/U/B, it will also respond to Alt+SysRq+K (SAK - Secure Attention Key), which is where you should start. That will kill all processes bound to the current VT, and that includes X. This often works when Ctrl+Alt+Backspace does not.

You may see visual garbage, and it may take several seconds before you get a new login prompt as everything dies and resets. But it should result in GDM/KDM eventually resetting and starting a new X server. If you don't use a DM, then a new getty will start. If the display stays scrambled more than 10 seconds or so, try the SAK combo again--sometimes it seems to take two attempts.

At the very least, this will kill the processes hogging the CPU--that's most likely why you can't get in via SSH. Something in X and/or the nvidia driver goes into a tight loop. If a couple of attempts with Alt+SysRq+K don't reset the VT, connect via SSH, then reload the nvidia module:

```
rmmod nvidia

modprobe nvidia

/etc/init.d/xdm restart
```

Obviously, skip the last command if you don't use a DM. With these steps, I have never had to reboot the box to recover from one of these X hangs, and stability seems no worse for wear afterward.
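The SysRq combos only work if the kernel was built with `CONFIG_MAGIC_SYSRQ` and the feature is switched on at runtime, so it is worth checking before the next hang. A small sketch, assuming the usual `/proc/sys/kernel/sysrq` control file; the function name `sysrq_state` is made up for this example.

```shell
# Report whether magic SysRq is enabled.
# 0 = disabled, 1 = all functions enabled, anything else is a bitmask
# of allowed functions. An optional file argument replaces the proc path.
sysrq_state() {
    f="${1:-/proc/sys/kernel/sysrq}"
    # A missing file usually means the kernel lacks CONFIG_MAGIC_SYSRQ.
    val=$(cat "$f" 2>/dev/null) || { echo "unavailable"; return 0; }
    case "$val" in
        0) echo "disabled" ;;
        1) echo "all enabled" ;;
        *) echo "bitmask $val" ;;
    esac
}
```

If it reports `disabled`, running `echo 1 > /proc/sys/kernel/sysrq` as root turns everything on until the next reboot.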

For me, the trigger seems to be drag-and-drop activity in GNOME. It's been increasingly unstable with the last two X updates and is really beginning to piss me off, but I haven't had time to diagnose it precisely. Modular X is a bloody shambles.

I'm interested to see whether a debug kernel helps your diagnosis. AFAICS this results from X and the nvidia module not playing nice, so I'm hopeful.

----------

## PantsMan

 *timeBandit wrote:*   

> 
> 
> If the kernel responds to Alt+SysRq+R/S/E/I/U/B, it will also respond to Alt+SysRq+K (SAK - Secure Attention Key), which is where you should start. That will kill all processes bound to the current VT, and that includes X. This often works when Ctrl+Alt+Backspace does not.
> 
> If a couple attempts with Alt+SysRq+K don't reset the VT, connect via SSH then reload the nvidia module:
> ...

 

Thanks for the excellent advice timeBandit - I don't think I've ever heard of the SAK before  :Smile: 

Unfortunately, it did not manage to restore a usable console. It did manage to kill X, but left the screen completely blank. I did it several times, waiting a while between attempts, then doing several in quick succession, but no change.

Anyway, I have checked through my kernel config, enabled more debug, and also found that I did not have Intel chipset AGP support enabled, argh. A while ago I decided to remove agpgart from the kernel to see if NvAGP would improve things. Turns out, when I re-enabled agpgart, I somehow missed re-enabling Intel chipset support - which is why cat /proc/driver/nvidia/agp/status has been reporting the NVIDIA agp driver for a while... Now that I have enabled Intel support, proc is reporting Driver:          AGPGART again, as it should, and I can still produce these hard locks  :Sad: 

However! Now I have managed to find something extra in my logs!

Jan  8 02:22:17 vorpal NVRM: Xid (0001:00): 3, C 00000000 SC 00000004 M 000002fc Data 00000003

Jan  8 02:22:17 vorpal NVRM: Xid (0001:00): 28,  L1 -> L0

Jan  8 02:22:27 vorpal SysRq : SAK

etc

It's one of these goddamn nvidia NVRM Xid errors!
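To see which Xid numbers turn up across several crashes, something like the following sketch pulls them out of log text. It relies only on the message format in the lines above; the function name `xid_codes` is invented, and where your syslog actually lives (e.g. `/var/log/messages`) is an assumption that depends on the logger setup.

```shell
# Print the Xid error number from every "NVRM: Xid" line, one per line.
# Reads the files given as arguments, or stdin when none are given.
xid_codes() {
    # Matches e.g. "... NVRM: Xid (0001:00): 28,  L1 -> L0" and prints "28".
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' "$@"
}
```

For example, `xid_codes /var/log/messages | sort -n | uniq -c` gives a count per error number, which is handy when searching the nvidia forums for a specific Xid.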

Now I'm off to search the web and refresh my memory re these horrid things.

I'll also have a go at ssh'ing into the box once I've crashed it again in a few minutes.

----------

## PantsMan

I've just found these two threads:

X.org Lockups:

https://forums.gentoo.org/viewtopic-t-198023-postdays-0-postorder-asc-start-0.html

X.org Lockups (part 2)

https://forums.gentoo.org/viewtopic-t-334436-postdays-0-postorder-asc-start-875.html

That's 30-odd pages each... so it's going to take a while to skim through all of that...

----------

## PantsMan

Turns out that ssh is not killed - it remains active as long as I am already logged in. Probably I would also be able to log in after using the SAK to kill X, though I haven't tested that. In any case, I am able to kill X - but as soon as I restart it, it hogs 100% CPU again, even if I do rmmod nvidia first from the ssh session. Screen remains black  :Sad: 

According to the nvidia forums a lot of these Xid errors are caused by bad hardware, though I am not quite convinced yet. It just seems too much of a coincidence that these errors started immediately after upgrading my nice stable system that I hadn't changed much for 6 months. Nevertheless, it could be hardware gone bad, I spose. Coincidences do happen.

I can test the system in Windows, and see how the hardware performs there...

In any case, this looks more like an X problem than a kernel problem so I spose I better mark this thread solved and move my bitching over to the Xorg Lockups 2 thread  :Smile: 

Thanks for the advice, NoError, timeBandit and others.

----------

