# how to diagnose random freezes?

## ExecutorElassus

So, I'm back to posting from my smartphone. Since about last week, my Gentoo box (kernel 3.9.4) has been freezing at random during normal use. Only the reset button brings it back, and the system doesn't respond to SSH login attempts. So far this has only happened while X is running, but with minimal other stuff going on (the most recent freeze was last night, when I had just rebooted and was waiting for mdadm to resync).

So: how do I find out which program/package is failing? I don't think it's hardware (smartctl reports no errors for the drives), but I can't rule it out completely.

The hardcore approach of "downgrade everything until it's stable again" is not likely to be effective because I've emerged/updated a couple hundred packages since last week. I suspect it's one of xorg, the kernel, the nvidia driver, or udev, but I can't be positive. Is there somewhere where a kernel log might record crash dumps? I have syslog running.

On a side note, how do I get /var/log/messages to be more readable? Right now it's line after line of hex codes, with a mashed-up block of text in a non-wrapped column to the right. Or are kernel messages logged elsewhere?

Cheers,

EE

----------

## Logicien

Hi,

You say Gentoo only crashes when X is running. It is possible that your kernel configuration is problematic for the Nvidia driver and X, for example the DRM and framebuffer options.

X can have configuration problems of its own. Starting X with no configuration file can help. The kernel and X logs should be readable after a freeze. You should also try some other X video drivers, like nouveau, nv, vesa and fbdev, to check whether the problem is specific to the proprietary Nvidia driver.

----------

## binbash

9 times out of 10, random freezes are caused by bad RAM, so check yours with memtest before wasting time on anything else.

----------

## cwr

And the tenth time they are due to overheating.

Will

----------

## ExecutorElassus

… except for the part where I downgraded from the RC of xorg-server and the random freezes stopped happening. I've now been running without problems for a couple of weeks; back when I had upgraded to xorg-server-1.14-902 (I think that was the number; it might have been 901), it was locking up every few hours.

So I don't think it was heat, or bad memory. But now the non-RC version of xorg-server is out, and I'll see if the problem returns.

I just wish xorg would log some message when it crashes, so I could figure out what was going wrong.

I'll keep y'all posted. Thanks for the tips.

Cheers,

EE

----------

## depontius

A couple of things.

First, when this happens, is the Caps Lock LED blinking? There is code somewhere in there to blink out crash information in Morse code. I've never actually copied this down myself, though I've seen it happen. I haven't done Morse code since high school.

Second, do you have the Magic SysRq key enabled?  One of the first things I do is enable it in the kernel and in "/etc/sysctl.conf".
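If you're not sure what your current setting allows, here is a minimal Python sketch that reads `/proc/sys/kernel/sysrq` and decodes it. The bit meanings are as I understand them from the kernel's SysRq documentation; the function names `describe_sysrq` and `read_sysrq` are just made up for this example.

```python
# Bit meanings paraphrased from the kernel's SysRq documentation.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def describe_sysrq(value):
    """Return the SysRq functions permitted by a kernel.sysrq value."""
    if value == 0:
        return ["disabled"]
    if value == 1:
        return ["all functions enabled"]
    # Any other value is treated as a bitmask of allowed functions.
    return [label for bit, label in SYSRQ_BITS.items() if value & bit]


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current setting; return None if the file cannot be read
    (for example when the kernel was built without CONFIG_MAGIC_SYSRQ)."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    current = read_sysrq()
    if current is None:
        print("kernel.sysrq is not available on this system")
    else:
        print(f"kernel.sysrq = {current}: " + ", ".join(describe_sysrq(current)))
```

For crash recovery the bits that matter most are sync (16), remount read-only (32) and reboot (128); putting `kernel.sysrq = 1` in "/etc/sysctl.conf" simply allows everything.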

The problem here is that maybe your crash information got logged, but if you've just hit the reset button, those log entries never made it out to disk.  With the Magic SysRq key you can sync your disks and get a much cleaner reboot than with the reset button.  Then it's just possible that relevant information might make it to your logs.

Another thing you might try is remote logging. Let's say the kernel really is crashing: then Magic SysRq won't work either, so you can't get logging information written to disk. If you can at least get that information sent to a remote logging machine, maybe you can find the info there.
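To give an idea of what the receiving end can look like, here is a rough Python sketch of a UDP log receiver for the second machine. It is only an illustration, not a replacement for a real syslog daemon: it assumes the crashing box forwards its messages as plain UDP datagrams with the classic `<PRI>` prefix, and the names `split_priority` and `receive_logs` are invented for this example. How you configure the sending side depends on which syslog daemon you run, so I've left that out.

```python
import os
import socket


def split_priority(datagram):
    """Split a '<PRI>message' datagram into (facility, severity, text)."""
    text = datagram.decode("utf-8", errors="replace").rstrip("\n")
    if text.startswith("<") and ">" in text[:6]:
        pri, _, rest = text[1:].partition(">")
        if pri.isdigit():
            # PRI = facility * 8 + severity
            return int(pri) // 8, int(pri) % 8, rest
    # No usable PRI: fall back to user.notice (facility 1, severity 5).
    return 1, 5, text


def receive_logs(sock, log_path, max_messages=None):
    """Append each received datagram to log_path, forcing every line to
    disk before waiting for the next one. Returns the number of messages."""
    count = 0
    with open(log_path, "a", encoding="utf-8") as log:
        while max_messages is None or count < max_messages:
            data, sender = sock.recvfrom(8192)
            facility, severity, text = split_priority(data)
            log.write(f"{sender[0]} facility={facility} severity={severity} {text}\n")
            log.flush()
            os.fsync(log.fileno())
            count += 1
    return count

# Example use on the logging machine (port and file name are hypothetical):
#   listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   listener.bind(("0.0.0.0", 5514))
#   receive_logs(listener, "crashbox.log")
```

The point of the flush and fsync after every line is that the last message before the freeze is the one you care about, so it should not sit in a buffer on the receiving side either.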

The one time I went this far, the problem turned out to be a bad power supply which was just dropping power.  No local or remote log info, no kernel panic info.  But that lack of info itself was informative.  A new power supply solved the problem.

----------

