# mencoder random reboot -- how to debug?

## Dave_Lindquist

I've been running some lengthy mencoder processes on my quad-core (32-bit Gentoo), and have encountered a weird problem.

After running about 2-3 mencoder processes in a row, each about 2 hours or so long, the machine simply reboots.

I've checked /var/log/everything (metalog), dmesg, etc -- but I can't see any sort of diagnostic information that would help to figure out what is going on.

The processes only take about 20-30% CPU, since they don't seem to exercise more than 1 core; I've definitely run a lot of stuff that hits the machine harder than this before.

Can anyone help me with suggestions of what kind of logging/etc I should turn on to try to figure out what is happening?

Thanks!

----------

## eccerr0r

A user-mode program being able to reboot the machine would be a fairly serious security hole.  I'll assume you're not running as root, and that you haven't discovered a bug/security hole in mencoder that reboots the system.

Likely what's happening is that the board health monitor is detecting some out-of-norm situation and issuing a reset.  You could emerge lm-sensors, make sure the coretemp sensor module is available, and confirm your computer isn't overheating (or check the computer for dirty heatsinks, clogged fans, an improperly assembled heatsink, etc.).  Also check the northbridge for the same issues.

The other question is whether your machine stays stable under sustained compilation (one that doesn't wait on disks; waiting on disks lets the CPU cool down).  Either way, check the temperature to make sure the two workloads are running under similar conditions.

----------

## Dave_Lindquist

Yeah, I've definitely run the machine harder (though perhaps not for as long) and had no problems.

I'm not running as root.

I thought of the health thing, so I've been monitoring with lm_sensors (I've got a screen session on another comp watching repeated sensors output so I can check results at reboot time), but so far the temps don't look anywhere near nasty.  (I've got one core at ~53 and the rest at ~45 -- Celsius, that is.)
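For anyone wanting to keep a record of those readings rather than eyeballing a screen session, here is a rough sketch of a logger around the `sensors` command. The line format it parses ("Core 0:  +53.0°C ...") is an assumption and varies by driver and lm-sensors version; the function names and the log path are made up for illustration.

```python
import os
import re
import subprocess
import time

# Matches lines such as "Core 0:   +53.0°C  (high = +80.0°C, ...)".
# ASSUMPTION: this label/temperature layout; adjust for your sensors output.
TEMP_LINE = re.compile(r"^(?P<label>[^:]+):\s*\+?(?P<temp>-?\d+(?:\.\d+)?)\s*°?C")


def parse_temps(text):
    """Return {label: degrees Celsius} for every temperature line found."""
    temps = {}
    for line in text.splitlines():
        match = TEMP_LINE.match(line.strip())
        if match:
            temps[match.group("label").strip()] = float(match.group("temp"))
    return temps


def log_forever(path="sensors.log", interval=5):
    """Append a timestamped reading to `path` every `interval` seconds."""
    while True:
        output = subprocess.run(
            ["sensors"], capture_output=True, text=True
        ).stdout
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(path, "a") as log:
            log.write(f"{stamp} {parse_temps(output)}\n")
            # Push the line to disk so the last reading survives a sudden reset.
            log.flush()
            os.fsync(log.fileno())
        time.sleep(interval)
```

Calling `log_forever()` and reading the tail of the log after the next reboot should show whether the temperatures were climbing just before the reset.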

I don't have any fan or dust problems (I check fairly regularly, plus this machine is relatively new).

Is there a linux-user-space equivalent of something like prime95?

----------

## Dave_Lindquist

Oh, I guess I should mention, too...

This mencoder process is pulling the source file across NFS -- so it is actually fairly IO-bound as well (only 100-meg network, and fairly normal hard-drives -- ie, not super-fast IO).

----------

## eccerr0r

Honestly the tools needed to really debug such failure are out of the reach of most consumers.  Logic analyzer?

Your best bet is to start swapping out hardware, as likely it is some hardware issue, more likely with motherboard/chipset, possibly CPU.

Another test: do you still see the reboot if you run from local disk instead of NFS?  That would at least narrow it down to possibly something motherboard-Ethernet related.

----------

## Dave_Lindquist

Yeah, my next thought was to run local and see if that made any difference.

One other thought -- I think I have the kernel set to auto-reboot on panic... Is there any way to know if a panic has occurred?
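On the auto-reboot part of that question: the running kernel exposes the setting as the `kernel.panic` sysctl, readable at `/proc/sys/kernel/panic`. A small sketch of reading and interpreting it follows; the helper names are hypothetical, and the handling of negative values assumes a reasonably recent kernel.

```python
def describe_panic_timeout(value):
    """Explain what a kernel.panic sysctl value means."""
    seconds = int(value)
    if seconds == 0:
        return "kernel halts on panic and waits (no automatic reboot)"
    if seconds > 0:
        return f"kernel reboots {seconds} s after a panic"
    # ASSUMPTION: negative values mean "reboot immediately" (newer kernels).
    return "kernel reboots immediately on panic"


def read_panic_timeout(path="/proc/sys/kernel/panic"):
    """Return the raw sysctl value as text."""
    with open(path) as f:
        return f.read().strip()
```

Running `print(describe_panic_timeout(read_panic_timeout()))` on the affected box shows whether a panic would be followed by a silent reboot.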

----------

## eccerr0r

Aha... don't do that (reboot on panic)... the kernel faults/oops contain useful information to debug.

Now that we know that it's probably a panic/oops we need that information.

Panic information tends to not get written to disk (usually panic means that the state of the kernel is so corrupt that writing to disk could cause additional corruption) so you need to somehow get that information without disk.  If you can get it to panic in console or serial console mode, that information would be very helpful.
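Besides a serial console, one other way to get the messages off the box without touching its disk is the kernel's netconsole, which sends log lines as UDP packets to another machine. This is not something anyone in the thread has set up, so treat it as an assumption: the sketch below is only the receiving side, and it presumes the crashing machine already has netconsole pointed at this host on UDP port 6666 (netconsole's default target port). Function names and the log path are illustrative.

```python
import socket
import time


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for incoming netconsole packets."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_message(sock, bufsize=65535):
    """Block until one datagram arrives; return (sender_ip, text)."""
    data, (sender, _port) = sock.recvfrom(bufsize)
    return sender, data.decode("utf-8", errors="replace")


def capture(path="netconsole.log", port=6666):
    """Append every received kernel message to `path` with a timestamp."""
    sock = open_listener(port=port)
    with open(path, "a") as log:
        while True:
            sender, text = receive_message(sock)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {sender} {text.rstrip()}\n")
            log.flush()
```

Leaving `capture()` running on a second machine during an encode would catch whatever oops or panic text the kernel manages to send before it goes down.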

----------

## Dave_Lindquist

Yeah, I guess I'll turn that off, then try it.

Too bad nca doesn't seem to work anymore -- I was going to screen it from another box and catch whatever happened ;)

----------

## DigitalCorpus

I have mencoder running for many, many hours a week (I have over 340 hours of TV stored on my server in multiple resolutions). A corrupt source or a problem with a filesystem can cause a reboot like that. Before touching kernel options, try encoding different sources, and instead of pulling them from the network, try them from local disk first. If that works, try pulling them from the network during the encode. If all is good there, then you have a source problem that is breaking mencoder really badly. I've had it happen. I should mention I run 64-bit and ~arch with no stability problems, but ymmv.
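When copying a source to local disk for that comparison, it may be worth confirming the local copy matches the NFS original, so a bad transfer isn't mistaken for a bad source. A minimal sketch, with hypothetical helper names, that hashes both files in chunks:

```python
import hashlib


def read_chunks(path, chunk_size=1 << 20):
    """Yield the file's contents in chunks so large videos fit in memory."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def sha256_of_chunks(chunks):
    """Return the SHA-256 hex digest of an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files hash to the same digest."""
    return sha256_of_chunks(read_chunks(path_a)) == sha256_of_chunks(
        read_chunks(path_b)
    )
```

If `same_content()` returns False for the NFS path and the local copy, the network or filesystem layer is corrupting data; if it returns True, any remaining failure points at the source file itself or the machine.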

----------

## Dave_Lindquist

Well, I think I've managed to narrow things down at least a little...

First, I didn't have the reboot on panic option turned on in kernel (must have been a different server where I did that).

But bearing in mind hardware issues, I recompiled mencoder without nvidia in the VIDEO_CARDS option (I have integrated nVidia video, so I had that turned on).

So far, I've done almost double the encoding I'd ever managed before, with nary an issue.

I'll keep going to see if this is the problem or not...

Odd, though, doesn't seem to make much difference (if any) to time needed to do the encoding...

Thanks for all the suggestions!

----------

## DigitalCorpus

The video card option should really only affect decoding upon playback if I'm not mistaken.

----------

## Dave_Lindquist

Hmm... That's odd, then, because I *seem* to be getting much more stability with it compiled with that off...

----------

