# Machine keeps rebooting

## curmudgeon

I have a machine that has become a major headache.

I have noticed occasional random reboots in the past (the system just restarts without any warning, maybe once a week or so), which should have warned me, but now the problem has become much worse.

The facts:

1. Asus P5VD2-MX running amd64 Gentoo with any recent kernel (2.6.36-gentoo-r8, 2.6.37-gentoo-r4, 2.6.38-gentoo-r6).

2. Machine will (apparently) stay up indefinitely when not under load.

3. When stressed (in particular when compiling), the machine will reboot (always) within five to twenty minutes.

4. I am attempting to install the KDE 4.6 upgrade, and I cannot get past some of the bigger pieces (I have simply run emerge --resume over thirty times now to continue, but that is obviously not a viable solution).

5. I am not doing anything funny to the machine. It is a standard Intel E6300  running at the standard 1.86 GHz.

6. I have run memtest86 (from the Gentoo live DVD) overnight, and it has not detected any problems.

7. I don't believe it is a thermal problem (the sensors command shows peak temperatures of about 60 C when compiling).

Does anyone have any idea what could be causing this and/or how to track it down? Thank you in advance.

----------

## Hu

Even though you say you checked the thermal sensors, my first suspects would be overheating or possibly inadequate power supply.  Bad RAM tends to manifest as program crashes, not spontaneous system reboots.

You say the machine always crashes when stressed.  Does this apply even for trivial stresses, such as running a CPU hog on each core (with no load on the disk or RAM, just spinning the CPU)?
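For reference, a CPU-only hog of the kind described above could look roughly like the following bash sketch. The function names `core_count` and `spin_all_cores` are made up for illustration, and it assumes `nproc` from GNU coreutils is available. Each core gets an empty shell loop, which touches neither the disk nor any significant amount of RAM.

```shell
#!/bin/bash
# Sketch: one busy loop per core, with no disk or RAM traffic.
# core_count and spin_all_cores are illustrative names, not an existing tool.

# Number of online cores (nproc is part of GNU coreutils).
core_count() {
    nproc
}

# Spin every core for the given number of seconds, then stop the hogs.
spin_all_cores() {
    local seconds=$1 pids=() i
    for ((i = 0; i < $(core_count); i++)); do
        # An empty shell loop keeps one core fully busy in user space.
        ( while :; do :; done ) &
        pids+=("$!")
    done
    sleep "$seconds"
    kill "${pids[@]}"
    # Reap the killed hogs; their non-zero exit status is expected.
    wait "${pids[@]}" 2>/dev/null || true
    echo "spun ${#pids[@]} core(s) for ${seconds}s"
}
```

Calling `spin_all_cores 3600` after sourcing the file would load every core for an hour. If the machine survives that, the CPU alone is a less likely culprit.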

----------

## curmudgeon

 *Hu wrote:*   

> Even though you say you checked the thermal sensors, my first suspects would be overheating or possibly inadequate power supply. 

 

I don't have any reason to suspect overheating, though I suppose some fan could have quit. I can't find any sensors (other than coretemp), which seems strange for an Asus board (there is fan information available in the BIOS pages, which I will have to check next time).

The power supply theory (which I hadn't thought of before) sounds intriguing. Not inadequate, but perhaps failing (to explain the recent deterioration). Actually, I did get the impression (I bought the machine used) that it was a cheap power supply (local voltage only, not 120/240), so problems there would not surprise me.

Is there some way of testing it (short of buying another one and plugging it in :) )? I do have a cheap Chinese "power supply tester," but I actually wouldn't know what to look for.

 *Hu wrote:*   

> You say the machine always crashes when stressed.  Does this apply even for trivial stresses, such as running a CPU hog on each core (with no load on the disk or RAM, just spinning the CPU)?

 

I had not tried that, but at the moment, I am running (in two terminals):

```
dd if=/dev/urandom of=/dev/null
```

I have the temperatures up to 74 C and 75 C on the cores (higher than I have ever seen before, and "high" is 74 C). Top is showing

```
top - 05:42:05 up 12 min,  3 users,  load average: 2.44, 2.02, 1.12
Tasks:  94 total,   3 running,  91 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.5%us, 99.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3088560k total,   404408k used,  2684152k free,   147848k buffers
Swap:  2056284k total,        0k used,  2056284k free,   110728k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2108 user      20   0 11472  688  568 R  100  0.0  10:18.76 dd
 2109 user      20   0 11472  692  568 R  100  0.0  10:08.76 dd
```

Let's assume this doesn't reboot (I will run it for an hour or so). Does that indicate that hard drive accesses are overstressing the power supply?

----------

## Hu

If that does not reboot, it would tend to rule out CPU overheating.  Hard drive access stressing the power supply remains a possibility, but is not proved by the failure to reboot when stressing the CPU.  You could stress the disk with something like (untested):

```
while :; do
    dd if=/dev/sda of=/dev/null skip=$RANDOM count=1 bs=4096 iflag=direct
done
```

This should seek to random places on the disk, read 4K, and repeat, as fast as possible.  It is non-destructive.  If that does not elicit a reboot, try running both this fragment and your previous RNG stress concurrently, which would be characteristic of a heavy compilation workload.
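One caveat with the loop above: bash's `$RANDOM` tops out at 32767, so with `bs=4096` the reads all land in the first 128 MiB of the disk and the heads barely move. A variant that spreads the reads over the whole device might look like the sketch below. The names `random_block` and `seek_stress` are made up for illustration, and it assumes `blockdev` from util-linux is installed and that you have permission to read the device. Like the original, it only reads.

```shell
#!/bin/bash
# Sketch: random 4 KiB reads spread over a whole device (read-only).
# random_block and seek_stress are illustrative names.

# Print a random block index in [0, total_blocks).
# Two $RANDOM draws give 30 random bits; one draw alone tops out at 32767.
random_block() {
    local total_blocks=$1
    echo $(( ((RANDOM << 15) | RANDOM) % total_blocks ))
}

# Read `reads` random 4 KiB blocks from `dev`.
seek_stress() {
    local dev=$1 reads=$2 total_blocks i
    # blockdev --getsize64 reports the device size in bytes.
    total_blocks=$(( $(blockdev --getsize64 "$dev") / 4096 ))
    for ((i = 0; i < reads; i++)); do
        dd if="$dev" of=/dev/null bs=4096 count=1 \
           skip="$(random_block "$total_blocks")" iflag=direct 2>/dev/null
    done
}
```

For example, `seek_stress /dev/sda 100000` (as root) would issue 100000 scattered reads. Thirty random bits of 4 KiB blocks cover disks up to 4 TiB.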

----------

## curmudgeon

Well, now I am more confused than ever. I ran that script with two of the CPU stress tests, and the machine has stayed up for two hours (the longest I have seen it stay up when trying to compile something over the past several days is thirty minutes). Any other ideas?

----------

## curmudgeon

Still getting nowhere with this.

At someone's suggestion, I downloaded and tried to run the livecd from http://www.inquisitor.ru/

It didn't impress me too much. I kept seeing a lot of errors attempting to load modules, and a majority of the tests were destructive.

I tried the smart test (I have smartd running on the machine and do surface tests regularly, and have never encountered any problems).

It returned:

```
Test hdd-smart[CALLED]
Initial /dev/sda testTest hdd-smart[FAILED]
Fatal failure: testing stopped
Reason:
```

Not very helpful.

Anyway, the more interesting event occurred when performing one of the stress tests that does video transcoding. That test failed (spectacularly) after a few moments with:

```
Test benchmark-hqrip[CALLED]
Creating filesystem...[ DONE ]
Copying source file...[ DONE ]
Transcoding... [ 9873.505196]
[ 9873.505196] HARDWARE ERRORo irq handler for vector
[ 9873.505196] Kernel panic - not syncing: Machine checkcontact your hardware ve
```

Note that I copied the output EXACTLY (typos and all). I don't know if this represents the same problem or not (why is gentoo rebooting rather than just stopping with a kernel panic?), but it might provide someone with just enough information to come up with an idea.
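On the panic-versus-reboot question, one thing that may be worth checking is the kernel's panic timeout in `/proc/sys/kernel/panic`. As far as I know, a positive value makes the kernel reboot that many seconds after a panic, zero makes it stay halted, and on newer kernels a negative value means an immediate reboot. The helper name `describe_panic_timeout` below is made up for illustration. Note that a reset triggered by the hardware itself would bypass this setting entirely, so a zero here does not rule out a machine check as the cause.

```shell
#!/bin/bash
# Sketch: interpret the kernel's panic timeout.
# describe_panic_timeout is an illustrative helper name.

# Takes the integer read from /proc/sys/kernel/panic.
describe_panic_timeout() {
    local t=$1
    if [ "$t" -eq 0 ]; then
        echo "kernel stays halted after a panic (no automatic reboot)"
    elif [ "$t" -gt 0 ]; then
        echo "kernel reboots ${t}s after a panic"
    else
        echo "kernel reboots immediately after a panic"
    fi
}

# Report the setting of the running kernel, if the file is readable.
if [ -r /proc/sys/kernel/panic ]; then
    describe_panic_timeout "$(cat /proc/sys/kernel/panic)"
fi
```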

----------

## MacGyver031

 *curmudgeon wrote:*   

> Well, now I am more confused than ever. I ran that script with two of the CPU stress tests, and the machine has stayed up for two hours (the longest I have seen it stay up when trying to compile something over the past several days is thirty minutes). Any other ideas?

 

I had a similar situation: I had swapped my original power supply for another one, and the box rebooted at every compilation. The trick that worked for me was to exchange the 3.5-inch disk for a 2.5-inch laptop disk and take out the DVD drive.

After this, the box only reboots if a compilation runs long enough to drive the CPU fan to 100% capacity.

If you have checked memory (memtest86) and disk (badblocks), you should consider using a better (higher-wattage) power supply.

In my case, the mainboard is running at the limits of 3.3V.

----------

## curmudgeon

This is still not solved (though I have made progress).

I tried a brand new high-wattage power supply, and it was noticeably better, but instead of rebooting every 10-30 minutes, it rebooted every 10-50 minutes. I did a lot more tests, disabling devices and whatnot, and nothing seemed to help.

Finally, I came up with the obvious idea of locking the cpu to the lower speed (it only runs at 1.87 GHz and 1.60 GHz, and obviously was always using 1.87 GHz when compiling). That actually made a huge difference. The machine did not reboot even after several hours of compilation.
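For anyone wanting to reproduce the frequency lock: the post above does not say how it was done, but one possible way is through the cpufreq sysfs interface. The sketch below assumes a driver that exposes `scaling_available_frequencies` (acpi-cpufreq does) and must be run as root to take effect. The names `lowest_freq` and `cap_all_cpus` are made up for illustration.

```shell
#!/bin/bash
# Sketch: cap every CPU at its lowest advertised frequency via cpufreq sysfs.
# lowest_freq and cap_all_cpus are illustrative names.

# Print the smallest value from a space-separated list of frequencies (kHz).
lowest_freq() {
    local f min=
    for f in $1; do
        if [ -z "$min" ] || [ "$f" -lt "$min" ]; then
            min=$f
        fi
    done
    echo "$min"
}

# Write the lowest available frequency into scaling_max_freq for each CPU.
cap_all_cpus() {
    local dir
    for dir in /sys/devices/system/cpu/cpu[0-9]*/cpufreq; do
        # Skip CPUs whose driver does not list its frequencies.
        [ -r "$dir/scaling_available_frequencies" ] || continue
        lowest_freq "$(cat "$dir/scaling_available_frequencies")" \
            > "$dir/scaling_max_freq"
    done
}
```

Running `cap_all_cpus` as root leaves the governor alone but prevents it from selecting the higher clock. The setting does not survive a reboot.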

Still want to get whatever the problem is fixed. Any newer ideas about the possible causes of this?

----------

