# [REOPENED] Auto-Rebooting System

## Holysword

Recently my system started rebooting instantly, without any kind of warning. I don't know what it is related to, 'cause I don't know if there is any log where the kernel registers this kind of thing, so the only thing I could do was check every piece of my system I could find.

I noticed that it crashes only when it is performing a heavy task, like compiling or a heavy math calculation (I use MATLAB frequently). For anything else, it's running fine.

First I thought it had something to do with the kernel. I tested 4 different kernel versions: zen-2.6.29, zen-2.6.30-rc5, gentoo-2.6.26-rc4, gentoo-2.6.28.5. All of them showed the same behaviour.

Then I considered some weird bug in compositing or other graphics-related stuff. I tried running compilations in a pure console, and got the same behaviour.

The crashes are kinda random... there is no approximate time for them to happen, it just happens at some point if I leave my computer doing hard work: sometimes in 4min, sometimes in 4s, sometimes in 4h. Since there is no pattern, I suspected a memory failure... but I don't know how to test the memory.

Appreciate any help.

----------

## eccerr0r

Some motherboards are designed to shut down or reset when the processor overheats.  Might want to check if this is happening.  If you can underclock your CPU, that's another thing to try.

Usually RAM errors don't behave like this.  Normally it's motherboard or possibly CPU issues.

You can download and run memtest86+ to check your RAM.  It's a good idea to check it anyway.

----------

## AaronPPC

I have had 3 computers that had CPU problems and they exhibited symptoms very similar to yours.

----------

## ronmon

I'll bet that it's heat. Mine started doing this a few weeks ago when the weather started getting warmer here in the sub-tropics. Watching in gkrellm, I saw my CPU temp go up to almost 70C under heavy loads.

Going into BIOS and dropping the CPU voltage from 1.35 to 1.30 did the trick. Just .05 volt difference, and now it never goes over 40C.

----------

## Holysword

I didn't think it was the heat 'cause I've never overclocked my CPU. But when you guys mentioned the BIOS I remembered that I'd enabled some "fancy" options in there, like CPU Fan warnings that are disabled by default. "Load Optimal Defaults" fixed it. Maybe this kind of "Warning" signal was confusing the kernel?

Anyway, the temperature sensors are not found here even though I enabled them in the kernel... where am I supposed to find them?

----------

## ronmon

Mine wasn't overclocked either.

You need to emerge lm_sensors, start it and set it to start automatically on boot.

```
# Install, start now, and start on every boot (run as root)
emerge lm_sensors
/etc/init.d/lm_sensors start
rc-update add lm_sensors default
```

After that, you can check it from a terminal with the "sensors" command or use one of the many desktop applets available to monitor it constantly.

Edit: Just one more thought. If disabling fan and/or temperature warnings in your BIOS stops the machine from shutting down when it really is overheating, you could fry your stuff. Get your sensors working to give yourself some peace of mind.
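For a quick check without a desktop applet, something like the following could filter the "sensors" output down to the temperature lines above a limit. This is only a rough sketch: the function name `hot_temps` and the 60 degree limit are made up for illustration, and it assumes the temperature labels start with "temp" as they do in typical lm_sensors output.

```shell
#!/bin/sh
# hot_temps LIMIT: read `sensors`-style text on stdin and print every
# temperature line whose first reading is above LIMIT (degrees C).
# Hypothetical helper, not part of lm_sensors.
hot_temps() {
    awk -v limit="$1" '
        /^[Tt]emp[^:]*:/ {
            # The reading is the first number after the first colon.
            value = substr($0, index($0, ":") + 1)
            if (match(value, /[-+]?[0-9]+(\.[0-9]+)?/)) {
                reading = substr(value, RSTART, RLENGTH) + 0
                if (reading > limit) {
                    label = substr($0, 1, index($0, ":") - 1)
                    print label ": " reading
                }
            }
        }'
}
```

Typical use would be `sensors | hot_temps 60`, which prints nothing as long as every reading stays at or below the limit.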

----------

## entrophie

Holysword: you don't need to overclock your CPU to get overheating.  The silver thermal paste between the CPU and the heatsink tends to leak. So after you check the temperature with sensors, you can try to improve the contact.

----------

## szczerb

First of all, check that your heatsink isn't full of thick dust, which can easily block the airflow.

----------

## Holysword

Thank you guys for answering, but I still don't think it's overheating. I've checked the temperature with sensors and it seems okay.

----------

## eccerr0r

I guess if you rule out all the normal reasons that cause reboot then the only things that remain are the non-normal...

and those are

1. Hackers.  But very unlikely.

2. Your hardware is broken.  You need to buy a new power supply most likely, or possibly motherboard.

----------

## Holysword

I still don't think it's any of those. As I stated before, I disabled some non-default BIOS options and the problem went away.

----------

## Holysword

Turns out that the problem came back, and now it happens more frequently.

Again, it's not about the temperature (I still check the temperature sensors, nothing odd. Sometimes it crashes at 47°C). The difference is that now I don't need to be doing something aggressive, which makes this problem even more annoying.

I'll try to install memtest86+ in that fancy GRUB way and check the memory, and I'll post the results here.
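Since the machine goes down without warning, one way to know the temperature right before a crash is to append every reading to a file and flush it to disk. A rough sketch of that idea follows; the function name `log_temps`, the log path and the 10 second interval in the usage comment are all made up for illustration.

```shell
#!/bin/sh
# log_temps FILE: read `sensors`-style text on stdin and append the
# temperature lines to FILE, each prefixed with a timestamp.
# Hypothetical helper, not part of lm_sensors.
log_temps() {
    stamp=$(date '+%Y-%m-%d %H:%M:%S')
    grep -i '^temp' | while IFS= read -r line; do
        printf '%s %s\n' "$stamp" "$line"
    done >> "$1"
    # Flush buffers so the last entry survives a sudden reboot.
    sync
}

# Example loop (path and interval are arbitrary):
#   while true; do sensors | log_temps /var/log/temps.log; sleep 10; done
```

After the next reboot, the tail of the log file shows the last readings that made it to disk.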

----------

## rjw8703

I had this problem a while back.  It turned out to be the voltage regulators on the m/b were dying intermittently.  Replacing the m/b fixed the problem.

----------

## Gusar

Faulty power supply would be my guess. I had a problem kinda like that. After an hour or so, the machine would reboot. And then it would reboot every few minutes. Leaving it off for a while, it would work for an hour again. Changing the power supply made the problem go away.

----------

## Holysword

 *Gusar wrote:*   

> Faulty power supply would be my guess. I had a problem kinda like that. After an hour or so, the machine would reboot. And then it would reboot every few minutes. Leaving it off for a while, it would work for an hour again. Changing the power supply made the problem go away.

 

It cannot be the same problem that I have, since my machine turns off but does not reboot again within a few minutes (there can actually be hours or days between the reboots). It's really random.

 *rjw8703 wrote:*   

> I had this problem a while back. It turned out to be the voltage regulators on the m/b were dying intermittently. Replacing the m/b fixed the problem.

 

Is there any way to test that, to make sure that it's the voltage regulator?

----------

## eccerr0r

 *Holysword wrote:*   

>  *Gusar wrote:*   Faulty power supply would be my guess. I had a problem kinda like that. After an hour or so, the machine would reboot. And then it would reboot every few minutes. Leaving it off for a while, it would work for an hour again. Changing the power supply made the problem go away. 
> 
> It cannot be the same problem that I have, since my machine turns off but does not reboot again within a few minutes (there can actually be hours or days between the reboots). It's really random.
> 
>  *rjw8703 wrote:*   I had this problem a while back. It turned out to be the voltage regulators on the m/b were dying intermittently. Replacing the m/b fixed the problem. 
> ...

 

Bad power will produce random results.

There's really no cheap way to test PSUs and motherboard voltage regulators.  The equipment needed is basically a high-speed digital storage oscilloscope (DSO).  A simple voltage check is not enough; it won't detect intermittent spikes.  Replacing the motherboard/PSU is _much_ cheaper, even just for diagnostics, unless you just so happen to have a DSO sitting in your pocket (or garage or something).

----------

## Holysword

Well, until I find someone to test my motherboard, I was checking dmesg and suddenly realized that from time to time it complains about the voltage of in5 being 0. Here is the relevant part of the "sensors" output:

```
in0:       +1.15 V  (min =  +0.00 V, max =  +4.08 V)
in1:       +2.14 V  (min =  +0.00 V, max =  +4.08 V)
in2:       +3.39 V  (min =  +0.00 V, max =  +4.08 V)
in3:       +2.96 V  (min =  +0.00 V, max =  +4.08 V)
in4:       +0.48 V  (min =  +0.00 V, max =  +0.74 V)
in5:       +0.00 V  (min =  +0.00 V, max =  +4.08 V)   ALARM
in6:       +1.06 V  (min =  +0.00 V, max =  +4.08 V)
in7:       +3.06 V  (min =  +0.00 V, max =  +4.08 V)
in8:       +3.31 V
```

I'm not sure about its meaning, and I don't know if that's something to worry about too, but the "ALARM" word never sounds good...

----------

## eccerr0r

Well, that just goes to show that lm-sensors is unreliable, and nothing else.  No transistor will work at 0V, so obviously something's wrong with the detection there, or that input is simply unused.  Or perhaps your PSU doesn't supply -5V, as most motherboards don't use it nowadays, in which case it's perfectly fine for it to be at 0V.

Not only that, the multiplier constants in your sensors.conf likely do not match your motherboard and are producing really unreliable results: where's your 12V line?  Where's your -12V line?  Which line is which?

"ALARM" just means the chip detected a number from the voltage line that was outside its bounds.  But what are the bounds?  The bounds are set up by software which once again may or may not match your motherboard to produce valid results.

I'm sorry, the cheapest way is to buy a new MB or PSU to test.  You might be able to get away with a multimeter, but multimeters also tend to be slow and won't detect fast glitches in your power.  There's no other way you can really tell for sure.  The on-board sensor chips are not only inaccurate but also slow: at most a few samples every second, versus millions of samples per second for a DSO.
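To put rough numbers on the sampling-rate gap, here is a back-of-the-envelope sketch. It assumes each sample is an instantaneous snapshot and that the glitch lands at a random moment, so the chance of one sample hitting it is about glitch duration times sample rate, capped at 1. The function name `catch_probability` and the figures used (a 10 microsecond glitch, 2 samples per second for a sensor chip, 1 million samples per second for a DSO) are illustrative assumptions, not measurements.

```shell
#!/bin/sh
# catch_probability DURATION_S RATE_HZ
# Approximate chance that a sampler running at RATE_HZ lands at least one
# sample inside a single glitch lasting DURATION_S seconds.
# Simplified model: instantaneous samples, randomly timed glitch.
catch_probability() {
    awk -v d="$1" -v f="$2" 'BEGIN {
        p = d * f
        if (p > 1) p = 1    # more than one sample per glitch: always seen
        printf "%.6f\n", p
    }'
}
```

With those assumed figures, `catch_probability 0.00001 2` prints `0.000020` (about 1 chance in 50,000 per glitch), while `catch_probability 0.00001 1000000` prints `1.000000`.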

----------

## Holysword

 *eccerr0r wrote:*   

> Well, that just goes to show that lm-sensors is unreliable, and nothing else.

 

Maybe it proves how n00b I am, 'cause I haven't configured those things properly :S Anyway, in the future I will try to configure it.

----------

## eccerr0r

 *Holysword wrote:*   

> Anyway, in the future I will try to configure [lm-sensors].

 

Unless someone else has figured out the numbers for the exact board you have -- there's really no way for a "mere mortal" to configure it, without knowing exactly how it's wired on the motherboard -- some reverse engineering or motherboard manufacturer support is needed.

The fact that another motherboard has the same chip as yours means nothing for configuration.  It needs to be the exact same motherboard and revision of motherboard to use the same config.  This is because the resistors used may be hooked up differently on different boards and different revisions.

I basically ignore lm-sensors numbers.  One of my machines looks like

```
subaru:/root# sensors
it8718-isa-0290
Adapter: ISA adapter
in0:       +1.25 V  (min =  +0.00 V, max =  +4.08 V)
in1:       +1.84 V  (min =  +0.00 V, max =  +4.08 V)
in2:       +3.31 V  (min =  +0.00 V, max =  +4.08 V)
in3:       +2.93 V  (min =  +0.00 V, max =  +4.08 V)
in4:       +3.06 V  (min =  +0.00 V, max =  +4.08 V)
in5:       +0.00 V  (min =  +0.00 V, max =  +4.08 V)   ALARM
in6:       +1.18 V  (min =  +0.00 V, max =  +4.08 V)
in7:       +4.08 V  (min =  +0.00 V, max =  +4.08 V)   ALARM
in8:       +3.18 V
fan1:     2096 RPM  (min =   10 RPM)
fan2:        0 RPM  (min =    0 RPM)
fan3:        0 RPM  (min =    0 RPM)
temp1:       +53 C  (low  =  +127 C, high =  +127 C)   sensor = thermistor
temp2:        -2 C  (low  =  +127 C, high =  +127 C)   sensor = thermistor
temp3:       +45 C  (low  =  +127 C, high =   +70 C)   sensor = diode
vid:      +0.000 V
```

yet this machine runs perfectly fine despite double the number of alarms...
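To see what each ALARM is actually reacting to, the reading can be printed next to its bounds. A minimal sketch, assuming the text layout of the "sensors" output shown above; the function name `alarm_report` is made up for illustration.

```shell
#!/bin/sh
# alarm_report: read `sensors`-style text on stdin and, for every line
# flagged ALARM, print the label with its reading, min and max.
# Hypothetical helper, not part of lm_sensors.
alarm_report() {
    awk '
        /ALARM[ \t]*$/ {
            label = substr($0, 1, index($0, ":") - 1)
            rest  = substr($0, index($0, ":") + 1)
            n = 0
            # Collect the decimal numbers in order: reading, min, max.
            while (match(rest, /[-+]?[0-9]+\.[0-9]+/)) {
                num[++n] = substr(rest, RSTART, RLENGTH) + 0
                rest = substr(rest, RSTART + RLENGTH)
            }
            if (n >= 3)
                printf "%s: reading=%.2f min=%.2f max=%.2f\n", label, num[1], num[2], num[3]
        }'
}
```

Fed the output above, `sensors | alarm_report` would print `in5: reading=0.00 min=0.00 max=4.08` and `in7: reading=4.08 min=0.00 max=4.08`: one input sitting on its minimum and one pegged at its maximum, which says more about the limits and wiring than about the power itself.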

Another of my machines is a bit more correct after hacking /etc/sensors.conf slightly and running sensors -s to reload:

```
doujima:~$ sensors
it87-isa-0290
Adapter: ISA adapter
CPU:       +1.58 V  (min =  +0.00 V, max =  +4.08 V)
RAM:       +2.50 V  (min =  +0.00 V, max =  +4.08 V)
+3.3V:     +3.20 V  (min =  +0.00 V, max =  +4.08 V)
+5V:       +4.87 V  (min =  +0.00 V, max =  +6.85 V)
+12V:     +12.16 V  (min =  +0.00 V, max = +16.32 V)
-12V:     -12.38 V  (min =  -0.00 V, max = -14.69 V)
-5V:       -5.54 V  (min =  -0.00 V, max =  -6.12 V)
Stdby:     +5.05 V  (min =  +0.00 V, max =  +6.85 V)
VBat:      +3.42 V
fan1/CPU: 3375 RPM  (min =    0 RPM, div = 8)
fan2/PS:  2220 RPM  (min =    0 RPM, div = 8)
Temp1/MB:    +34 C  (low  =    -2 C, high =  +254 C)   sensor = thermistor
Temp2/VRM:   +40 C  (low  =    -1 C, high =  +127 C)   sensor = thermistor
Temp3/CPU:   +53 C  (low  =    -1 C, high =  +127 C)   sensor = thermistor
```

yet I still would not bet a penny that those numbers are correct.  They merely "look" decent but may still be totally inaccurate (BTW the first is a Core 2 board (a Gigabyte G31 board), the second is an Athlon in a fairly popular ECS K7S5A).  There could be some correlation from model to model, but there's definitely no guarantee the same input number is attached to the same rail.
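The scaled maxima in the second output hint at how the conversion works for the positive rails: the chip's raw ceiling is 4.08 V, and 16.32 V is exactly 4 x 4.08, while 6.85 V is roughly 1.68 x 4.08. A sketch of that arithmetic, assuming a plain multiplicative factor per input. The function name `scale_reading` is made up, and the factors 4 and 1.68 are read off this one output, so they say nothing about any other board. The negative rails are left out of this sketch.

```shell
#!/bin/sh
# scale_reading RAW FACTOR
# Convert the voltage seen at the chip pin (RAW, at most 4.08 V here) into
# the rail voltage, given the board-specific divider FACTOR.
# Hypothetical helper; FACTOR depends on the resistors on the board.
scale_reading() {
    awk -v raw="$1" -v factor="$2" 'BEGIN { printf "%.2f\n", raw * factor }'
}
```

For example, `scale_reading 4.08 4` prints `16.32` and `scale_reading 4.08 1.68` prints `6.85`, matching the +12V and +5V maxima above. Apply the wrong factor to an input and the result looks just as tidy while being wrong.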

----------

