# Troubleshooting a critical system hang

## darkphader

I need some assistance tracking down the cause of a system hang.

The server (no X installed) handles DNS, DHCP, and NTP, plus file and print sharing via Samba. The client mainly notices that the shares are offline and that they can't get any work done, but in fact all services are suspended. The monitor is blank and access via SSH is refused. The system does, however, respond to an arping and usually to a ping as well. The logs indicate nothing.

There is no magic with magic SysRq: the elephants are very boring and the system does not reboot; it requires a power-on reset.
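One thing that may be worth ruling out is whether the SysRq key is enabled at all on this kernel. The following is only a minimal sketch: `sysrq_status` is a hypothetical helper name, and it assumes the kernel exposes the setting in `/proc/sys/kernel/sysrq` (0 means disabled, 1 means all functions enabled, anything else is a bitmask of allowed functions).

```shell
# Hypothetical helper: report how the magic SysRq key is configured.
# Reads a sysctl-style file (default: /proc/sys/kernel/sysrq).
sysrq_status() {
    file="${1:-/proc/sys/kernel/sysrq}"
    # Without this file the kernel was probably built without SysRq support.
    [ -r "$file" ] || { echo "unavailable"; return 1; }
    value=$(cat "$file")
    case "$value" in
        0) echo "disabled" ;;
        1) echo "all functions enabled" ;;
        *) echo "restricted (bitmask $value)" ;;
    esac
}
```

Calling `sysrq_status` with no argument checks the running system. Even when it reports that SysRq is enabled, a hard lockup that stops keyboard interrupts from being serviced will still ignore the key sequence, so a positive result does not rule out that kind of hang.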

No specific set of circumstances has been identified with the issue. It may run OK for only half an hour, but it usually runs for several days to a week before hanging. EDIT: I should also mention that it has only happened during production hours, never after hours.

Thanks for any clues/ideas.

Chris

----------

## Baly

It almost sounds like it may be going OOM. I would suggest you fire off a top job in batch mode, dumping to a file at an interval of anywhere from 30 seconds to 5 minutes, depending on your preference. That will at least let you look at what was occurring just before the system became unresponsive, and it may give you some additional insight. Something like:

top -b -d 30 > top.out &
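If the full top output turns out to be too bulky to keep for a week, a lighter companion log of free memory and swap might be enough to confirm or rule out the OOM theory. This is only a sketch under assumptions: `mem_snapshot` and `mem.log` are hypothetical names, and the fields are read from `/proc/meminfo`, which reports them in kB.

```shell
# Hypothetical helper: print one timestamped line with MemFree and SwapFree
# taken from a meminfo-style file (default: /proc/meminfo).
mem_snapshot() {
    meminfo="${1:-/proc/meminfo}"
    # awk picks out the two fields of interest; values are in kB.
    awk -v ts="$(date '+%Y-%m-%d %H:%M:%S')" '
        $1 == "MemFree:"  { free = $2 }
        $1 == "SwapFree:" { swap = $2 }
        END { printf "%s MemFree=%skB SwapFree=%skB\n", ts, free, swap }
    ' "$meminfo"
}

# Example loop, left commented out so that sourcing this file does not block:
# while :; do mem_snapshot >> mem.log; sync; sleep 30; done
```

The `sync` in the example loop is there so that the last samples have a better chance of reaching the disk before the hang. If MemFree and SwapFree both trend toward zero in the final entries, that would point toward memory exhaustion.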

----------

## pappy_mcfae

*darkphader wrote:*

> EDIT: should also mention it has only happened during production, never after hours.
> 
> Thanks for any clues/ideas.
> 
> Chris

 

Herein lies the biggest clue. When was the last time you opened said machine and checked it for dust accumulation, restricted airflow, or the like? Did it start acting this way suddenly, or progressively over time? Have you allowed the server to just sit doing nothing (or being "locked") until it comes back around of its own accord? Do you get excessive HDD light activity during the lockup, or does the HDD light stay off?

Since it is ping-able, that tells me it's not dying completely. However, it is clear that something is putting it under a lot of strain.

Good luck on that.

Blessed be!

Pappy

----------

## darkphader

Thanks for the ideas. I am now running top to see whether it provides any information about a failure. I will also look into the possibility of overheating. The system is in a rack in a very clean room, so I had discounted that entirely, but I agree it should be examined.
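For the overheating angle, temperatures could be logged alongside the top output. The sketch below assumes the kernel exposes sensors as `thermal_zone*/temp` files under `/sys/class/thermal`, with values in millidegrees Celsius; `temp_snapshot` and `temp.log` are hypothetical names. On a system without that interface, the `sensors` command from lm_sensors would be the usual alternative.

```shell
# Hypothetical helper: print one timestamped line per thermal zone found
# under the given directory (default: /sys/class/thermal).
# Assumes each zone has a "temp" file holding millidegrees Celsius.
temp_snapshot() {
    base="${1:-/sys/class/thermal}"
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    for zone in "$base"/thermal_zone*; do
        # Skip when the glob matched nothing or the file is unreadable.
        [ -r "$zone/temp" ] || continue
        milli=$(cat "$zone/temp") || continue
        [ -n "$milli" ] || continue
        # Integer division: whole degrees are enough for spotting a trend.
        echo "$ts ${zone##*/} $((milli / 1000))C"
    done
}

# Example loop, left commented out so that sourcing this file does not block:
# while :; do temp_snapshot >> temp.log; sync; sleep 30; done
```

A steady climb in the last entries before a hang would support the cooling theory, while flat readings would suggest looking elsewhere.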

----------

## pappy_mcfae

"Clean" is a relative term. No matter where computers are operated, there will be dust. Computers like dust, especially the cooling fins of the CPU heatsink.

Blessed be!

Pappy

----------

