# consistent freezes at 1 hour uptime?! [worked-around]

## JeroenV

Hi,

For no obvious reason, my small energy-saving EPIA server has started to misbehave: it consistently freezes at 1 hour of uptime (or 59 minutes?). It still responds to ping, but nothing more (no console logins either), and the logs show nothing out of the ordinary. The only way to recover is to use the power button.

I haven't done any heavy upgrading on this box just before this started, although I have tried a few versions of nfs-utils because I was having problems with subtree_check/stale nfs handle on clients. 

Any ideas what could have caused this   :Question: 

TIA   :Cool: 

EDIT: I noticed a process that is not familiar:

```
# ps auxc
...
root      1697  0.0  0.0      0     0 ?        S<   20:28   0:00 ksuspend_usbd
...
```

Since my freezes are so exactly timed, I think a good point to start troubleshooting is to disable everything related to time or powersaving (i.e. I already disabled ntp etc.)

The funny thing about ksuspend_usbd is that the only kernel option I could find that might be related to it seems to be disabled:

```
# cat .config | grep SUSP
# CONFIG_SOFTWARE_SUSPEND is not set
# CONFIG_APM_IGNORE_USER_SUSPEND is not set
# CONFIG_X86_GX_SUSPMOD is not set
# CONFIG_USB_SUSPEND is not set
```

Anyway, I'm still not any closer to a solution   :Sad: 
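Note that `grep SUSP` only catches options with that substring in their name. A broader look at the power-management part of the kernel config might be worth a try. This is only a sketch: `power_options` is a made-up helper name, and the list of option prefixes is an assumption about what could be relevant, not a complete list.

```bash
# power_options: list the power-management related lines of a kernel .config,
# both enabled ones and "is not set" ones.
# The prefixes below are a guess at what is relevant (assumption).
power_options() {
    grep -E '^(# )?CONFIG_(PM|APM|ACPI|CPU_FREQ|USB_SUSPEND|SOFTWARE_SUSPEND)([_= ]|$)' "${1:-.config}"
}
```

Run as `power_options /usr/src/linux/.config` (or without an argument from inside the kernel source directory), this would show at a glance whether general power management is still enabled even though the SUSPEND options are not.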

----------

## WhiteHat237

Maybe a cron job?  Cron.hourly?  what does:

```
cat /etc/crontab
```

show?

----------

## JeroenV

I don't have hourly cronjobs (/etc/cron.hourly is empty), and the funny thing is that the system hangs not on scheduled times (e.g. at every whole hour), but after exactly 1 hour uptime. I hacked my cron (crontab and a small change to /usr/sbin/run-crons) so that it supports cron.minutely, which I use to track this behaviour.

/etc/crontab

```
# /etc/crontab
*  *  * * *      rm -f /var/spool/cron/lastrun/cron.minutely
0  *  * * *      rm -f /var/spool/cron/lastrun/cron.hourly
1  3  * * *      rm -f /var/spool/cron/lastrun/cron.daily
15 4  * * 6      rm -f /var/spool/cron/lastrun/cron.weekly
30 5  1 * *      rm -f /var/spool/cron/lastrun/cron.monthly
*  *  * * *      /usr/bin/test -x /usr/sbin/run-crons && /usr/sbin/run-crons
# */10  *  * * *      /usr/bin/test -x /usr/sbin/run-crons && /usr/sbin/run-crons
```

/etc/cron.minutely/mark

```
#! /bin/bash
logger -t '-MARK-' -- \(still alive with uptime $(uptime)\)
```

Now my syslog looks like this (the filtered lines really don't show anything interesting, sometimes only cron-output for one hour):

```
# cat /var/log/messages | grep uptime
Mar 12 23:28:00 maple -MARK-: (still alive with uptime 23:28:00 up 43 min, 2 users, load average: 0.00, 0.12, 0.50)
Mar 12 23:29:00 maple -MARK-: (still alive with uptime 23:29:00 up 44 min, 2 users, load average: 0.00, 0.10, 0.46)
Mar 12 23:30:00 maple -MARK-: (still alive with uptime 23:30:00 up 45 min, 2 users, load average: 0.00, 0.08, 0.43)
Mar 12 23:31:00 maple -MARK-: (still alive with uptime 23:31:00 up 46 min, 2 users, load average: 0.00, 0.06, 0.40)
Mar 12 23:33:00 maple -MARK-: (still alive with uptime 23:33:00 up 48 min, 2 users, load average: 0.00, 0.04, 0.35)
Mar 12 23:34:00 maple -MARK-: (still alive with uptime 23:34:00 up 49 min, 2 users, load average: 0.00, 0.03, 0.32)
Mar 12 23:35:00 maple -MARK-: (still alive with uptime 23:35:00 up 50 min, 2 users, load average: 0.00, 0.02, 0.30)
Mar 12 23:37:00 maple -MARK-: (still alive with uptime 23:37:00 up 52 min, 2 users, load average: 0.00, 0.01, 0.26)
Mar 12 23:38:00 maple -MARK-: (still alive with uptime 23:38:00 up 53 min, 2 users, load average: 0.00, 0.00, 0.24)
Mar 12 23:39:00 maple -MARK-: (still alive with uptime 23:39:00 up 54 min, 2 users, load average: 0.00, 0.00, 0.22)
Mar 12 23:41:00 maple -MARK-: (still alive with uptime 23:41:00 up 56 min, 2 users, load average: 0.00, 0.00, 0.19)
Mar 12 23:42:00 maple -MARK-: (still alive with uptime 23:42:00 up 57 min, 2 users, load average: 0.00, 0.00, 0.17)
Mar 12 23:43:00 maple -MARK-: (still alive with uptime 23:43:00 up 58 min, 2 users, load average: 0.00, 0.00, 0.16)
Mar 13 08:31:00 maple -MARK-: (still alive with uptime 08:31:00 up 1 min, 0 users, load average: 1.30, 0.57, 0.21)
Mar 13 08:32:00 maple -MARK-: (still alive with uptime 08:32:00 up 2 min, 0 users, load average: 0.48, 0.47, 0.19)
....
....
Mar 13 09:25:00 maple -MARK-: (still alive with uptime 09:25:00 up 55 min, 1 user, load average: 0.00, 0.00, 0.00)
Mar 13 09:26:00 maple -MARK-: (still alive with uptime 09:26:00 up 56 min, 1 user, load average: 0.00, 0.00, 0.00)
Mar 13 09:27:00 maple -MARK-: (still alive with uptime 09:27:00 up 57 min, 1 user, load average: 0.00, 0.00, 0.00)
Mar 13 09:28:00 maple -MARK-: (still alive with uptime 09:28:00 up 58 min, 1 user, load average: 0.00, 0.00, 0.00)
Mar 13 09:29:00 maple -MARK-: (still alive with uptime 09:29:00 up 59 min, 1 user, load average: 0.00, 0.00, 0.00)
Mar 13 10:09:00 maple -MARK-: (still alive with uptime 10:09:00 up 1 min, 1 user, load average: 1.61, 0.82, 0.31)
Mar 13 10:10:00 maple -MARK-: (still alive with uptime 10:10:00 up 2 min, 1 user, load average: 0.59, 0.67, 0.29)
Mar 13 10:11:00 maple -MARK-: (still alive with uptime 10:11:00 up 3 min, 1 user, load average: 0.21, 0.55, 0.27)
Mar 13 10:12:00 maple -MARK-: (still alive with uptime 10:12:00 up 4 min, 1 user, load average: 0.14, 0.46, 0.25)
....
....
Mar 13 11:03:00 maple -MARK-: (still alive with uptime 11:03:00 up 55 min, 2 users, load average: 0.02, 0.16, 0.10)
Mar 13 11:04:00 maple -MARK-: (still alive with uptime 11:04:00 up 56 min, 2 users, load average: 0.01, 0.12, 0.09)
Mar 13 11:06:00 maple -MARK-: (still alive with uptime 11:06:00 up 58 min, 2 users, load average: 0.00, 0.08, 0.08)
Mar 13 11:21:01 maple -MARK-: (still alive with uptime 11:21:01 up 1 min, 0 users, load average: 1.30, 0.43, 0.15)
Mar 13 11:22:00 maple -MARK-: (still alive with uptime 11:22:00 up 2 min, 1 user, load average: 0.52, 0.36, 0.14)
Mar 13 11:23:00 maple -MARK-: (still alive with uptime 11:23:00 up 3 min, 1 user, load average: 0.61, 0.42, 0.18)
Mar 13 11:24:00 maple -MARK-: (still alive with uptime 11:24:00 up 4 min, 1 user, load average: 0.56, 0.42, 0.19)
....
....
Mar 13 12:18:00 maple -MARK-: (still alive with uptime 12:18:00 up 58 min, 2 users, load average: 0.00, 0.27, 0.75)
Mar 13 12:19:00 maple -MARK-: (still alive with uptime 12:19:00 up 59 min, 2 users, load average: 0.00, 0.22, 0.70)
Mar 13 12:22:15 maple -MARK-: (still alive with uptime 12:22:15 up 1 min, 0 users, load average: 1.33, 0.45, 0.16)
Mar 13 12:23:00 maple -MARK-: (still alive with uptime 12:23:00 up 2 min, 0 users, load average: 0.63, 0.39, 0.15)
....
....
Mar 13 13:18:00 maple -MARK-: (still alive with uptime 13:18:00 up 57 min, 1 user, load average: 0.00, 0.00, 0.00)
Mar 13 13:20:00 maple -MARK-: (still alive with uptime 13:20:00 up 59 min, 1 user, load average: 0.00, 0.00, 0.00)
Mar 13 13:26:09 maple -MARK-: (still alive with uptime 13:26:09 up 1 min, 0 users, load average: 1.83, 0.75, 0.27)
Mar 13 13:27:00 maple -MARK-: (still alive with uptime 13:27:00 up 2 min, 0 users, load average: 0.63, 0.64, 0.26)
....
....
```
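To read off the last uptime reached in each boot session without scanning the log by eye, the -MARK- lines could be summarized with a bit of awk. This is a minimal sketch that assumes the exact line format shown above; `last_marks` is a hypothetical helper name, and a reboot is simply assumed whenever the logged uptime goes down from one mark to the next.

```bash
# last_marks: read syslog lines on stdin and print, for every reboot, the
# last -MARK- entry logged before it (hypothetical helper, not from the thread).
last_marks() {
    awk '
    /-MARK-/ {
        # find the "up" token; the uptime itself follows it
        for (i = 1; i <= NF; i++) if ($i == "up") break
        if (i >= NF) next
        j = i + 1
        mins = 0
        if ($(j + 1) ~ /^days?,$/) {            # "3 days, 23:50,"
            mins += $j * 1440
            j += 2
        }
        if ($(j + 1) ~ /^min,?$/) {             # "58 min,"
            mins += $j
        } else {                                # "1:58,"
            split($j, hm, ":")
            mins += hm[1] * 60 + hm[2]
        }
        # uptime went down since the previous mark: there was a reboot
        if (seen && mins < prev)
            printf "%s last mark at %d min uptime\n", stamp, prev
        prev = mins
        stamp = $1 " " $2 " " $3
        seen = 1
    }'
}
```

It would be fed with `last_marks < /var/log/messages`. Since a mark is occasionally missing from the log, the reported value is a lower bound on the uptime at which the box actually froze.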

----------

## JeroenV

Ok, it seems to be magically solved   :Surprised: 

For the sake of productivity I have taken many measures at the same time instead of very systematically isolating the issue, so the solution could have been among the following:

- The more obscure BIOS powersaving settings

- emerge -e system (refreshing the system)

- slightly more stripped kernel (i.e. some things I don't really need stripped out)

- hdparm -s0 -S0 ....

Anyway, I now reached an uptime of 1:30 already   :Very Happy: 

----------

## JeroenV

It doesn't get any stranger than this: the problem came back   :Twisted Evil: 

This time with an uptime quota of exactly 2 hours   :Exclamation: 

The only thing I did to make the problem reappear was to reconnect the original PSU (I had been running for 2 weeks under "lab conditions" with an opened case and another PSU) and close the case.

After shit happened, I reversed this (opened the case and reconnected the PSU with which it had been running successfully for 2 weeks), but the problem remained.

I can't see how this could possibly be a hardware problem anyway: the freezes are too well timed for that. On the other hand, I've never experienced such frequent hard-locks on a non-GUI system due to a software problem (under Linux, that is).

----------

## Janne Pikkarainen

I've seen one server behaving roughly like this - in that case a faulty CPU was the reason.

You might want to try some live-cd, such as Knoppix, and run some stress-tests from there. If that crashes, too, then it must be the hardware.

----------

## widan

On this page there is a comment about crashes after approximately one hour caused by CPU frequency scaling on EPIA boards:

 *Quote:*   

> Lets try to make a web server to leave on all the time at home ... I can scale the cpu clock down to reduce to a minimum power consumption ... longhaul crashes the system after one hour or so (this is true of all 3 systems).

 

----------

## JeroenV

thanks, eventually I made a boot-usb-stick with the gentoo minimal install cd on it, and I've been running on it now for more than 2 hours, i.e. the problem lies somewhere in my software. It's extremely strange though that I can't seem to track down which package.

I've already tried a kernel without any cpufreq and powersaving stuff => no difference. It seems I'm going to have to install this box from scratch   :Evil or Very Mad: 

@Janne:

I can't possibly imagine how it could be faulty hardware if it always hangs exactly after 2 hours uptime, regardless of system load  :Question: 

(actually this machine doesn't do much more than run a webserver, receive faxes and serve a few files)

----------

## JeroenV

update:

It's still quite strange: I re-installed the server from scratch, during installation (also after booting into my own kernel) everything ran fine.

After installing all software and rebooting, the problem re-appeared, i.e. freezes at exactly 2 hours after boot.

I rebooted, and at 1:55 uptime stopped all services (all except sshd from default runlevel) and unloaded all kernel-modules not in use.

The system made it past the 2:00 mark, and at 2:05 I restarted all services and the system has been running fine since (up 3 days, 23:50)

Since I have done something similar before, but left some services running, my prime suspects are now the conexant hsf modem modules and/or hylafax, though why they would cause such well-timed freezes is still a mystery.

I'll do some more research to straighten this out when I have more time, any informed guesses about hylafax or modem driver behaviour are of course welcome.
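The "unload all kernel modules not in use" step described above can be scripted. A minimal sketch, assuming the usual `/proc/modules` layout in which the third column is the use count; `unused_modules` is a hypothetical helper name.

```bash
# unused_modules: print the names of loaded modules whose use count is 0.
# Reads /proc/modules by default; the file argument only exists for testing.
unused_modules() {
    awk '$3 == 0 { print $1 }' "${1:-/proc/modules}"
}
```

Something like `for m in $(unused_modules); do modprobe -r "$m"; done` would then try to remove them. Removing one module can free up others, so the loop may need to run more than once, and unloading by hand one module at a time would help narrow down which of them is involved.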

----------

## JeroenV

after I had to reboot the server after almost a month uptime, the same problem re-appeared.

Indeed it is confirmed that I get past 2 hours uptime if I stop faxgetty and unload the conexant hsf modem modules.

I made an "anti-freeze watchdog" that at 1:58 uptime does the following:

1. disable faxgetty in inittab

2. reload inittab (init q)

3. stop all services except local and ssh

4. wait 3 minutes

5. reverse the above

The script is automatically run after a reboot:

```
#!/bin/bash
# freezedog.sh: take faxgetty and most services down across the critical
# 2 hour uptime mark, then bring them back up again.

echo "Started $0"

# poll once a minute until the third field of uptime reads 1:58
while ! uptime | awk '{ print $3 }' | grep -q '^1:58'; do
        sleep 1m
        echo "$0: Evaluate uptime $(uptime)"
done

echo "Critical uptime $(uptime) reached!"

# every service listed by rc-status, except sshd, local and this script
TOKILL="$(rc-status | awk '{ print $1 }' | grep -v Runlevel | grep -v sshd | grep -v local | grep -v freezedog)"

# switch off faxgetty
echo "backing up inittab and switching off faxgetty"
cp /etc/inittab /etc/.inittab.bak
sed 's/S0:23/#S0:23/' /etc/.inittab.bak > /etc/inittab

echo "re-reading inittab"
init q

# put the original file back right away; init only sees it at the next "init q"
echo "restoring inittab"
cp -f /etc/.inittab.bak /etc/inittab

# also the modem is stopped (/etc/init.d/hsf stop unloads all hsf modules)
echo "stopping services $TOKILL"
for N in $TOKILL; do
        "/etc/init.d/$N" stop
done
echo "stopped services"

sleep 3m

echo "starting services $TOKILL"
for N in $TOKILL; do
        "/etc/init.d/$N" start
done
echo "started services"

# faxgetty comes back because the restored inittab is read again
echo "re-reading inittab"
init q

echo "Critical uptime $(uptime) passed, quitting..."
```

Who would have thought that something as strange as this would have ever been necessary   :Shocked: 
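One fragile spot in a watchdog like this is matching the human-readable `uptime` output, whose format changes with the length of the uptime. A possible alternative for the wait loop is to compare the seconds in `/proc/uptime` instead. This is only a sketch of that idea, not what the script above does; `uptime_seconds` and `reached` are made-up names.

```bash
# uptime_seconds: whole seconds since boot, read from /proc/uptime
# (the optional file argument is only there for testing).
uptime_seconds() {
    cut -d. -f1 "${1:-/proc/uptime}"
}

# reached: succeed once the uptime is at or past the given number of seconds.
reached() {
    [ "$(uptime_seconds "$2")" -ge "$1" ]
}
```

The loop would then read `while ! reached $((118 * 60)); do sleep 30; done`. Unlike the exact match on 1:58, this also fires when the script is started after the 1:58 mark, which may or may not be what you want.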

----------

