# Semi-freezing servers

## Belliash

Hello guys,

I have 2 servers with almost identical hardware, and on both of them I am experiencing the same problem.

From time to time, though not very frequently, both servers randomly semi-freeze. By semi-freeze I mean that I cannot connect to the frozen machine via SSH or some other services, but, for example, the web server is still responding...

When I open an IPMI connection and a KVM session, I see the login prompt and the system is still responding. Sometimes there is also information about an init reload. Unfortunately, when I enter my username and password, I don't get a shell prompt. I can still type on the console, but nothing really happens.

In other words, it looks like something happened to init: it got reloaded and some services got stopped, or not all of them got restarted. All I can do then to regain access to the server is to reboot the machine.

I have checked the disks and RAM and run some stress tests, and it does not look like a hardware problem. Both servers have nearly the same hardware, same configuration, same kernel, etc. The differences between the servers are the CPU model (L5630 vs L5640), the motherboard (X8STi-F vs X8ST3-F), and somewhat different PSUs... Some of the differences are significant - an additional on-board RAID controller, 4 vs 6 CPU cores - but they are still the same platform.

I have already excluded hardware problems and cannot find any issue at the OS level yet.

Maybe some of you have experienced a similar problem?

----------

## mike155

I agree that what you describe doesn't look like a hardware problem.

Which init system do you use? OpenRC? Systemd? Something else?

 *Quote:*   

> In other words, it looks like something happened to init, it got reloaded and some services got stopped, 

 

If something 'happens to init', you should see messages in the log files. What do the log files (or journalctl, if you use systemd) tell you? What does dmesg tell you?

----------

## bunder

The init reload is probably from updating OpenRC; I see that all the time on my high-uptime boxes.

edit: your description makes me wonder, though, if you lost all your disks... if that happens, anything in memory would still work until it needs to read from or write to disk... then it hangs, forever.

----------

## Belliash

I don't see any errors in the logs. Also, I cannot execute any command - no access to a shell even via IPMI, so no chance to check dmesg. However, nothing is being written to the logs.

I use OpenRC, as I have not migrated to systemd.

I have also checked SMART, the disk surfaces for bad blocks, as well as the RAID status. Both servers have 2 disks in RAID1 managed by mdadm. Also, no errors reported by fsck.

My first thought was that it was some hardware failure, as this happened on just one server. But recently it started to happen on the second one as well. Yesterday I upgraded the system on both of them, including the kernel, and rebooted them.

After about 15 hours 35 minutes I lost connection to one of them - at least that is the uptime reported by uptimed.

The Zabbix agent and the WWW server were running, but I could not access the machine via SSH. Also, when I logged in on the console via IPMI, it didn't show me a bash prompt.

What is more, even though I couldn't reach the WWW server myself, Zabbix kept reporting some access time for it the whole time (see attached screenshot).

For sure something happens there, because context switches, CPU load and CPU utilization all increased... that's what Zabbix reported before I rebooted the server.

```
     2     0 days, 15:35:22 | Linux 4.19.8-gentoo       Sat Dec 15 17:27:12 2018

->   3     0 days, 07:02:24 | Linux 4.19.8-gentoo       Sun Dec 16 09:04:39 2018
```

As you can see above, the server was booted yesterday at 17:27:12 after the upgrade, and after 15:35 (today at 9:04) I did a power cycle over IPMI...

----------

## Jaglover

I think bunder is right - for some reason you can't access your RAID array any more. You could set up remote logging... just an idea.
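A minimal syslog-ng forwarding destination for that would look something like this (a sketch: the hostname and port are placeholders, syntax per syslog-ng 3.x, and `src` is assumed to be the existing source block name). UDP is deliberate here - a fire-and-forget transport means log lines already sent survive even if the box wedges mid-freeze:

```
# /etc/syslog-ng/syslog-ng.conf -- hypothetical remote-logging sketch
destination d_remote {
    network("loghost.example.com" transport("udp") port(514));
};
log { source(src); destination(d_remote); };
```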

----------

## Belliash

I have set up remote logging and, unfortunately, nothing has been logged since my last post on either of them... When a server freezes, the logging just cuts off.

Anyway, I have found some SMART errors on one disk and I have replaced it. This, of course, has not resolved the problem.

Both servers are still freezing randomly. I cannot log in to them and they are not logging anything to the remote host.

The hard disks are healthy (the one that seemed to be broken has been replaced and the RAID has been rebuilt).

The strangest thing for me is that both servers have the exact same problem...

One more thing I have found in the logs, though there is no time correlation with the freezes:

```
syslog-ng[14366]: Number of allowed concurrent connections reached, rejecting connection; client='AF_UNIX(anonymous)', local='AF_UNIX(/dev/log)', max='256'
```
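If that limit is actually being hit, it can be raised on the /dev/log source. A sketch, assuming a stock-style `unix-stream` source block (the block name `src` and the value 512 are examples; `max-connections()` is the relevant syslog-ng option):

```
# /etc/syslog-ng/syslog-ng.conf -- only max-connections() added
source src {
    unix-stream("/dev/log" max-connections(512));
    internal();
};
```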

----------

## ct85711

I am thinking you may be encountering a similar issue to one someone else had: https://forums.gentoo.org/viewtopic-t-937120-start-0.html.

In short, what was happening there (from my understanding of that thread) is that the system ended up getting bogged down by something (spawning numerous processes) trying to access /dev/log. Part of it may have been compounded by his log file expanding to multiple GB. Sadly, that thread doesn't say whether they fixed it, but it may give you something to look at.

----------

## Belliash

 *ct85711 wrote:*   

> I am thinking you may be encountering a similar issue to one someone else had: https://forums.gentoo.org/viewtopic-t-937120-start-0.html.
> 
> In short, what was happening there (from my understanding of that thread) is that the system ended up getting bogged down by something (spawning numerous processes) trying to access /dev/log. Part of it may have been compounded by his log file expanding to multiple GB. Sadly, that thread doesn't say whether they fixed it, but it may give you something to look at.

 

I found a process that had been flooding syslog and I have resolved that problem, but the servers are still freezing...

----------

## Belliash


It does not look like a hardware problem.

I rebooted the server, logged in on the console via IPMI and waited until the server froze.

When it froze, I could not log in via SSH, nor on the console. But the session where I was already logged in was still working fine, so I had access to the system.

What I managed to check:

* the system had network connectivity,

* there was no disk issue - I could read and write from/to the FS, and /proc/mounts showed all disks mounted RW,

* there was no load reported by `uptime`,

* there were no processes in D or Z state,

* I could launch any program I wanted and kill any already running,

* nothing was logged by syslog until the freeze,

* there were several instances of "/usr/bin/sudo /usr/bin/doveadm replicator status" and "/usr/sbin/crond" running in the system. The first is used by Zabbix to monitor Dovecot replication, the second by cron to start cron jobs,

* `rc-status` showed all services running, and no process was missing from the system (as listed by `ps aux`),

* I could not take any action using `rc-service`. When I tried to restart sshd, I lost the console - the process froze on stopping the SSH daemon and I could not break its execution or send it to the background.

Unfortunately, I had opened just one console, and when I lost control over it, that was the end of my investigation...
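For the next occurrence, the checks above can be rolled into one script kept ready on the surviving console (a sketch, standard Linux userland only):

```shell
#!/bin/sh
# Rough one-shot version of the checks above, so a single surviving
# console can dump system state quickly.

# processes stuck in uninterruptible sleep (D) or zombie (Z) state
stuck=$(ps -eo stat=,pid=,comm= | awk '$1 ~ /^[DZ]/')
echo "stuck processes: ${stuck:-none}"

# filesystems that have dropped to read-only (a failing array often
# shows up here before anything reaches syslog)
ro=$(awk '$4 ~ /(^|,)ro(,|$)/ {print $2}' /proc/mounts)
echo "read-only mounts: ${ro:-none}"

# current load average
echo "load: $(cut -d' ' -f1-3 /proc/loadavg)"
```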

----------

## Belliash

I think syslog-ng is guilty... or some package logging to it...

I could not log in to the system because login was waiting for syslog... SSH sent a message to syslog and the whole connection process got stuck. When I tried to restart the service, OpenRC and/or SSH wanted to write something to the log about the service restart and froze on stopping the service. When I killed sshd and tried to start it again, it froze on starting the service. There were a lot of processes in the system waiting for syslog. But when I executed `killall -9 syslog-ng`, everything began to work again!

Do you have any ideas how to investigate syslog-ng? Or is strace all that's left for me?  :Razz: 
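A sketch of where strace would start, for the next time it wedges (the fallback to the current shell's PID is only so the example runs anywhere):

```shell
# Find the running syslog-ng (assumes a single instance); fall back to
# our own shell's PID only so the commands below can run anywhere.
pid=$(pidof syslog-ng) || pid=$$

# Its open file descriptors: a destination that went away, or a wedged
# pipe/socket, shows up here.
ls -l "/proc/$pid/fd"

# Trace its syscalls for a few seconds; a daemon blocked on a full
# socket sits in a single write()/sendmsg() and then goes silent.
# (Attaching to another user's process needs root; strace may not be
# installed, hence the || true.)
timeout 5 strace -f -tt -p "$pid" -o /tmp/syslog-ng.trace || true
```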

----------

## Ant P.

Oh, ouch. I've had a similar thing happen with metalog; it was being noisy and duping logs on stdout, so I naively tried closing the fd before starting it. Turns out it just filled up an internal buffer and blocked… and then so did everything else trying to call syslog(). Lesson learned: always give the syslog daemon a write sink, even if it's /dev/null.

----------

## dufeu

I've had mysterious 'freezing' occur on occasion. On my systems, I've tracked it down mostly to disk I/O interactions. In fact, I can reliably cause a 'freezing' event pretty much on demand in my environment. However, I don't know how to / lack the technical smarts to collect meaningful data about these events, so I haven't reported them.

Moreover, I've found there is more than one set of circumstances which can cause freezing. The main areas appear to involve direct disk I/O, disk I/O over Ethernet, graphical display, and Internet sessions. On the surface, all of these initially look pretty much the same. Some are much longer lasting than others. All require patience to get to the actual causes. While syslog is often useful, most of the 'freezes' I encounter don't report anything to syslog.

One of the things I do on initial boot now is to set up a terminal window running 'top -i -d10'. When a system freezes, this sometimes gives me a clue as to the running process. Any process running at ~100% CPU is a dead giveaway for a potential 'race' condition. I see this with btrfs and Chrome sessions.

I found the disk I/O tool 'iostat' and the NFS disk I/O tool 'nfsiostat' to be very helpful. For nfsiostat, you need to look at your disk I/O both at the server and remotely.

For X (graphical issues), the X logs are helpful. There appears to be a general problem with error handling when information about an open window disappears. The session involved seems to go into some kind of race condition and then into a permanent wait state.

For Chrome, click on a Chrome window and then hit Shift+ESC. What you're looking for here is CPU clock time and increasing memory usage. There are quite a few invisible site APIs which are broken, including ones provided by Google; however, 3rd-party ad servers are the worst offenders. For all browsers which invoke hardware acceleration, I've seen mysterious partial freezes, including freezes which prevent me from switching desktops. I have to either force a desktop session switch or log in remotely from another machine and kill that user session.

In general, I've found direct disk I/O to produce the most crippling freezes. Part of the issue with direct disk I/O freezes is that it is impossible for the kernel to know everything that is going on with a disk. For example, the disk's internal firmware does not tell the kernel about sector reallocations, error-correcting reads or anything else like this. When these events occur, the disk will automatically try a number of different read strategies to get/correct the desired data. None of this will appear in syslog. These strategies can take over 300 seconds. Depending on how your data is spread across your drives, some, many, or nearly all processes can end up waiting for disk I/O. And of course, if you try to look at disk I/O in this state, your attempt to look will freeze too, and nothing is reported to syslog.

You may want to look at my post "smartctl output format for seagate drives" here: I include my smartctl daemon settings and report script. I perform a smartctl-initiated short test on all my NAS drives once a day.
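For reference, smartd can schedule that daily short test itself. A sketch for /etc/smartd.conf (the device name and time-of-day are examples; the `-s` regex is smartd's T/MM/DD/d/HH test schedule):

```
# run a short self-test on /dev/sda every day at 02:00,
# monitor all attributes, and mail root on failures
/dev/sda -a -s S/../.././02 -m root
```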

For nfs especially, the default of 8 nfsd daemon instances can be unexpectedly crippling. One of the problems here is that a freezing event due to nfsd saturation is completely invisible under most circumstances; you'll only see it by using 'nfsiostat'. On my NAS, /etc/conf.d/nfs has:

```
OPTS_RPC_NFSD="2048"
```

For nfs on the server, I'm using 512 daemon instances per core. I've arrived at this value by actually testing different values, observing gross data-copying transfer rates, and using nfsiostat at both ends. The value which works best for you per core will depend on the amount of memory installed and on other running applications. I haven't been able to define a 'rule of thumb' for this value other than to say the default is nuts for everyone other than lightly loaded home servers. At this point, I use 128 (32 per core for a 4-core CPU) as the default value when installing a new small-office or home server.

I hope some of the above proves helpful in tracking down your issues.

Regarding my ability to 'freeze' on demand: basically, this is a function of the data transfer load between my NAS server and workstation, and of the specific applications used from within my desktop on the workstation. My NAS is configured with ~60 drives split into 10-drive sets using btrfs. Both the server and the workstation have 10Gb Mellanox Ethernet NICs attached to 10Gb ports on a 48/2-port switch. I run my own local bittorrent seedbox. Most applications work fine under normal usage. Some bulk data transfers can result in freezes. These include drag-and-drop file transfers (regardless of which window manager I use) from my workstation to my NAS or across NAS-hosted file systems. I have no problem opening a terminal window and using 'rsync' to do bulk transfers directly, e.g. on the workstation:

```
rsync -vans -e 'ssh' --info=all0,progress2 /source/directory/*  dest-account@NAS.mydomain:/destination/directory
```

e.g. on the NAS server, in a terminal window logged into the NAS:

```
rsync -vas --info=all0,progress2 /Silo02/source/directory/ /Silo05/destination/directory
```

Data transfers initiated this way are never a problem. Put another way, taking nfs out of the process results in success every time. BTW, I don't regard nfs to be the source of the problem here. This is more of an application of "Keep It Simple". See below.

For historical reasons, my seedbox runs on my workstation. The bittorrent work files are local on the workstation; the seeded files are on the NAS. There are potentially 4 different places where data fragmentation can occur in this environment: the bittorrent pieces, the NFS fragments, the Ethernet fragments and the btrfs chunks. This fragmentation can occur in both directions. Large torrents can comprise 20,000 pieces or more. At this time, I download torrents to my workstation and then use rsync from a terminal session to transfer the finished torrents to the NAS. I don't seem to have any problem with seeding in this configuration. The problems I used to have with downloading seem to be restricted to large torrents only.

However, as noted above, drag-and-drop of files between the workstation and the NAS, or across NAS clusters (from the workstation using nfs), can induce a 'freeze'. A few small files at a time are usually fine. Many or larger files, or multiple file-transfer instances, can be problematic. One way to observe this in a GUI desktop is to have your desktop's Network Monitor widget open: if 'freezing' occurs due to disk I/O through your network, the widget will show related network traffic dropping to virtually zero.

One of the things I have to wonder about is the impact of this fragmentation complexity on the virtual file systems used in the window managers.

The bottom line is to look at the state of your hard drives, your disk I/O and your application complexity. Sometimes problems don't appear until your applications scale up.

On a final note, you may also want to look at changing 'ulimit -n'. This is the number of permitted file descriptors per user process. If you're running a lot of user sessions via Web access, from the server's standpoint, that may end up looking like a single user with many open file descriptors. While the default value of 1024 is reasonable for a lot of people, I've had to up mine to 4096. For some applications, there are performance implications depending on the application's internal strategies for managing the number of open files it has.
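A persistent version of that change usually goes in /etc/security/limits.conf rather than a shell profile (a sketch; the values are examples, and the soft limit is what `ulimit -n` reports by default):

```
# raise the per-process open-file limit for all users
*    soft    nofile    4096
*    hard    nofile    8192
```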

----------

