# Memory exhausted: Desktop stalls instead of oom killer

## Apheus

Hello,

I noticed an annoying problem on at least two Gentoo machines, bare metal and a VirtualBox guest: if memory is exhausted by a user process, the complete desktop becomes unresponsive except for mouse cursor movement. One could suspect swap thrashing, but this also happens without any swap. Any action takes minutes to provoke a response, if it does at all. Sometimes only Alt+PrintScr+REISUB is possible. In the case of the virtual machine, I can see hard disk activity on the host in the task manager at those times, even when the guest does not have any swap, and I have no idea where the guest system reads from or writes to at that moment. I have never had data loss or anything interesting on the hard drives after a shut-off from that state - everything is clean after REISUB, or recovers with "recovering journal" after a hard shut-off.

Even ssh login as root from another machine takes ages to show the login prompt - if at all.
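As an aside, the REISUB sequence only works when magic SysRq is enabled; the current setting can be checked like this (a quick sanity check, nothing specific to this problem):

```shell
# 1 means fully enabled; other values are a bitmask of allowed SysRq functions.
cat /proc/sys/kernel/sysrq
```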

Happened yesterday in the virtual machine while emerging palemoon with MAKEOPTS -j 6 and $PORTAGE_TMPDIR in tmpfs - 10 GiB is not enough apparently. I was able - after long waits - to switch to tty1, login and kill the emerge. At that moment, the oom killer killed firefox. Which was neither necessary nor useful at that point in time.

Another test case is the Google Chrome resource exhaustion discovered in the wake of the iMac/iOS CSS bug: https://s3.eu-central-1.amazonaws.com/sabri/chrome-reaper.html (CAUTION with that link if you use Chrome/Chromium et al).

I have a memory limit for the desktop session via cgroups. When I set this up years ago, it was intended to protect root processes (like ssh) from that behavior of desktop processes. A cgroup named "gui" is configured in /etc/cgroup/cgconfig.conf and gets assigned via two lines in /etc/X11/startDM.sh. But the problem persists even if I switch that assignment off. I can only use memory.limit_in_bytes to control how much space the desktop processes get before the problem starts.
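For reference, such a setup looks roughly like this - the "gui" group name is from the post, the limit value is purely illustrative:

```
# /etc/cgroup/cgconfig.conf -- illustrative sketch, not the actual file
group gui {
    memory {
        memory.limit_in_bytes = 4294967296;   # 4 GiB, example value
    }
}
```

The assignment in startDM.sh would then be something along the lines of `echo $$ > /sys/fs/cgroup/memory/gui/tasks` (hypothetical; the post does not show the actual two lines).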

Desktop is KDE. I would expect problems if kwin_x11 cannot get memory, but the oom killer should still kick in - and pretty fast.

Any ideas?

Edit: Typo in path to cgconfig.conf

----------

## eccerr0r

Yes it's still "swapping" to do the best it can to prevent an OOM situation, even if you don't have swap.

You should add real swap or add more memory.  Not using your GUI when running memory intensive stuff will help the kernel reach true OOM faster.  

Ultimately you're still asking the kernel to put 11 liters of water into a 10 liter jug and all it's trying to do is prevent OOM from killing any programs until the very end when RAM is full of data it cannot figure out how to get elsewhere.

----------

## Apheus

Thanks.

 *eccerr0r wrote:*   

> Yes it's still "swapping" to do the best it can to prevent an OOM situation, even if you don't have swap.

 

But where? It won't say "Oh, the root partition is mounted rw and has 35 GiB free space, let's create a swap file there".
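Indeed it won't - a swap file has to be created and enabled by hand (path and size illustrative, all as root):

```shell
# Create and enable a 4 GiB swap file; the kernel never creates one on its own.
dd if=/dev/zero of=/swapfile bs=1M count=4096
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
```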

 *Quote:*   

> Not using your GUI when running memory intensive stuff will help the kernel reach true OOM faster.  

 

Why is that?

 *Quote:*   

> Ultimately you're still asking the kernel to put 11 liters of water into a 10 liter jug and all it's trying to do is prevent OOM from killing any programs until the very end when RAM is full of data it cannot figure out how to get elsewhere.

 

That's not what I want if the machine is unresponsive while the kernel is busy "figuring out", and this takes minutes or forever. Could the "figuring out" exhaust the CPU? The "unable to log in with ssh" feels like some kind of fork bomb running, unprotected by dynamic CPU cgroup scheduling.

What I want is the oom killer to have a quick trigger on the one responsible process.

I have experimented some more over the last few days, with another test case: thumbnail generation with ImageMagick's "magick montage". Putting this to work on many image files takes gigabytes of memory. The situation seems to be better if I set both memory.limit_in_bytes and memory.memsw.limit_in_bytes in cgconfig.conf to a value at or below physical memory. Of course the processes can never use swap at all with this configuration. It is possible to allow some swap with a slightly larger memory.memsw.limit_in_bytes value. The unresponsive wait time before OOM gets longer in that case because of hard drive access times - that is, several seconds. Acceptable, compared to a complete hang.
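In cgconfig.conf terms, the combination that behaved best looks roughly like this (values illustrative for a 10 GiB machine; note that memory.memsw.* requires swap accounting in the kernel, on many kernels enabled via swapaccount=1 on the command line):

```
group gui {
    memory {
        # RAM cap at or below physical memory
        memory.limit_in_bytes = 9663676416;        # 9 GiB, example value
        # RAM+swap cap slightly above it, allowing a little swap
        memory.memsw.limit_in_bytes = 10200547328; # 9.5 GiB, example value
    }
}
```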

The Google Chrome test case is not as comfortable: the OOM killer does not kick in, but I can still manage to switch to tty1, log in and kill the process. No REISUB necessary.

----------

## khayyam

Apheus ... from the various details I'm inclined to think that scheduling is the underlying issue. I suspect you're using CFS (Completely Fair Scheduler), and CFQ (IOSCHED_CFQ) for I/O scheduling (probably also CFS_BANDWIDTH, CGROUP_SCHED, etc. for cgroups). If that is the case then you might find better scheduling with MuQSS (the redesigned/updated/refactored BFS), and BFQ (the Budget Fair Queueing I/O scheduler). Anyhow ...

 *Apheus wrote:*   

> Happened yesterday in the virtual machine while emerging palemoon with MAKEOPTS -j 6 and $PORTAGE_TMPDIR in tmpfs - 10 GiB is not enough apparently.

 

Yes, because once you've unpacked your sources into tmpfs (palemoon is huge!!) there isn't going to be much left of the memory assigned to your VM. Added to this, cgroups limits the file system cache in conjunction with memory (at least it did, I don't use cgroups myself), and so that 10GiB may not fully function as a cache (or may underperform, because the cgroup is limiting the available memory, and with it what is stored in cache). That's partly speculation, as I think the major issue here is the CPU and I/O scheduling.

best ... khay

----------

## NeddySeagoon

Apheus,

The kernel has several mechanisms for swapping.

The swap partition only allows the kernel to swap dynamically allocated RAM, so by not having any swap space you rob the kernel of that option.

tmpfs can also be swapped to swap.

As the kernel memory-maps code into RAM and doesn't load anything until there is a page fault - that is, until something tries to use a page that isn't loaded - it can swap by dropping some of the already loaded code.

It will reload it later, if it's needed. This dropping and reloading of code is swapping.

Dirty buffers can be written to disk to free RAM.

Caches can be cleared and reloaded.

All in all, not having swap is a bad idea. All this other swapping goes on, you just don't see it.  A few MB of swap permanently in use is harmless. Much more, and you need more physical RAM.

Swap space not being used is an indicator that all is well.
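That invisible swapping can actually be watched: even on a swapless system, major page faults (code being reloaded from disk after having been dropped) show up in /proc/vmstat:

```shell
# pgmajfault: pages reloaded from disk after being dropped;
# pgpgin/pgpgout: data paged in from / written out to disk.
grep -E '^(pgpgin|pgpgout|pgmajfault) ' /proc/vmstat
```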

----------

## Apheus

Thanks.

 *khayyam wrote:*   

> I suspect you're using CFS (Completely Fair Scheduler)

 

Yes.

 *Quote:*   

> and CFQ (IOSCHED_CFQ), for I/O scheduling

 

Only on rotating disks on bare-metal. The virtual machine has deadline. The SSD on the bare-metal machine has noop. That's where / and swap are. But all my experiments in the last days happened in the virtual machine.
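For anyone wanting to verify this, the active scheduler per device is shown in brackets under sysfs:

```shell
# The scheduler in [brackets] is the active one for each block device.
for q in /sys/block/*/queue/scheduler; do
    [ -e "$q" ] || continue   # skip if the glob matched nothing
    printf '%s: %s\n' "${q%/queue/scheduler}" "$(cat "$q")"
done
```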

 *Quote:*   

> probably also CFS_BANDWIDTH, CGROUP_SCHED

 

```
CONFIG_CGROUP_SCHED=y
# CONFIG_CFS_BANDWIDTH is not set
```

 *Quote:*   

> you might find better scheduling with MuQSS (which is the redesigned/updated/refactored BFS), and BFQ (Budget Fair Queueing I/O scheduler).

 

I will try this, at least in my virtual machine.

 *Quote:*   

> Added to this, cgroups limits the file system cache in conjunction with memory (at least it did, don't use cgroups myself) 

 

Yes, this can be watched in system monitor or kinfocenter with a limit well below physical memory: The kernel clears filesystem cache to not exceed the limit.
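The same effect can be sampled without the GUI tools; under pressure from the limited group, the Cached figure drops while MemFree stays low:

```shell
# One-shot sample of free memory vs. page cache.
grep -E '^(MemFree|Cached):' /proc/meminfo
```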

----------

## Apheus

I tried =sys-kernel/ck-sources-4.14.69 with BFQ and MuQSS in the virtual machine. Since Google Chrome fixed the reaper bug, my test case is the new Firefox version from reaperbugs.com - Firefox 60.2.1 with an empty test profile.

I removed the limit for the "gui" cgroup in cgconfig.conf. 10GiB physical memory, 250 MiB swap partition.

At first, the behavior seemed better: kinfocenter with the memory charts remained responsive. At one point I could see half the swap space in use. I could Alt-Tab between windows, with ~1 minute delay. KDE marked the Firefox window as "Not responding". But after some minutes, everything came to a halt and I had to Alt+PrntScrn+REISUB.

----------

## khayyam

Apheus ...

hmmm ... ok, can you now test with the following:

```
PORTAGE_NICENESS="15"

PORTAGE_IONICE_COMMAND="ionice -c 3 schedtool -D \${PID}"
```

You will need sys-process/schedtool.

I have practically no experience with VM's, and so I can't say if this is the best option for testing, or not.

best ... khay

----------

## Apheus

 *khayyam wrote:*   

> 
> 
> ```
> PORTAGE_NICENESS="15"
> 
> ...

 

Does not seem to help, but the results are somewhat unstable, and I have another suspect now: setting swappiness to a slightly higher value leads to the memory-hog process running into a segmentation fault. I need more time to test this thoroughly.

I tested the Firefox reaper without limits, with swappiness 5 and 250 MiB swap: ck kernel+BFQ+MuQSS vs. gentoo kernel (CFQ, deadline), each with and without the niceness/scheduler commands:

```
nice -n 15 ionice -c 3 schedtool -D -e firefox...
```

In each of these four runs, Firefox crashed with a segmentation fault after ~1 min of very bad desktop responsiveness - except for the ck kernel + niceness command combination: the unusable-desktop time was longer in that case. Felt like 5 minutes.

The only difference from all the runs of the last few days that I can think of is swappiness: it had been 0 or 1, and is 5 now.

----------

## khayyam

Apheus ...

I'm reminded why I'm still on 3.12.x; I got sick of the near-constant instability when tracking "stable". But yes, it looks like a deeper problem there - try to reproduce it with 4.14, or 4.9 (not sure what kernel you currently have). Sorry I couldn't be of more help.

best ... khay

----------

## devsk

 *Quote:*   

> $PORTAGE_TMPDIR in tmpfs

 What's the tmpfs size?

Your system is thrashing (not swapping, which is the act of taking physical pages and writing them to the swap media), i.e. it's wasting all its CPU cycles looking for free pages and not finding many. All it can do is throw program pages out (because it can load them back in from disk when needed), use the freed-up pages for the build programs (compiler/linker/scripts), and load them back in when the build programs are done. It's doing this repeatedly.

Given a long time, it may work and eventually finish your emerge, but it can also easily just spin and spin, never find any free pages, and hang forever.

You need to size your system: either throw in more RAM or configure a larger swap. Your total committed address space usage has to be within RAM+swap. You cannot exceed it without paying the penalty of some form of thrashing (the scale goes from tolerable to forever-hang).

Do this in a VT while your compile is running: 

```
while true; do grep Committed_AS /proc/meminfo; sleep 1; done
```

Note the highest value it gets to over a period of time. Configure swap for that value minus the RAM you have, or just install that much RAM for a smoothly operating system. You can throw in a fudge ratio to lower the amount, but the more you lower it, the more thrashing you will see at peak loads.
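That sizing rule can be sketched in a few lines of shell (the peak value is a made-up example; Committed_AS and MemTotal are both reported in kB by /proc/meminfo):

```shell
# swap needed ~= peak Committed_AS - physical RAM (both in kB)
peak_kb=12582912   # illustrative: a 12 GiB peak you observed
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
need_kb=$(( peak_kb - mem_kb ))
if [ "$need_kb" -lt 0 ]; then need_kb=0; fi
echo "suggested swap: $(( need_kb / 1024 )) MiB"
```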

----------

## Apheus

 *devsk wrote:*   

>  *Quote:*   $PORTAGE_TMPDIR in tmpfs   
> 
> What's the tmpfs size?

 

Had been 10GiB, but I have it on disk now.

Thanks for the hints, but with PORTAGE_TMPDIR on disk everything is large enough here for normal operation. On my main desktop I even have 32GiB RAM.

I am more interested in the possibility of any user process stalling the machine. The browser reaper bugs should be a DoS attack on the browser process, not on the whole system.

I did some final experiments and can confirm: higher swappiness seems to help. With swappiness=60, I cannot reproduce a complete hang on bare metal. Setting limits seems to lead to worse wait times before the OOM kill/segfault of the hog process. On the other hand, in the VM with awfully slow disk I/O, without a limit the thrashing can go on longer than my REISUB patience. Setting limits below the physical memory amount leads to wait times of a few seconds.

I always thought swappiness should be low on a desktop system. Years ago, a perl memory-hog one-liner could stall the system exactly like that with swappiness=60, and I had to set it to 20. This seems to be reversed now.
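For anyone repeating this, swappiness can be inspected and changed on the fly (60 is the value that worked here; the sysctl write needs root and is not persistent - add vm.swappiness=60 to /etc/sysctl.conf to keep it across reboots):

```shell
# Current value:
cat /proc/sys/vm/swappiness
# Change it at runtime (as root):
# sysctl -w vm.swappiness=60
```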

Thanks for all the help. Enough experimentation. I have the following configuration now:

- 250 MiB swap partition
- swappiness 60
- no memory limits on bare metal
- in the VM:
  - half of physical memory as the limit for the "gui" cgroup (memory.limit_in_bytes=5G)
  - physical minus 2 GiB as the limit for the cgroups where emerge normally runs (memory.limit_in_bytes=8G)
- PORTAGE_TMPDIR on disk

----------

## devsk

lower swappiness => hunt for free pages at the last minute

higher swappiness => hunt for free pages at the earliest memory pressure

Higher swappiness amortizes the cost of the free-page hunt over a longer period of time and, on under-configured systems, is the right choice. Lower swappiness is better for a well (over-)configured system, because you will never encounter even the tiniest of hiccups from page hunts. Free-page hunts are world-stopping events because they can happen inline in the faulting process itself, or in kswapd in the kernel. Depending on conditions, they can hard-block for a while.

----------

