# Improving memory manager issues in out of memory situations

## alex6z

This is mostly about how the Linux memory manager handles low memory or out of memory situations, especially concerning mmap()ed (memory mapped) pages.

If you want to set up your computer so that it is always fast and responsive, where the cursor in Xorg is always smooth, just disable swap, and everything will stay in RAM and any program that uses too much RAM will be killed, right? Unfortunately it doesn't work that way. Have you ever had your computer run out of memory and sit there locked up for 10 minutes? This is because of mmap(). You can disable overcommitting, but then you need a lot of memory, because a lot of memory ends up being used inefficiently due to programs' excessive use of mmap() on sometimes unimportant data. Even if you leave overcommitting and swap turned on as usual, improving the memory manager's handling of mmap() could help desktop performance.

An overview of starting a normal program

When you start a program in linux, /lib/ld-linux.so.2 is invoked and takes care of setting up the program in memory so that it can run. It mmap()s everything needed for the program and then starts the program.
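As a rough illustration (this is a hedged sketch, not the real ld-linux.so source; `map_text()` is a made-up helper name), the loader's mapping step boils down to mmap()ing the file's segments with the right protections, roughly like this:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only and executable, the way a loader maps a
 * text segment.  Returns NULL on failure, else the mapping address,
 * with its length stored in *len. */
void *map_text(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }

    /* MAP_PRIVATE text pages are clean copies of the file, so under
     * memory pressure the kernel can simply drop them and re-read
     * them from disk later -- the "flushing" this thread is about. */
    void *p = mmap(NULL, st.st_size, PROT_READ | PROT_EXEC,
                   MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping keeps the file referenced */

    if (p == MAP_FAILED)
        return NULL;
    *len = st.st_size;
    return p;
}
```

Typical use would be `size_t n; void *text = map_text("/bin/sh", &n);` followed later by `munmap(text, n);`.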

A running program may be holding the following things in RAM or using these resources:

1a) Program executable code is mmap()ed into RAM. This includes the program itself and any shared libraries. This gets flushed (dropped from RAM) in a low memory situation. If the program needs it again, Linux will copy the particular page from the hard drive back into RAM.

1b) Program read-only data is mmap()ed into RAM, such as static data that is included with the program: icons, fonts, etc. This gets flushed in a low memory situation.

2) Dynamic program data like the stack, heap and any memory allocated with malloc() is given its own section of RAM to occupy. This will be copied to the swap file and removed from memory in a low memory situation.

3) mmap()ed program data. This is data that the program itself has mapped into memory for better performance and ease of use. An image file to be burned to a CD, or any file that a program reads or writes, can be mmap()ed into memory. Also, read-only data like icons and fonts can be mmap()ed by the program after it starts, rather than by /lib/ld-linux.so. In a low memory situation, this is flushed. Often this data could instead be read from the file and copied into memory using open() and read() rather than mmap().
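For contrast, here is a minimal sketch (helper names are made up for illustration, and the single read() call is a simplification) of the two ways a program can get at a file's contents: mapping it with mmap(), or copying it into malloc()ed memory with open()/read() as suggested above:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Option 1: mmap() the file.  Pages are faulted in lazily on first
 * touch and, being clean file pages, can be flushed under memory
 * pressure and re-read from the file later. */
void *load_mapped(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return NULL;
    *len = st.st_size;
    return p;
}

/* Option 2: copy the file into anonymous memory.  These pages are
 * ordinary program data: under pressure they go to swap, not back to
 * the file.  (A read() loop would be more robust than one call.) */
void *load_copied(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    char *buf = malloc(st.st_size);
    if (buf && read(fd, buf, st.st_size) != st.st_size) {
        free(buf);
        buf = NULL;
    }
    close(fd);
    if (buf) *len = st.st_size;
    return buf;
}
```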

So what happens in a low memory situation?

Let's assume that swap is disabled in this example. If swap were enabled, all that would change is that dynamic program data would also be moved to the swap file. So just ignore swap, because it isn't the problem.

When the system runs low on memory, mmap()ed pages are flushed. Pages which haven't been used in a while are flushed first. Program executable code, read-only data, and any mmap()ed files are flushed from RAM and remain only in files on the disk. Eventually, if all the RAM is used up, everything comes to a grinding halt, because every page of executable code has to be copied from the disk into RAM in order to run.

Of course you know that the disk is much slower than RAM, and many thousands of times slower when it has to read data that is not in order, seeking to different areas of the disk. If your system runs out of memory and locks up to where the mouse won't even move, what is happening is that the executable code of Xorg has been flushed to disk, and it is running page by page from the disk instead of RAM. Eventually the OOM killer will be invoked and kill the process that tries to allocate more memory. Then things will hopefully free up.

Now mmap()ed executable code, important mmap()ed data, and unimportant mmap()ed data all share the same memory. What makes the situation worse is that executable code is not read in order, while data often is. A program jumps around to different areas as different functions are called. When paging from the disk this is very slow, as the disk has to seek to every location the executable code needs. When a program is reading mmap()ed data from a file, it is often reading the data in order or in large chunks, which can be read from the disk at a fairly high speed. What happens here is that mmap()ed data gets read quickly and pushes other mmap()ed areas out of RAM. When an executable program is running, it slowly gets put back into RAM page by page. Executable code ends up quickly being flushed again as it gets replaced by the mmap()ed data which is being read.

For example, suppose you have 8MB of RAM available to two programs: one with 1MB of mmap()ed executable code and a tiny bit of data, and a small 24kb executable that is accessing a 100mb mmap()ed file (reading the file in order). The 100mb file can be entirely read, and much of it copied into RAM, while the 1MB program has only had a few pages copied into RAM and can hardly even run. This is because it takes about as long to read a few pages of the 1MB program, with the hard drive skipping around, as it does to read a large section of the 100mb data file, where the hard drive can read in order. As soon as a 4kb page is read for the 1MB program, the operating system goes back and reads a large chunk from the 100mb file. If 1mb chunks are being read from the 100mb file, and 4kb pages are being read for the 1MB program, it will take 100 disk seeks to read the 100mb file but up to 256 disk seeks to execute the 1MB program. After a while, the pages from the 1MB program get flushed again, all 8MB of memory goes back to being used by the program which is loading the 100mb data file, and the whole thing repeats.
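The seek arithmetic above can be checked with a trivial helper (the name is made up for illustration; this ignores read-ahead and the disk scheduler, so it's a worst-case ceiling, not a measurement):

```c
/* Worst-case number of disk seeks to fault in `total_bytes` of a
 * file when each fault reads `chunk_bytes` at a time (rounds up). */
long seeks_needed(long total_bytes, long chunk_bytes)
{
    return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}

/* 100MB read in 1MB chunks  -> 100 seeks.
 * 1MB of text faulted in 4KB pages -> up to 256 seeks. */
```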

Now the obvious solution would be to make the 24kb program use open() instead of mmap(), so it doesn't ever use more than 1mb of memory. Or, could we figure out some way to give the 1MB program priority in memory so its pages always stay in ram?

A fix?

What if the memory manager could be enhanced with a high_priority_mmap() feature? When this is used, these pages would be the last to be flushed. Normal mmap()ed files would be flushed first, next dynamic data would be moved from RAM into the swap file, and lastly the high priority pages would be flushed.

What you would do is modify /lib/ld-linux.so so that it uses high_priority_mmap() when loading programs. Every other program, including poorly written programs that mmap() icons, fonts, data files, and other non-executable data (which is generally read in order), would use the normal mmap(). This way executable code would be the last to be flushed, eliminating the problem in the example above where the 1MB program can't run: its pages would have higher priority and stay in RAM, while the lower priority pages from the small program processing the 100mb data file would get flushed first.

I'd also like to see a feature with the overcommit option where you can set it so that only pages mmap()ed with the "high priority" or "executable priority" flag and dynamic data (heap, stack, malloc) count toward the overcommit ratio, while normal mmap() is allowed to overcommit. This would guarantee that there is enough room for executable code and dynamic data in RAM, but a program which mmap()s a large data file would still be able to do so without exceeding the overcommit ratio.

Or, a 1st overcommit limit could count high priority mmap()ed pages, a 2nd overcommit limit would count normal dynamic data, and normal mmap()ed files would count against a 3rd overcommit limit. The 1st limit (for high priority pages) could be set to the size of the RAM installed in the system, guaranteeing enough room for programs to stay in RAM without their pages being flushed. The 2nd limit could be set to the size of the swap file, guaranteeing that there is enough room in swap for all in-memory data such as stack, heap and malloc()'ed memory. Exceeding this limit would result in malloc() returning NULL, like it is supposed to, instead of the OOM killer killing the program. The 3rd limit would be for regular mmap() and would be set to a very high value, for when mmap() is used for accessing regular files, since there is no harm in overcommitting this type of access.

Of course these values could be changed as needed. There could also be a combined 1st+2nd overcommit limit that would not allow the 1st and 2nd limits added together to exceed a certain value. You would set this to the available RAM in your system if you want to make sure that swap is not used and no pages from executables are flushed.
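For reference, the kernel already exposes a single-tier version of this: vm.overcommit_memory (0 = heuristic, 1 = always allow, 2 = never exceed the commit limit) and vm.overcommit_ratio under /proc/sys/vm/. A small sketch that reads the current policy (the helper name is made up for illustration):

```c
#include <stdio.h>

/* Read a single integer from a /proc or /sys file; -1 on failure. */
int read_proc_int(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int v = -1;
    if (fscanf(f, "%d", &v) != 1)
        v = -1;
    fclose(f);
    return v;
}

/* Usage:
 *   int mode = read_proc_int("/proc/sys/vm/overcommit_memory");
 *   // 0, 1 or 2 on Linux -- one knob, where the proposal above
 *   // would add two more tiers of limits alongside it.
 */
```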

----------

## eccerr0r

TLDRC (too long, didn't read carefully)

Most memory managers use some sort of LRU algorithm to evict pages that were not recently used.  However the only solution if everything is being used is...

Get more memory.

And I think you mentioned that writing programs differently is another way to do this.  More likely than not, this is the correct way to do it -- why have things in RAM when you don't need them, and why not write code such that each block of code is as compact as possible, so that paging happens as little as possible?  This has lots of other benefits beyond fitting your program into available RAM.

Sounds like what you're proposing is a tiered memory allocation system.  While this sounds good and helps the OS know which pages should be kept and which can be evicted, I think what you're saying is mostly there already -- the kernel knows which pages are executable and which are data.  Also there's the possibility that huge executable apps have a _lot_ of unused code (bloat?) and just a small fraction of the code is in use at any one time.  Should all of these text pages be marked 'high priority' when loaded and not be flushed when another small app needs more memory to be faster?

I think LRU typically does a decent job in most cases, but can end up thrashing as stated before (it all depends on the usage pattern).  But if pages are being accessed in a way LRU handles well, it should work fine... Else what really is needed is...

Get more memory.

----------

## alex6z

I think the current implementation does a great job of obtaining maximum performance for a single process that is running. The page flushing system works perfectly for this.

 *Quote:*   

>  I think what you're saying is mostly there already -- the kernel knows which pages are executable and which are data.

 

I don't know the internals of the kernel, but you mean that the kernel will flush non-executable pages before flushing executable pages, correct?

 *Quote:*   

> Also there's the possibility that huge executable apps have a _lot_ of unused code (bloat?) and just a small fraction of the code is in use at any one time. Should all of these text pages be marked 'high priority' when loaded and not be flushed when another small app needs more memory to be faster? 

 

High priority pages would still be flushed; it's just that mmap()ed data and allocated memory (more data) would be flushed first. The actual amount of executable code in a program is quite small compared to the amount of data that it loads, isn't it? It is much more efficient to load data from the disk, since data is usually read in large chunks or in order, whereas executable code is loaded from the disk very slowly, page by page. Therefore executable code should have a higher priority.

You say get more RAM. Well of course, but that isn't always an option. If everybody could get as much RAM as they needed, nobody would have a swap partition on their Linux box.

----------

## pilla

Moved from Gentoo Chat to Kernel & Hardware.

----------

## x22

There already is a system call which gives hints to the kernel about the expected use of some memory: madvise().

(And same call for files, "fadvise".)

----------

## timeBandit

 *alex6z wrote:*   

> The actual amount of executable in a program is quite small compared to the amount of data that it loads isn't it? ... Therefore executable code should have a higher priority.

 Generally, no. Most Linux applications delegate a ton of their code to shared libraries and you have to consider that as part of the executable footprint, even if the memory is shared between apps. Also, when discussing GUI applications, much of the data relates to rendering of the UI (fonts, pixmaps, etc.) and should be given almost equal priority to executable code, to maintain good interactive response.

Really what you've described is an overstressed system, and there is no magical kernel-level fix for that. The complementary answer to "get more memory" is "use less memory." This can be accomplished by re-coding applications to be smart about their working sets and be efficient in low-memory environments, or by simply not trying to overload available RAM.

----------

## alex6z

 *timeBandit wrote:*   

> Generally, no. Most Linux applications delegate a ton of their code to shared libraries and you have to consider that as part of the executable footprint, even if the memory is shared between apps. Also, when discussing GUI applications, much of the data relates to rendering of the UI (fonts, pixmaps, etc.) and should be given almost equal priority to executable code, to maintain good interactive response.

 

Well, see, what I was thinking is that when a program re-reads flushed pages from mmap()ed data like pixmaps, fonts, etc., usually it is reading the ENTIRE file IN ORDER. It reads the entire font file or pixmap at once. This greatly helps performance, because modern hard drives read sequential data very quickly, so the time taken to load it back into RAM is relatively small.

But when you have executable code, whether in shared libraries or the program itself, it is not executed an entire file at a time and in order. There are function calls which are thrown about and randomly called. A few 4kb pages from one spot get loaded, then another spot, then another; then in a different library a few pages are needed, then another few pages. Each time, the hard drive has to seek to different parts of the file, and to many different files in many different shared libraries. If one library calls a function from another library, the drive has to seek clear across the disk to a different file, load a few pages into RAM, and then go back. An entire font or pixmap could have been loaded in that time.

Since executable code takes so much longer to load back into RAM after it is flushed, that's why I think mmap()ed data should be flushed before executable code.

I just did an experiment. I copied 812KB of pixmaps from /usr/share/pixmaps to /dev/null in 0.39 seconds on a really old hard drive. How long would it have taken to load that much data through page faults while running typical executable code?
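A sketch of how one might redo that experiment programmatically (the helper name is made up; note that on a warm page cache the chunk size makes little difference, so you would drop caches first, e.g. by writing to /proc/sys/vm/drop_caches as root):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read an entire file in `chunk`-byte pieces; returns total bytes
 * read, or -1 on error.  Timing this with a 4KB chunk vs a 1MB chunk
 * on a cold cache approximates the cost of page-by-page faulting vs
 * one sequential read. */
long read_in_chunks(const char *path, size_t chunk)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    char *buf = malloc(chunk);
    if (!buf) {
        close(fd);
        return -1;
    }
    long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, chunk)) > 0)
        total += n;
    free(buf);
    close(fd);
    return n < 0 ? -1 : total;
}
```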

----------

## eccerr0r

 *alex6z wrote:*   

> 
> 
> I just did an experiment. I copied 812KB of pixmaps from /usr/share/pixmaps in 0.39 seconds to /dev/null on a really old hard drive. How long would it have taken to load that much data through page faults from running typical executable code?

 

It takes the same amount of time to fetch 4KB of data versus 4KB of text from disk.  A cache miss is a cache miss.  The performance degradation totally depends on the access pattern of the text or data.  If the text that's being executed has a small basic block that neatly fits in a few pages of ram, having them constantly evicted would have a detrimental effect on performance.

However, if the program is JMPing around or sleeping a lot, then this is different.  Unused pages can be evicted when a particular page won't be used again.

Contrast this with a program fetching and comparing data in a large piece of memory... perhaps some program trying to look for patterns in a large file that needs to check data against data.  If you evict the data that the program is accessing, it too will pay the penalty every time, and you will feel a huge degradation in performance.

All in all, people should *not* end up in this situation where it comes down to the last few bytes of RAM "cache".  If there is thrashing, it's because the programs running are not using text or data efficiently, or there is simply not enough RAM in the system.  LRU is fairly well optimised for all of these situations.

The only corollary to this is a general problem people have when writing GUIs.  GUIs and any other type of HCI (human-computer interaction) are the absolute worst workload a computer has to deal with.  While HCI has the longest delays, as humans are *slow* compared to GHz speeds, it *must* quickly respond to any event the person invokes, else the *human* thinks the computer is slow.  Fair? Not at all, but we do it anyway.  In order to make the computer "feel" faster but be *SLOWER* overall, HCI code needs to be optimised so that it has priority in RAM and gets executed with high priority, even if that means slowing down the computer from finishing the overall task.

Makes sense?

Maybe all HCI code needs to be flagged to not be evicted.  That, and HCI code should be written so that it has high priority or gets called back more frequently, so that during a heavy computation event the HCI code does not get ignored and the USER doesn't "feel" that the computer is slow...

People suck.

----------

## alex6z

 *x22 wrote:*   

> There already is a system call which gives hint to the kernel about the expected use of some memory, madvise.
> 
> (And same call for files, "fadvise".)

 

Thanks. Those control caching and read-ahead, which improves performance in situations where there is plenty of memory available. Read-ahead adjustment helps optimize loading flushed data or executable pages (little read-ahead is best for executable pages). And caching improves the performance of everything by keeping it in RAM (provided there is no shortage of memory). However, when there is a memory shortage, what I am suggesting is that data pages get flushed first, since they are the fastest to load again when needed.

----------

## alex6z

 *eccerr0r wrote:*   

> 
> 
> It takes the same amount of time to fetch 4KB of data versus 4KB of text from disk.  A cache miss is a cache miss.  The performance degradation totally depends on the access pattern of the text or data.  If the text that's being executed has a small basic block that neatly fits in a few pages of ram, having them constantly evicted would have a detrimental effect on performance.

 

Right, but data isn't accessed in single 4kb pages (unless you have a bunch of 4kb memory-mapped icon files to load or something). Executable code is what creates page faults on randomly spread out pages that take your hard drive a long time to read, seeking between them all.

 *Quote:*   

> 
> 
> However if the program is JMPing around or sleeping a lot, then this is different.  Unused pages can be evicted, a particular page won't be used again.

 

Yes. My idea would still evict text pages, but unused data pages would be evicted FIRST. Then it would evict the unused text pages. This will be much better for the user of the computer, as it takes much less time to read the data pages (which are usually read in order) back into RAM when needed than it would to load the text pages (which are loaded via seemingly randomly spread out page faults across many different shared libraries and other executables).

 *Quote:*   

> 
> 
> Contrast this with a program fetching and comparing data in a large piece of memory... perhaps some program trying to look for patterns in a large file that needs to check data versus data.  If you evict the data that the program is accessing, it too will have to pay penalty every time and you will feel a huge speed degradation in performance.

 

Yep, that's one rare case where this method would not be helpful, and would in fact be detrimental. Most desktop applications are the opposite of this.

 *Quote:*   

> 
> 
> All in all, people should *not* end up in this situation where it comes down to the last few bytes of RAM "cache".

 

Well then, nobody should have swap partitions ... if the solution is always to get more RAM  :Smile:  Why don't we just get rid of the whole paging system completely, have everything in RAM, and it can be like Windows 3.1  :Smile:  mmap() can be replaced with malloc() and a small function that copies the file into RAM.

 *Quote:*   

> 
> 
> People suck.

 

 :Smile: 

----------

## eccerr0r

 *alex6z wrote:*   

> 
> 
> Well everyone should not have swap partitions then ... if the solution is always to get more ram  Why don't we just get rid of the whole paging system completely then and have everything in ram and it can be like Windows 3.1  mmap() can be replaced with malloc() and a small function that copies the file in to ram.
> 
> 

 

I think the confusion is the difference between useful swap versus a situation where you're really out of RAM given current utilisation.

LRU should handle swap just fine, except when the machine is forced to constantly thrash pages, or is confused by the HCI issue (or some pathological case, which almost any heuristic will have).  In the pathological case you're forced to get more RAM or change your software.  And again, HCI is the only place 'slow' is perceived.  If the computer just has too much working set for the available RAM, that's the best it can do; whether it swaps out a text page or a data page doesn't matter.  LRU will tend to pick data pages if they really aren't being used often.

----------

