# Gentoo much slower than identical hardware with Ubuntu

## hertfelder

Dear all,

I have found out that my PC running 4.4.6-gentoo is considerably slower at running our CFD codes than that of a colleague who has exactly the same hardware but Ubuntu as the OS (mine is slower by ~50 %). The slow-down is most dramatic when a large number of grid cells is used, i.e. when the arrays are very large and a lot of memory access is performed. I verified that there are no hardware issues on my machine by booting from an Ubuntu live CD and running the test there; I got the same fast results as my colleague.

I suspect that the problem is due to my kernel config. However, I don't know where to start looking and I also don't know what I should be looking for. Maybe somebody can help me here.

kernel config: https://bpaste.net/show/d2e8adf56bed

lshw: https://bpaste.net/show/ceea4a8e0156

emerge --info: https://bpaste.net/show/e6843577ddfa

Thanks,

Marius

----------

## NeddySeagoon

hertfelder,

Welcome to Gentoo.  This looks a bit odd.  It's a list of all the CPU frequency governors your kernel knows about.

```
#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_COMMON=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_GOV_CONSERVATIVE is not set
```

The

```
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
```

says that, by default, the kernel relies on userspace to control the CPU frequency, so the question is what userspace is doing, if anything.

Run the Gentoo/Ubuntu compare again and look at the CPU clock speed in both during the test.

Read the kernel's context-sensitive help on the various governors.  You can switch CPU governors by poking about in /sys (under /sys/devices/system/cpu/cpu*/cpufreq/).
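For example, a quick way to check and switch governors from a shell (sysfs paths as documented for the kernel's cpufreq interface; the files only exist when cpufreq is active on your system):

```shell
# Show the current governor and clock speed of cpu0 (paths exist only when cpufreq is active)
d=/sys/devices/system/cpu/cpu0/cpufreq
cat "$d/scaling_governor" "$d/scaling_cur_freq" 2>/dev/null || true

# Switch every core to the performance governor (run this as root)
for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
    if [ -w "$g" ]; then echo performance > "$g"; fi
done
```

Watching scaling_cur_freq while the CFD run is going tells you whether the CPU is actually being clocked up.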

----------

## chithanh

Modern Intel CPUs perform better with intel_pstate instead of the frequency scaling drivers. Choose the performance governor and enable CONFIG_X86_INTEL_PSTATE.
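For reference, the relevant part of the .config would then look something like this (a sketch, with the option names as they appear in the kernel's Kconfig):

```
CONFIG_X86_INTEL_PSTATE=y
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
# CONFIG_CPU_FREQ_GOV_USERSPACE is not set
```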

----------

## 1clue

CONFIG_CPU_FREQ_GOV_ONDEMAND seems to work fine for me on an 8-core Atom C2758. And CONFIG_X86_INTEL_PSTATE, of course.

----------

## YukiteruAmano

Activate CONFIG_X86_INTEL_PSTATE in your kernel.

With newer Intel CPUs, intel_pstate is the best scaling driver; using the userspace governor, you get a lot of performance problems.

Switch off the other scaling drivers.

----------

## hertfelder

Thanks a lot for your replies!

I see your point concerning the CPU frequency control. I messed that up when configuring the kernel for the first time.

There is no need for userspace control in my case. Therefore, as advised, I switched to intel_pstate and switched off the other drivers. This indeed improves the performance of the CPU; my test cases are running faster by ~10 % now.

However, it is still slower by ~50 % compared to the identical setup with Ubuntu. I have the feeling this is somehow memory-access related, since the CPU itself is now as fast as on the other system (e.g. in the sysbench CPU test). So, do you guys have any advice or an idea where to look for additional misconfigured kernel options?

Thanks!

----------

## chithanh

You are using INTEL_PSTATE with the performance governor, yes?

If you are on a NUMA system, then you may want to enable CONFIG_NUMA_BALANCING. Then there is transparent hugepages support, but that will benefit only certain use cases and/or software specially written for it (worth a try in your case, I guess).
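If you want to experiment with transparent hugepages before rebuilding the kernel, there is a runtime knob documented in the kernel's Documentation/vm/transhuge.txt (paths assume CONFIG_TRANSPARENT_HUGEPAGE is already enabled):

```shell
# Show the current THP mode; the active one is bracketed, e.g. "[always] madvise never"
f=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$f" ]; then cat "$f"; fi

# Enable THP system-wide (as root)
if [ -w "$f" ]; then echo always > "$f"; fi

# See how much anonymous memory is currently backed by huge pages
grep AnonHugePages /proc/meminfo || true
```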

I don't see anything else obviously wrong with the kernel config. Some things are quite unusual, like the missing CONFIG_FHANDLE, but they should not impact performance. You could disable CONFIG_DEBUG_KERNEL or CONFIG_KPROBES, but that would give only small performance gains.

What you could do is start from the Ubuntu kernel config and work your way towards your current one, checking at which point the performance decreases.

----------

## hertfelder

Thanks chithanh, the transparent hugepages support did the trick  :Smile:  Now my machine is faster by 10-15 % compared to the identical Ubuntu machine. Do you have a link or some advice on where I can read up on this hugepages stuff? I am quite sure that none of the software I used for testing intentionally exploits this feature, so I would be interested to know why it can benefit so much from hugepages support.

For reference, I am now using INTEL_PSTATE with the performance governor, enabled NUMA_BALANCING and FHANDLE, and disabled DEBUG_KERNEL and CONFIG_KPROBES.

Thanks again!

----------

## chithanh

Here is a LWN article about transparent hugepages: https://lwn.net/Articles/423584/

But I don't have any deep knowledge on the subject, so probably you would have to tap someone else's expertise if you have questions.  :Smile: 

----------

## pilla

There are two different kinds of memory addresses: logical and physical.

Logical addresses are what processes see. When paging is used as the memory-management technique, the logical address space is just a run of contiguous addresses divided into pages (usually all of the same size). Any logical address can thus be split into two parts, a page number and an offset. This is done transparently to the process, so you don't have to worry about it.

Physical addresses are used to address the physical RAM. With paging, the RAM is divided into frames of the same size as the logical pages. The address is likewise split into two parts, a frame number and an offset.

Now we have to map logical addresses to physical addresses in order to actually access memory. That means mapping pages onto frames. A process's pages may be mapped to any frame anywhere in physical memory. Hence, a logically contiguous process may have its pages scattered all around memory.

Notice that the offset needs no mapping. That part of the address is simply copied from the logical to the physical address, forming its least significant bits.

How is that mapping done? The implementation details vary, but basically there is a page table (usually one per process) in which an entry contains the frame id for a given page (among other bits that don't matter here). This table resides in memory, of course, which makes it very slow to consult when you consider that it must be accessed on top of every memory access. So if your memory takes 100 ns to access, we are talking about doubling that time to account for the extra access to the page table every time a load or store is made, or even when fetching an instruction (if it isn't already in cache). And this isn't even the worst case, as operating systems use multi-level page tables that require extra accesses.
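As a back-of-the-envelope illustration (all numbers made up for the example): with a 100 ns memory access, a 1 ns TLB lookup and a 98 % TLB hit rate, the effective access time for a single-level page table works out as:

```shell
awk 'BEGIN {
    mem = 100; tlb = 1; hit = 0.98                           # illustrative numbers, in ns
    eat = hit * (tlb + mem) + (1 - hit) * (tlb + 2 * mem)    # a TLB miss pays an extra page-table access
    printf "effective access time: %.2f ns\n", eat           # prints 103.00 ns
}'
```

So even a 2 % miss rate already costs you 3 % on every single memory access, before multi-level tables make it worse.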

Then there is a thing called the TLB, the Translation Look-aside Buffer: basically a small, fully associative cache for page-table entries. It resides inside your processor, so it is fast, but it is also small. Every time you miss in the TLB, the memory management unit has to go and fetch the translation from the page table.

Going back to pages and frames: the smaller they are, the less memory is wasted by processes not using a page's entire capacity. But smaller pages mean more entries in the page table, and more TLB misses for many programs. Hence, increasing the page size reduces the number of TLB misses and thus improves performance. If you have enough memory to afford the waste that can occur when pages are too big for some processes, it may be a good option for your system.
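To see why this matters for large CFD arrays, compare the page counts (and hence the number of translations the TLB has to cover) for a hypothetical 1 GiB array under the two common x86 page sizes:

```shell
array=$((1 << 30))                                      # a 1 GiB array
echo "4 KiB pages: $(( array / (4 * 1024) ))"           # 262144 pages to map
echo "2 MiB pages: $(( array / (2 * 1024 * 1024) ))"    # 512 pages to map
```

With 2 MiB huge pages the whole array can fit in the TLB at once; with 4 KiB pages it cannot come close, so a streaming sweep over the array keeps missing.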

There are other issues, such as having to move pages to and from disk. Large pages with little useful content still have to be read and written whole by the virtual memory system, which may be inefficient in those cases.

----------

## hertfelder

Thanks for the link, chithanh, and for the explanation, pilla! I think I am starting to get a feel for it.

This might be quite interesting for our applications; I will have a look to see if I can play around with the options (page size, ...) a little.

Marius

----------

## pilla

 *hertfelder wrote:*   

> Thanks for the link, chithanh, and for the explanation, pilla! I think I am starting to get a feel for it.
> 
> This might be quite interesting for our applications; I will have a look to see if I can play around with the options (page size, ...) a little.
> ...

 

Any good operating-systems textbook will give you many good insights into the basics. You can skip directly to the "Memory Management" and "Virtual Memory" chapters for the memory specifics you are interested in. I favour the books authored by Silberschatz.

Patterson & Hennessy's Computer Organization and Design: The Hardware/Software Interface is a good read if you want to know more about memories and caches, but it goes quite deep into computer architecture. Probably not interesting unless you have had some introductory computer-architecture courses first.

----------

## hertfelder

Thanks for the suggestions, pilla!

I checked our library's catalogue and found Operating System Concepts by Silberschatz. This looks like a good read!

----------

## pilla

 *hertfelder wrote:*   

> Thanks for the suggestions, pilla!
> 
> I checked our library's catalogue and found Operating System Concepts by Silberschatz. This looks like a good read!

 

You are welcome. If you have questions, I might be able to help; I give lectures in Operating Systems.

----------

