# Does hyperthreading work? Some data.

## sublogic

TL;DR: hyperthreading does speed things up a bit, but you won't notice unless you look carefully.

Okay, so I have an "Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz" with 6 cores, and I wondered whether I should enable hyperthreading to get 12 logical processors.  I couldn't find any benchmarks on the web, so I decided to roll my own.  What compute-bound benchmark would be relevant to Gentoo?  Compiling stuff, of course!

So I hacked up a little script to "ebuild configure" a package of my choice, time the "ebuild compile" command, and run "ebuild clean" afterwards.  My /var/tmp/portage is on tmpfs, so that should be as compute-bound as I can get.  Also, "compile" is the most parallelizable phase of an emerge.  I chose a suitable package from the "qlop -mc" output.  Here are the results for media-gfx/gimp, five runs each without and with hyperthreading.
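Roughly, the harness looked like this (a from-memory sketch, not the exact script; the ebuild path in the example is a placeholder, adjust it for your tree and version):

```shell
#!/bin/sh
# Rough sketch of the timing harness (not the exact script).
# Pass the path to an ebuild file; only the compile phase is timed.
time_compile() {
    e=$1
    ebuild "$e" configure >/dev/null 2>&1   # untimed: unpack through configure
    t0=$(date +%s)
    ebuild "$e" compile >/dev/null 2>&1     # the measured phase
    t1=$(date +%s)
    ebuild "$e" clean >/dev/null 2>&1       # reset for the next run
    echo "real: $((t1 - t0)) s"
}
# e.g. (hypothetical version number):
# time_compile /var/db/repos/gentoo/media-gfx/gimp/gimp-2.10.32.ebuild
```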

```
# nc= number of logical cores
# nt= number of threads (-jN)
# real= seconds of elapsed time
# user, sys= user and kernel mode seconds reported across all cores
#
#nc   nt      real      user       sys
  6    6    275.04    993.10    373.40
  6    6    275.13    992.90    373.43
  6    6    276.05    991.80    375.47
  6    6    275.62    992.46    374.39
  6    6    275.24    992.35    373.89
 12   12    236.91   1543.26    602.58
 12   12    236.04   1541.12    607.83
 12   12    236.36   1540.28    604.38
 12   12    236.59   1541.95    604.91
 12   12    236.57   1539.54    607.14
```

I can post the scripts if people are interested.  The elapsed times are quite reproducible; there is no need to time multiple runs.  Hyperthreading reliably cuts 14% off the compile time.  Not large, especially since this is only part of a full emerge, but greater than zero.  (I'm not sure what to make of the "user" and "sys" columns.)

So hyperthreading might as well stay on.  Next, I kept the 12 logical processors and varied the "-jN" option in MAKEOPTS.  Still media-gfx/gimp.

```
# nc= number of logical cores
# nt= number of threads (-jN)
# real= seconds of elapsed time
# user, sys= user and kernel mode seconds reported across all cores
#
#nc   nt      real      user       sys
 12   20    235.69   1558.66    611.01
 12   19    235.72   1555.99    613.49
 12   18    235.44   1554.22    611.88
 12   17    235.18   1551.52    611.81
 12   16    235.18   1551.39    610.61
 12   15    235.73   1547.38    611.19
 12   14    235.51   1542.75    611.54
 12   13    235.65   1538.30    611.96
 12   12    236.21   1545.65    601.42
 12   11    239.66   1474.12    577.61
 12   10    245.15   1398.95    545.85
 12    9    250.26   1318.03    505.49
 12    8    256.76   1228.77    465.41
 12    7    264.12   1132.32    421.14
 12    6    273.61   1037.90    368.54
 12    5    307.80   1011.94    350.86
 12    4    359.32    981.96    340.98
 12    3    457.21    969.74    335.95
 12    2    661.29    966.37    340.39
 12    1   1284.14    956.77    351.45
```

The kernel is SMT-aware and schedules threads on physically distinct cores when possible.  For N_threads ≤ 4 the elapsed times fit the formula t = t_s + t_p/N_threads quite well.  Too bad I can't post a graph on the forums.  Here t_s is the serially constrained running time, about 47 seconds, and t_p is the parallelizable part of the single-thread runtime, about 1236 seconds.  That's 96% parallelizable!
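Plugging the fitted numbers back in shows how close the fit is (a quick awk check against the measured column above; the fit constants are my rounded values from above):

```shell
#!/bin/sh
# Predicted t(N) = t_s + t_p/N versus the measured -jN elapsed times above.
amdahl_table() {
    awk 'BEGIN {
        ts = 47; tp = 1236   # seconds, from the fit above
        split("1284.14 661.29 457.21 359.32 307.80 273.61", meas, " ")
        for (n = 1; n <= 6; n++)
            printf "-j%d  predicted %6.1f  measured %7.2f\n", n, ts + tp/n, meas[n]
    }'
}
amdahl_table
```

Up to -j4 the prediction lands within a few seconds of the measurement; at -j5 and -j6 the measured times start to run long, which is the deviation discussed next.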

Above "-j4" the times are a bit longer than predicted.  With "-j5" I get about 4.7 "effective threads", and with "-j6", 5.4.  I speculate that the threads begin to compete for the shared 12 MB L3 cache.  Each core has 32 kB L1I, 32 kB L1D and 256 kB L2 caches.  To prove or disprove this cache hypothesis I would have to take a deep dive into dev-util/oprofile, and that will have to wait.

Above "-j6" the gains are minimal because threads start sharing physical cores.  By "-j12" I get about 6.5 "effective cores".  Above "-j12" the curve is flat, with a very slight but detectable improvement over "-j12".
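The "effective threads/cores" figures come from inverting the fit, N_eff = t_p/(t - t_s).  For example:

```shell
#!/bin/sh
# Effective parallelism implied by an elapsed time t:
# N_eff = t_p / (t - t_s), with t_s = 47 s and t_p = 1236 s from the fit.
eff() { awk -v t="$1" 'BEGIN { printf "%.1f\n", 1236 / (t - 47) }'; }
eff 307.80   # -j5  -> about 4.7 effective threads
eff 236.21   # -j12 -> about 6.5 effective cores
```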

So, it's real but not big.  Also note that the "unpack", "prepare", "configure" and "qmerge" phases of an emerge typically don't parallelize.  The small hyperthreading gain is limited to the "compile" phase.

Conclusion:

- Compilation is highly parallelizable (for this package, anyway).

- Hyperthreading speeds up the compile phase a bit; might as well turn it on.

- The recommendation of "1 + the number of logical processors" for the "-j" option is correct!  But just barely; you can leave out the "1+".

Maybe I'll try "MAKEOPTS=-j6 --jobs=2" or "MAKEOPTS=-j4 --jobs=3" to parallelize the "unpack" and "configure" phases across multiple packages.  If I don't run out of RAM (32 GB).
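In make.conf terms the first variant would look something like this (a sketch; EMERGE_DEFAULT_OPTS is just one way to make --jobs stick):

```shell
# /etc/portage/make.conf (sketch)
MAKEOPTS="-j6"                   # make jobs per package
EMERGE_DEFAULT_OPTS="--jobs=2"   # emerge jobs in parallel
```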

Enjoy.

Last edited by sublogic on Fri Sep 02, 2022 12:06 am; edited 1 time in total

----------

## eccerr0r

Yes, it has been shown multiple times that SMT helps; at the very least it will reduce the amount of work the computer needs to do on context swaps*.

The first time I noticed SMT helping was running the distributed.net client on RC5-64 on a Pentium4-HT.  The speedup was quite low, but it was significant; I recall at least around 10% or so over running single-threaded.  14% is very reasonable.

However, it sort of depends on what you're trying to run at the same time.  Usually you get better overall speedup if the two processes do different things.  Gcc compiling seems like a reasonable compromise; there's a lot of different work done while compiling, but pairing it with something that does floating point might get even more overall throughput than two gcc runs.

Note that this is for SMT/HT processors.  Some multithreaded processors are not simultaneous/"Hyper" and will not get 10% but more like 5% or so.

BTW user/sys times are higher because when running 6 threads, the other 6 logical CPUs are idle.  Running 12 threads, each thread is slower, but the logical CPUs are far from idle and hence count more utilization.  Also, if you're paranoid about the recent data-leakage bugs, you might not want HT turned on.  I left HT on; it's a significant increase, and if you could make 14% more money or buy things at 14% off, who wouldn't?
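One way to read the user/sys columns: (user+sys)/real gives the average number of busy logical CPUs.  My arithmetic on the first row of each run above, not figures from the thread:

```shell
#!/bin/sh
# Average busy logical CPUs = (user + sys) / real,
# using the first row of each run in sublogic's first table.
busy() { awk -v u="$1" -v s="$2" -v r="$3" 'BEGIN { printf "%.1f\n", (u + s) / r }'; }
busy 993.10  373.40 275.04   # 6 logical CPUs: about 5.0 busy
busy 1543.26 602.58 236.91   # 12 logical CPUs: about 9.1 busy
```

So with HT on, the same work keeps about 9 of 12 logical CPUs accounted for, consistent with slower-but-busier sibling threads.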

* The real test is to run 12 threads on your HT processor with HT turned on and off in the BIOS.  There you will see the exact same software running, an apples-to-apples comparison of how much the context swaps add up...

----------

## jpsollie

the idea behind HyperThreading is:

A CPU has a pipeline to execute a certain instruction, let's say a+b = c.  This pipeline can take many steps, let's say 10 or so.

As long as it doesn't know c, it can't execute the next instruction if that one requires it, e.g. c * d = e.  So it has to wait until the result has left the pipeline (10 steps).

So, HyperThreading jumps to another code sequence (thread) where the CPU can actually do something useful, e.g. execute a+b from another thread.

This requires that the other code sequence's state is stored in a second set of registers: a second a, a second b, and so on.

This context switch is expensive; by the time the CPU starts executing a+b = c from your other thread, the original a+b = c from the first thread will already be halfway through the pipeline.

Also, compilers do a pretty good job these days of making sure the result of the 1st instruction (a+b = c) isn't required by the 2nd (c * d = e).

They reorganise the program flow so the CPU can keep calculating.  E.g. computing e as (a * d) + (b * d) = x + y requires waiting once, for both x and y to be ready, before e can be calculated, whereas a + b = c followed by c * d = e makes the CPU wait twice.

The effect of hyperthreading will always be some kind of improvement, but as the context switch is expensive, you won't get twice the CPU power.

----------

## eccerr0r

I wouldn't conclude that context swapping makes SMT inefficient; rather the opposite, it makes any multithreaded processor more efficient when running multiple (especially coarse-grained) threads in parallel.  Caches (L2/L3, not L0/L1) make a huge difference when dealing with context swaps, because you dump a lot of state when a context swap occurs.

Ultimately you're still sharing resources between sibling threads, and that's why you will never get 2x performance.  Only a complete set of resources gets you 2x performance, and that would be another complete core.

Recently I got pissed off when someone's ad for their computer claimed their Intel i3 was a 4-core processor.  *facepalm*

BTW I'd say MAKEOPTS=-j6 with emerge --jobs=2 would give better utilization than MAKEOPTS=-j4 with emerge --jobs=3, by far.  I see portage go single-threaded often, and that kills throughput.

----------

## sublogic

 *eccerr0r wrote:*   

> BTW I'd say MAKEOPTS=-j6 emerge --jobs=2 would be the best utilization over MAKEOPTS=-j4 emerge --jobs=3 by far.  I see portage go single threaded often and that kills throughput.

 Isn't that backward?  "-j4 --jobs=3" would run 3 threads at a minimum and 3, 6, 9 or 12 threads in general; "-j6 --jobs=2" would run 2 threads at a minimum and 2, 7 or 12 in general.  Hmm, how about "-j2 --jobs=6"?

As for portage going single-threaded often, that's Amdahl's law, isn't it?  When forced to go single-threaded it also slows down, giving you time to notice.

----------

## grknight

 *sublogic wrote:*   

>  *eccerr0r wrote:*   BTW I'd say MAKEOPTS=-j6 emerge --jobs=2 would be the best utilization over MAKEOPTS=-j4 emerge --jobs=3 by far.  I see portage go single threaded often and that kills throughput. Isn't that backward?  "-j4 --jobs=3" would run 3 threads at a minimum and 3, 6, 9 or 12 threads in general; "-j6 --jobs=2" would run 2 threads at a minimum and 2, 7 or 12 in general.  Hmm, how about "-j2 --jobs=6"?

 

MAKEOPTS depends on the number of emerge jobs running, so an emerge --jobs=3 would produce up to three parallel emerge jobs.

Then MAKEOPTS, using your -j4 example, can start make (in the compile phase) within each emerge job, so it would be 1-4 make threads for one emerge, 2-8 for two, up to 3-12 make threads for three emerge operations (assuming all are in the same phase).
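So the worst case is simply the product of the two knobs, reached only when every emerge job is in its compile phase:

```shell
#!/bin/sh
# Upper bound on concurrent make threads = emerge jobs x MAKEOPTS -jN.
max_threads() { echo $(( $1 * $2 )); }
max_threads 3 4    # --jobs=3, MAKEOPTS=-j4: up to 12
max_threads 2 6    # --jobs=2, MAKEOPTS=-j6: up to 12
```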

----------

## eccerr0r

I don't know what rules Portage uses, but my attempts at building rust and some other C/C++ package (through distcc) were all met with futility.  It would build only rust/cbindgen by itself and ignore the other packages that could have been submitted at the same time, because they were going to use distcc anyway.

Perhaps it's just due to the scheduling pulses, and it cannot start jobs while one package is running; do all slots need to be open before it will start new jobs?  Not sure.

----------

## logrusx

 *eccerr0r wrote:*   

> 
> 
> BTW I'd say MAKEOPTS=-j6 emerge --jobs=2 would be the best utilization over MAKEOPTS=-j4 emerge --jobs=3 by far.  I see portage go single threaded often and that kills throughput.

 

Portage won't go single-threaded.  It might go single-jobbed when everything else depends on the package being emerged.  The package build itself can go single-threaded for the same reason, when everything else depends on what's currently being compiled, or when the build scripts just don't support a multithreaded build.

 *eccerr0r wrote:*   

> * The real test is to run 12 threads on your HT processor with HT turned on and off in BIOS.

 

That's not necessary. You can just turn off smt: 

```
echo off > /sys/devices/system/cpu/smt/control
```

or with smt=1 or nosmt kernel parameters.
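A small wrapper to check the current state without touching anything (read-only, so no root needed; assumes the sysfs layout of recent kernels):

```shell
#!/bin/sh
# Report whether SMT is currently active, via the same sysfs node.
smt_state() {
    f=/sys/devices/system/cpu/smt/active
    if [ -r "$f" ] && [ "$(cat "$f")" = "1" ]; then
        echo "SMT on"
    elif [ -r "$f" ]; then
        echo "SMT off"
    else
        echo "SMT unknown"   # node absent: old kernel or no SMT support
    fi
}
smt_state
```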

Furthermore, HT works.  Anyone remember when we needed to patch the kernel with the Con Kolivas patches to achieve decent responsiveness?  Now it's not necessary, primarily because of the large number of hardware threads available.

Regards,

Georgi

----------

## eccerr0r

 *logrusx wrote:*   

> Portage won't go single threaded. It might go single jobbed when everything else depends on the package being emerged. The package build itself can go single threaded for the same reason  - then everything else is dependent on what's currently being compiled or when the build scripts just don't support multi threaded build.

 

Job/Thread, doesn't matter: only one line of execution in portage was running at the time, and that's the rust job, which itself has multiple threads as it builds.  I've been able to manually trigger another job while rust is building, so something is clearly up with portage not wanting to run something in parallel for whatever reason.  Now that I think of it, perhaps it hit the load-average limit and that's why it didn't submit more jobs, which is a pity because a distcc job isn't as taxing on the building host as a regular gcc job.

Then again, there are other things within rust that would be nice to control, namely the llvm portion and the rust portion.  The llvm portion can be distcc'ed but the rust portion cannot.  Likewise with gcc: stage 1 technically can be distcc'ed...

--edit--

actually no, it could not have hit the load-average limit, as I only supplied --jobs=2 with no load-average limits except in MAKEOPTS, yet it still ran only rust and nothing in the other slot.  Back to square one as to why it didn't fill the slot with something; not everything depends on rust...

----------

## logrusx

 *eccerr0r wrote:*   

> 
> 
> Job/Thread, doesn't matter

 

Don't mix meanings.  When I said single-jobbed, I meant it in the context of the number of emerge jobs.

This is single jobbed:  *eccerr0r wrote:*   

> only one line of execution in portage is running

 

This is multi threaded:   *eccerr0r wrote:*   

> which itself will have multiple threads as it builds

 

 *eccerr0r wrote:*   

> perhaps it hit load average limit and that's why it didn't submit more jobs

 

I already explained it to you: everything in the queue (the depgraph) was in a branch (or branches) dependent on rust being built.  This is when portage goes single-jobbed even when run with the number of emerge jobs explicitly set (it defaults to 1).  Essentially, setting the number of jobs allows portage to follow multiple independent paths in the depgraph.  When those paths hit a common dependency, they stop and wait unless there is another available path in the depgraph independent of (at least) that particular dependency.

It might look like this:

```
A   C   E   F
 \ /    |  /
  B     | /
   \    |/
    \   |
     \  |
      \ |
       \|
        D
```

A, C, E and F may build in parallel, but D won't build until B is built.  Add a dependency of E on B, for example because of a USE flag, and E won't be built until B is, leaving only A, C and F to build in parallel.  Add another one from F and you get the situation where portage goes single-jobbed.
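coreutils' tsort can replay that graph; each input pair means "left must be merged before right", and any items not ordered relative to each other could go in parallel:

```shell
#!/bin/sh
# The depgraph above as "before after" pairs: A and C feed B; B, E, F feed D.
printf '%s\n' "A B" "C B" "B D" "E D" "F D" | tsort
# tsort prints one valid merge order; A, C, E and F are unordered among
# themselves (parallelizable), while B must precede D.
```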

Regards,

Georgi

----------

## eccerr0r

Again, it doesn't matter what you call it in this respect: one job of portage runs as its own thread.  Each program has its own threads, and don't forget this is not qualified as lightweight or coarse threading.  There is no mixing of meanings, only your abuse of terminology in how *you* want each term to be understood.

And there is no dependency issue here.  If everything depended on rust, fine, rust could block everything else from completing.  However, there are people who have built systems just fine with it in package.mask.

In any case, the mere existence of packages that cause portage to let only one of its jobs run means that each job needs to spawn enough threads to occupy all processors, or the machine will be underutilized.

----------

## logrusx

Do you claim this is a bug in portage? If so you may as well file it.

Regards,

Georgi

----------

## eccerr0r

I don't understand the complete details of this issue, just that I've noticed it running only one merge for extended periods of time, specifically while building large packages.

----------

## logrusx

I've been having the same concerns for a while.  Some people more knowledgeable than you and me explained what I explained above.  If you still think this is an issue, you'd better investigate and file a bug.  You may as well learn something new.  I have nothing more to add about that.

Nowadays, when I see a lot of python, perl or haskell packages scheduled for merge, I increase the emerge jobs to the number of logical cores and decrease make jobs to 1; when they finish building I revert to the defaults, 1 emerge job and #ofcores make threads.

If I'm not around the computer at that time, I don't bother with the above.

p.s. When you ask questions on forums, you may well get some answers you don't like.  You can get corrected too.  If you resist that, you're not learning anything new and, in essence, not doing anything useful.

Regards,

Georgi

----------

## eccerr0r

And likewise people make "corrective" posts that are actually wrong.  Believe me, I've seen a lot of shit posts over the years that were asking to be re-shitted, as they don't show anything new and only try to make the poster look like they know something.

In any case, there are packages that I wish had better job control, specifically for distcc use.  I don't know what percentage of the population uses distcc, plus distcc often fails and ends up hurting build times.  But distcc is offtopic; the main concern here is that portage's parallel mode by itself does not provide enough tasks to keep all cores or SMT threads from going idle, and if they do go idle, whether from being starved or from waiting on swaps, SMT won't give you that extra 14%.

There are a lot of places where parallel optimizations would be nice when running portage -

- Calculating dependencies.  This has long been an ugly single-threaded task that blocks everything else.

- There are some cleanup tasks that portage does in the install and subsequent phases that have been taking longer and longer to run, most noticeable on slow computers, and these are also single-threaded.  Since this is per job, it can be hidden by running multiple jobs in parallel on a multi-core machine.

- And the aforementioned refusal to start additional jobs where it seems like it should be able to.  If there really is a bottleneck, then there should be a way to reorganize the jobs so that more can be done in parallel: one, to hide the single-thread issues, and two, with distcc help, to utilize cores on other machines.

Yes, portage is doing a very, very dirty job (I call all NP and O(n^2)-or-worse tasks "dirty") and I'm glad it's doing what it's doing.  But people can always ask for more...

----------

## Naib

not the best test to prove any benefits of hyperthreading...

if you run x number of tasks that all require similar capabilities from a processor (i.e. compiling), then there won't be any benefit.  Hyperthreading is a logical trick to utilise otherwise unused SIMD.  So if you run comparable processes, how can it improve performance?  Now if you were to run an orthogonal process (say some multimedia), that's when it starts to benefit, as the instructions for multimedia are typically not used for compiling.

----------

## sublogic

 *Naib wrote:*   

> not the best test to prove any benefits of hyperthreading...

 Many years ago, at work, we tested hardware intended for floating-point-intensive computations; hyperthreading was a slight negative.  Would current hardware perform differently?  We'll never know.  That's water under the bridge now.

Here I wasn't trying to prove anything, just to decide whether to enable simultaneous multithreading (SMT) or not.  What I've read over the years (on the Internet, so it must be true, right?) is that, especially with SMT, you can't rely on published benchmarks and should test the software you intend to run.  No two Gentoo installs are the same.  What consumes a bunch of CPU cycles on all of them?  emerge!

I restricted the test to the "ebuild compile" phase to increase the signal-to-noise ratio.

 *Naib wrote:*   

> if you run x number of tasks that all require similar capabilities from a process (ie compiling) then there won't be any benefits.

 Okay.  Show me.

 *Naib wrote:*   

> Hyperthreading is a logical trick to utilise otherwise unused SIMD. So if you run comparable processes how can it improve performance?

 Uh, SIMD is "single instruction, multiple data", the ultimate in "comparable".

 *Naib wrote:*   

>   now if you were to run an orthogonal process (say some multimedia ) thats when it starts to benefit as the instructions for multimedia are typically not used for compiling.

 Okay.  Show me.

----------

## eccerr0r

I believe SIMD and possibly regular floating-point operations go through the same pipe on modern machines, so those won't fly together.  However, SIMD and standard "SISD" ALU ops do go through different pipes and can run together.

I think the main thing being ignored is that gcc does a lot of different things.  Yes, mostly integer "SISD"-type instructions, branches, and a lot of load/store operations that give the sibling thread an opportunity to make forward progress on a cache miss.

I would think that carefully written, heavy FPU code will not benefit from SMT because it is already well optimized to take out any gaps.  Though, as said, if you run orthogonal applications that don't use FPU code, you may well utilize the CPU better and overall throughput would be higher, but only if you had to run that other application anyway.

For random desktop computers that run random unoptimized code, it's likely you will hit orthogonality and get a bit better instruction throughput than just running one code stream at a time.

----------

## sublogic

 *eccerr0r wrote:*   

> There are a lot of places where parallel optimizations would be nice when running portage -
> 
> - Calculating dependencies.  This has long been an ugly single thread task that blocks all else

 Yeah, that bugs me too, but it looks like a hard problem.  Portage has to build a dependency graph, complicated by the swarm of USE flags and incompatibilities between multiple versions of a given package, then detect cycles and build an ordered list of packages to rebuild (or a tree, if --jobs=more_than_one).  A duckduckgo search on "parallel graph algorithms" pulls up some literature, but it still looks hard.  A rewrite to a multithreaded algorithm would require solid testing to make sure nothing broke.  Then there is the issue that portage is written in python, and python allows only one thread at a time to execute bytecode (the global interpreter lock, or GIL).  So you have to use subprocesses instead of threads and hope the pickle/unpickle penalty doesn't kill you.  Or write extensions in C that release the GIL (gasp).

 *eccerr0r wrote:*   

>  - There is some cleanup tasks that portage is doing in the install and subsequent phases that have been taking longer and longer to run - most noticeable on slow computers, and these are also single threaded.  Since this is per job, this can be hidden by running multiple jobs in parallel if you have a multi core machine.

 Only if there are no dependencies between the cleanup tasks and subsequent jobs.  Basically, break jobs up at a finer granularity so that subtasks can be interleaved.  That seems more doable, but I have spent zero minutes looking at the portage code and have no clue how big a refactor that would be.

----------

## eccerr0r

TBH I'm thinking portage is doing something during the install phase that's either unnecessary or suboptimal, but nobody has figured out what it's doing that's so "slow".  To me it seems to be the main reason why it takes 30 seconds or more to install a virtual package on a slow machine.

I suppose it's up to me to figure out what it's doing...

----------

## logrusx

There is no parallel algorithm for what you're talking about.  There just isn't, and that's been discussed endless times.  It just doesn't work with searching; somebody must actually do the job.  If you discover such an algorithm you might well be eligible for huge prizes.

Regarding the increased time to install a virtual, this might actually be a real issue, and there was a thread about it, but it's most probably buried deep.

Actually, it was created by you, eccerr0r, and you didn't follow up.

Regards,

Georgi

----------

## Zucca

So I wonder if just passing nosmt mitigations=off to kernel might keep one "safe enough".

Aren't all these processor security flaws found in recent years due to SMT/HT leaking data in some way?

@sublogic: More tests for you to run? ;)

----------

## eccerr0r

SMT "leaks" data the same way that spectre/meltdown "leaks" data.

If you trust your software, it's fine to unleash and let it run to its full potential.

I just built a new kernel for my multithreaded Atom CPU.  I disabled all the kernel spectre/meltdown-related features; it took 20 hours to compile a 5.15.59 kernel with -j2.  Not sure if I'm seeing things, but it seems a tad faster without all the mitigations it doesn't need.

TBH the SMT mitigations may still be needed on the Atom even if the other out-of-order stuff isn't?  Hmm.

----------

## sublogic

 *Zucca wrote:*   

> So I wonder if just passing nosmt mitigations=off to kernel might keep one "safe enough".
> 
> Aren't all these processor security flaws found in recent years all because some way SMT/HT leaks data?
> 
> @sublogic: More tests for you to run? 

 Like I said, I'll post the scripts if there is interest.  ;) ;)

----------

## eccerr0r

Currently doing my belated quarterly updates...

```
>>> Emerging (270 of 839) dev-db/mariadb-connector-c-3.2.5::gentoo
>>> Emerging (271 of 839) dev-lang/rust-1.62.1::gentoo
>>> Installing (270 of 839) dev-db/mariadb-connector-c-3.2.5::gentoo
>>> Jobs: 269 of 839 complete, 1 running            Load avg: 11.0, 10.4, 8.4
```

Dangit, still just running one job, and it didn't start another after mariadb-connector-c finished.  At least this is my fast machine, though it would be nice if my others could assist with forward progress on other packages at the same time.  Still an hour out before rust finishes.

Looking at my merge logs, the packages that usually install immediately after dev-lang/rust are virtual/rust, cargo, and/or cbindgen, which is expected.  But after that, as a sample:

- app-admin/sudo

- x11-misc/shared-mime-info

- x11-base/xorg-server

- media-libs/gexiv2

- app-emulation/winetricks

Note these were from different sessions in different years (quarters), not all during one merge session.  As far as I know, none of these depend on rust and could have run at the same time as rust.  If it were only firefox or librsvg I'd agree they should wait until rust finishes, but winetricks?

--- new observations ---

My current observation is that portage with --jobs will launch however many jobs you set in "--jobs" in parallel and won't launch more until all those jobs complete... if this is the case, then that's why I'm seeing a lot of "dead" time here.  Can anyone confirm?

(Of course, a full scheduler should notice one job come back and send off another in that slot, but perhaps portage isn't smart enough to know which ones are okay to submit preemptively; one could be submitted out of order, and that's why it's holding off.  But it should have known which ones were okay to submit in parallel in the first place.  Perhaps it should have computed parallel jobs with --jobs (infinite) and used that data to submit jobs as they complete?)

--- new counterexample ---

I just noticed webkit-gtk running, another long-running build, and portage is submitting jobs in parallel with webkit-gtk.  So something is not quite understood here...

----------

## Amity88

@eccerr0r 

I doubt that portage waits for the first set to finish before starting the next set of parallel jobs, because I've seen the number go up and down before the set finishes.

@sublogic

Thank you so much for doing this!  This is very useful data.

----------

