# Massive multi-threading - limits ? [RESOLVED]

## clampinus

After buying a new box (Core i7, 12 GB RAM) and installing 64-bit Gentoo, I tried to evaluate how many threads I could run on it simultaneously. So I wrote a little C program that launches thousands of threads (listing below).

To my disappointment, the system becomes unable to allocate any more memory after about 33000 threads are launched (yes, that's about half of 65536!):

```
~ $ ps
-bash: fork: Cannot allocate memory
```

I am far from using up my RAM and CPU (the threads mostly sleep and run a logarithm calculation once in a while). On the other hand, running pmap on the process shows that about two virtual memory maps are allocated per thread.

```
~ $ pmap 1977 | head
1977:   ./_ThreadingMini
0000000000400000      4K r-x--  /home/bulle/workspace/general/c-src/test/_ThreadingMini
0000000000600000      4K r----  /home/bulle/workspace/general/c-src/test/_ThreadingMini
0000000000601000      4K rw---  /home/bulle/workspace/general/c-src/test/_ThreadingMini
0000000000602000   9108K rw---    [ anon ]
00007ef99a2c2000      4K -----    [ anon ]
00007ef99a2c3000   8192K rw---    [ anon ]
00007ef99aac3000      4K -----    [ anon ]
00007ef99aac4000   8192K rw---    [ anon ]
00007ef99b2c4000      4K -----    [ anon ]
~ $ pmap 1977 | wc -l
64028
```

I tried changing vm.max_map_count to something big (2 million) but the problem remains. Some sources report that this variable is a per-process limit anyway, whereas my problem seems to be kernel-wide, as starting any new process fails once the threading process is up.

I also tried reducing the stack allocated to each thread (to 16 kB, down from the 8 MB glibc allocates by default), but it did not solve anything.

Is there any kernel parameter (compile-time or dynamic) that defines how many memory maps can be allocated kernel-wide?

Here is my threading C program:

```
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#include <math.h>

#define NbThreads 33000
#define NbTicks   100000000

void *my_thread(void *arg)
{ long i;
  volatile double x;

  for ( i = 0 ; i < NbTicks ; i ++ )
  { x = log(i * 1.0);
    usleep (2000000);
  }
  pthread_exit (0);
}

int main(void)
{ pthread_t threads[NbThreads];
  long j;
  void *ret;

  for( j = 1 ; j <= sizeof(threads) / sizeof(threads[0]) ; j ++ )
  { if(pthread_create (&threads[j - 1], NULL, my_thread, (void *) j) < 0)
    { fprintf(stderr, "pthread_create error for thread %ld\n", j);
      exit (1);
    }
    if(j % 1000 == 0)
      printf("%ld launched\n", j);
  }

  for( j = 1 ; j <= sizeof(threads) / sizeof(threads[0]) ; j ++ )
    (void)pthread_join (threads[j - 1], &ret);
  return 0;
}
```

Last edited by clampinus on Sun Dec 20, 2009 8:32 am; edited 1 time in total

----------

## fikiz

I'm not an expert in this area, but it seems to me that the maximum PID number is 32767, and a PID is assigned to each thread (not only to each process). Maybe I'm wrong.
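That guess is easy to check; on Linux every thread does get its own PID, and the ceiling is exposed in /proc (the default on most kernels of this era is 32768):

```shell
# each Linux thread consumes a PID; this shows the current ceiling
cat /proc/sys/kernel/pid_max
```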

----------

## theotherjoe

clampinus, the pthread_create() man page states that the function returns a non-zero value in case of error.

I tried your example with the corrected if-condition:

```
15000 launched
16000 launched
pthread_create error 11 for thread 16351
```

which is roughly in line with ulimit:

```
~ $ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 16383
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 16383
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
```

I was able to play around with /proc/sys/kernel/threads-max and ulimit -u to increase the number of threads when running as root, but I haven't found a way to extend max user processes for a normal user.

----------

## Mad Merlin

As mentioned already, threads get a PID just like processes do. IIRC there's a compile-time option in the kernel to increase the maximum number of PIDs, which would probably help here.

----------

## clampinus

Sorry for the late answer, the notification got caught in the anti-spam filter!

Thank you for this helpful information, it certainly makes sense. I will try to have a look at it this weekend.

----------

## clampinus

By setting /proc/sys/kernel/pid_max to 4000000 and /proc/sys/vm/max_map_count to 10000000 and running as root, I could push the number of threads to 212733 (around twice 106496, which seems to be a usual limit on my kernel, found through ulimit -a). I guess I have hit another limit there; I will try to find out which.

I have also fixed the error reporting when launching threads, much more convenient. Thank you theotherjoe.

----------

## clampinus

The next limit is the maximum number of threads supported by the kernel (from my tests it looks system-wide rather than per-process, but that would need confirmation). So I have set /proc/sys/kernel/threads-max to 10000000.

Then I think I hit a limit imposed by physical memory (the memory usage meter is really up there...) once I reach around 470000 threads. This is quite plausible, as each thread allocates at least one or two segments of 8 kB of stack. I also launched a process that allocates 8 GB of RAM (with calloc) before my threading process, and then I could only reach about 100000 threads, which tends to confirm the RAM limitation. So I guess I am done toying with threads. To sum up, the changes I made were:

/proc/sys/kernel/pid_max: 4000000
/proc/sys/kernel/threads-max: 10000000
/proc/sys/vm/max_map_count: 10000000

That is how I reached a maximum of around 470000 (sleeping) threads with my configuration, probably because I hit the physical RAM limit.
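For anyone repeating this, the same settings can be applied at runtime with sysctl as root (the dotted names mirror the /proc/sys paths; note these do not persist across a reboot unless added to /etc/sysctl.conf):

```shell
# run as root; values from the experiment above
sysctl -w kernel.pid_max=4000000
sysctl -w kernel.threads-max=10000000
sysctl -w vm.max_map_count=10000000

# or equivalently, via /proc:
echo 4000000  > /proc/sys/kernel/pid_max
echo 10000000 > /proc/sys/kernel/threads-max
echo 10000000 > /proc/sys/vm/max_map_count
```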

I will try to edit the title as RESOLVED.

----------

## ArmorSuit

This is all somewhat academic, since you hit a practical limit as soon as context switching becomes significant compared to the time slice each thread gets for actual work. Depending on the type of work and the I/O wait per thread, the "sweet spot" can be anywhere from 2 threads per core up to dozens, maybe hundreds per core if you have hyperthreading and practically no I/O wait, but definitely not thousands or tens of thousands.

That is the very reason why threaded servers (one thread per connection) fall short of epoll- or kqueue-based solutions. Such servers have high I/O wait, so threads mostly sit waiting for data, and even then they choke once hundreds of threads/connections per CPU are engaged. If they were CPU-bound, the effective number per core would be even lower.

----------

