# gcc using only system time?

## palettentreter

Hi everyone,

I've been observing strange behavior on a KVM virtual server. The VM has 16 GB RAM and 32 CPU cores; the iron has 128 GB and 32 cores * 2 HT. Now when I put it under maximum load by compiling with MAKEOPTS="--jobs=32 --load-average=32", almost all gcc CPU time is displayed as "system" time:

```
Bootup: Wed Jun 19 15:00:59 2013   Load average: 31.70 31.78 31.84 33/484 15334
user  :            00:00:00.00   0.0%  page in :                0
nice  :            00:00:00.11   0.3%  page out:                0
system:            00:00:31.97  98.9%  page act:                0
IOwait:            00:00:00.00   0.0%  page dea:                0
hw irq:            00:00:00.00   0.0%  page flt:             2214
sw irq:            00:00:00.00   0.0%  swap in :                0
idle  :            00:00:00.20   0.6%  swap out:                0
uptime:   1m 3w 6d 02:30:45.54         context :              178
```

(1 second sample)

Note the very low number of context switches and page faults, so I'd suppose I'm not overloading anything. Normally, I'd expect the compiler to show up as nice time since portage runs at nice 20. I haven't done proper measurements on overall compile time yet, but my gut tells me I'm not getting the speed boost I'd expect from 32-fold parallelism, as opposed to, say, 8-fold on my laptop. When I run an infinite loop in bash, I get one processor at 100% user time, just as expected. Can anybody give a plausible explanation for this?
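For comparison, the user/system split can be reproduced outside the compiler with the bash `time` builtin; a pure-arithmetic loop should land almost entirely in user time, while a loop that hammers syscalls shifts toward sys (a quick sanity check, not a benchmark):

```
# pure computation in the shell: expect nearly all "user" time
time bash -c 'i=0; while [ "$i" -lt 200000 ]; do i=$((i+1)); done'

# syscall-heavy: repeated stat(2) on the same path via coreutils stat;
# the per-iteration fork/exec and stat work shows up mostly as "sys"
time bash -c 'for i in $(seq 1 500); do stat / > /dev/null; done'
```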

----------

## palettentreter

OK, further investigation now leads me to believe that the high parallelism is bottlenecking somewhere in the IO stack. When I look at which system calls the compiler processes spend their time in, I see things like this:

```
# x=(`pgrep cc1plus`); strace -c -p $x
Process 28104 attached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 76.13   36.210482        6289      5758           lstat
 22.62   10.760558        2072      5194      4744 stat
  1.00    0.477682        1059       451           open
  0.24    0.113598        3664        31           readlink
  0.00    0.000045           0       451           read
  0.00    0.000000           0        50           write
  0.00    0.000000           0       452           close
  0.00    0.000000           0       451           fstat
  0.00    0.000000           0      1261           mmap
  0.00    0.000000           0       905           munmap
  0.00    0.000000           0        83           brk
------ ----------- ----------- --------- --------- ----------------
100.00   47.562365                 15087      4744 total
```
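(As an aside: since `x` is declared as an array, `$x` expands to only its first element, so the command above traces a single cc1plus. A sketch that attaches to all of them at once, using strace's repeated `-p` option:)

```
# build a "-p PID1 -p PID2 ..." argument list from all matching PIDs
pids=$(pgrep cc1plus | sed 's/^/-p /' | tr '\n' ' ')
# attach to every cc1plus at once and aggregate the syscall counts
[ -n "$pids" ] && strace -c $pids
```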

The time spent in lstat() calls looks a little high to me. Then when I check what qemu is doing on the iron, I see that it does lots of expensive ioctl()s:

```
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 81.24  649.526022     1187433       547           ioctl
 12.12   96.918047      161261       601        34 futex
  3.32   26.559599       10519      2525           poll
  2.28   18.229465        3593      5073           select
  0.67    5.380052     1076010         5           restart_syscall
  0.14    1.095292         141      7775      2679 read
  0.14    1.086699         133      8147           write
  0.03    0.261974         179      1461           timer_gettime
  0.03    0.259766         178      1459           timer_settime
  0.02    0.171606         122      1407           rt_sigaction
  0.00    0.017663        2208         8           fdatasync
  0.00    0.012211         581        21           pwrite
  0.00    0.008879         634        14           pwritev
  0.00    0.003450         108        32        32 rt_sigtimedwait
  0.00    0.003345         105        32           rt_sigpending
------ ----------- ----------- --------- --------- ----------------
100.00  799.534070                 29107      2745 total
```

Right now I'm suspecting this may be due to the volume being accessed via iSCSI over TCP. I'll see whether I get different results when I put /var/tmp/portage on a local volume...
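A direct way to compare the two volumes would be a small stat loop against each mount point (a rough sketch; /mnt/local and /mnt/iscsi are placeholder paths, and the fork/exec overhead of coreutils `stat` is the same for both runs, so it cancels out in the comparison):

```
bench_stat() {
    # time 200 stat(2) calls against one file and print the average
    local target=$1
    local start end
    start=$(date +%s%N)
    for i in $(seq 1 200); do stat "$target" > /dev/null 2>&1; done
    end=$(date +%s%N)
    echo "$target: $(( (end - start) / 200 / 1000 )) us/call"
}

bench_stat /mnt/local/testfile   # placeholder: a file on the local volume
bench_stat /mnt/iscsi/testfile   # placeholder: a file on the iSCSI volume
```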

----------

## palettentreter

Just looked at the results with /var/tmp/portage on a local disk. The lstat() calls on the local disk aren't the least bit faster than those on the iSCSI disk, so there's probably no bottleneck there. Guess I should really do proper time trials to see whether I'm actually slowing anything down, or whether it's just weird or even natural for compilers to use mostly system time.
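A controlled time trial could look something like this (a sketch with a dummy CPU-bound workload standing in for the compile; for the real test, substitute `emerge --oneshot` of some fixed package and keep everything else identical between runs):

```
run_trial() {
    # spawn $1 background CPU-bound loops and wait for all of them,
    # mimicking "make -j$1" on a perfectly parallel workload
    local jobs=$1 i
    for i in $(seq 1 "$jobs"); do
        ( n=0; while [ "$n" -lt 100000 ]; do n=$((n+1)); done ) &
    done
    wait
}

for jobs in 1 8 32; do
    start=$(date +%s%N)
    run_trial "$jobs"
    end=$(date +%s%N)
    echo "jobs=$jobs wall=$(( (end - start) / 1000000 )) ms"
done
```

If 32 jobs aren't substantially faster than 8 in wall-clock terms, something really is serializing in the guest.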

----------

## Maitreya

With such a machine you probably have loads of memory.

Why not put /var/tmp/portage on tmpfs?
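E.g. something like this in /etc/fstab (the 12G size is just a guess that leaves headroom within the guest's 16 GB; some big packages need more):

```
# keep portage's build dir in RAM
tmpfs   /var/tmp/portage   tmpfs   size=12G,noatime   0 0
```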

----------

## palettentreter

 *Quote:*

> Why not put /var/tmp/portage on tmpfs?


Many of those expensive lstat() calls also go to include files and the like in /usr/*. So even if putting /var/tmp/portage on tmpfs helped, I wouldn't go so far as to put / on tmpfs  ;-)

Also, I don't think the storage implementation is the bottleneck here, since, as I said, switching from iSCSI to local disk didn't help at all.

----------

