# Testing your CFLAGS

## _never_

Testing your CFLAGS

I have found many introductions to CFLAGS and suggestions on how to set them up for various architectures. What I'm providing here is a way to actually test your CFLAGS. For this purpose I have written a small test program in C. Don't expect it to make any sense; it's just a test program written in a way that compiler flags have a noticeable influence.

1. Preparing

First create some test directory and change to it:

```
mkdir ~/cftest

cd ~/cftest
```

Now you need to save the source code of the test program. For that type:

```
cat > cftest.c
```

Then paste the following into the terminal:

```
#include <math.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS (64)
#define USIZE (sizeof(unsigned) * 8)

struct teststruc {
  double a;
  float b;
  unsigned long long c;
  unsigned d;
};

int func2(char *str, size_t strl) {
  int i, k;
  short int f = 1;

  for (i = 0; i < 16; i++) {
    f += str[i % strl];
  }
  for (i = 0; i < strl; i++) {
    for (k = i; k < 64; k++) {
      str[k % strl]++;
    }
    for (k = 23; k < 35; k++) {
      str[(k * 3) % strl]--;
    }
    for (k = 14; k < 29; k++) {
      str[k % strl]--;
    }
    f += str[i] * str[(i + 1) % strl];
  }
  return (f >= 0 ? f % 13 : -f % 13);
}

double func1(double res, double *ref) {
  struct teststruc t;
  double tmp1;
  unsigned long long tmp2;
  unsigned rv;
  int i, k;

  t.a = 0.6532;
  t.b = 1;
  t.c = 13;
  t.d = time(0);
  for (i = 0; i < 5273; i++) {
    for (k = 0; k < 16; k++) {
      t.a *= 1.1212;
    }
    t.a += t.b * *ref;
    t.b *= res * 52.3734;
    t.c += t.a * t.b * res;
    tmp1 = t.a;
    t.a = fmod(t.b, 62.63);
    t.b = fmod(tmp1, 74.12);
    for (k = 0; k < 16; k++) {
      t.a *= 1.4141;
    }
    rv = 1 + t.c % 26;
    t.d += i;
    tmp2 = t.d >> rv;
    t.d <<= (USIZE - rv);
    t.d |= tmp2;
    t.a *= func2((char *)&t.c, sizeof(t.c));
  }
  return t.a * t.b + t.c + t.d;
}

int main() {
  double res = 1.541;
  double ref = 2.631;
  int i;

  for (i = 0; i < ITERATIONS; i++) {
    res = func1(res, &ref);
    res = fmod(res, ref);
  }
  printf("%lg %lg\n", res, ref);
  return 0;
}
```

After pasting, press the Return key and then the EOF key (usually Ctrl+D). If you don't get a command prompt back, press the EOF key again. If that still doesn't work, press the interrupt key (usually Ctrl+C). Then compile the program:

```
gcc -lm -o cftest cftest.c
```

If you don't get any output, everything went fine. Now it's important that you don't have any system load. Close all background programs like P2P software; the less system load you have, the more accurate the timing is. Keep in mind that even moving the mouse or typing on the keyboard costs CPU time. While the test program is running, don't do anything else, not even move the mouse.

Run the test program with:

```
time ./cftest
```

When it finishes, it prints two numbers with timing statistics below them. You can ignore the two numbers; the times are what matter. The program is intended to take a while to finish. If it takes less than 20 seconds, open ~/cftest/cftest.c in your favorite editor and increase the value of ITERATIONS from 64 to something higher. If it takes very long (more than a minute, say), decrease it (you can abort the program with the interrupt key, usually Ctrl+C). Then compile it again and see how long it takes using the two commands above.
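If you prefer not to open an editor, a GNU sed one-liner can change the iteration count in place. This is just a sketch demonstrated on a stand-in file so it runs anywhere; point it at ~/cftest/cftest.c instead:

```shell
# Stand-in file with the same #define; use ~/cftest/cftest.c instead.
printf '#define ITERATIONS (64)\n' > demo.c
# GNU sed in-place edit: double the iteration count.
sed -i 's/ITERATIONS (64)/ITERATIONS (128)/' demo.c
grep ITERATIONS demo.c    # prints: #define ITERATIONS (128)
```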

Now what you really need is the user time. It should be nearly equal to the real time. If it isn't, then you have too much system load. Remember: don't even move the mouse while running the test program. In bash it looks something like this:

```
$ time ./cftest 

0.802501 2.631

real    0m8.954s

user    0m8.700s

sys     0m0.022s
```

Save the user time somewhere, like in a text file. When this is done, close the editor if it's still open and don't change the source code again.

2. Running the tests

What you will try here is to increase program speed and therefore decrease the time it takes to finish. Set some CFLAGS to test with the following command:

```
CFLAGS="<flags>"
```

Of course I don't mean the CFLAGS-line in your make.conf file. Just enter this command into your shell. Example:

```
CFLAGS="-O2 -fomit-frame-pointer"
```

Then type the following:

```
gcc -lm $CFLAGS -o cftest cftest.c && time ./cftest
```

Now compare the user time with the one you got earlier. You can experiment with other CFLAGS: just set them with the command above and run the test again. Try to decrease the user time as much as possible. If a flag doesn't make any difference, don't use it (most flags only slow the compiler down). Keep in mind there is some run-to-run variance: only trust the first one or two fractional digits of the user time, or run the same flags multiple times.

Here are some flags you can experiment with:

 -O1, -O2, -O3 (only one of them)

 -mmmx, -m3dnow (if your machine supports them; check the flags in /proc/cpuinfo)

 -msse, -mfpmath=sse (if your machine supports SSE; check the flags in /proc/cpuinfo)

 -march=<architecture> (example: -march=pentium3)

 -fomit-frame-pointer (you will want to include this anyway)

 -funroll-loops (makes a noticeable difference, but produces large binaries)

 -maccumulate-outgoing-args (produces large binaries without a really noticeable difference in speed)

 -ftracer (doesn't seem to do anything; it's a rather new flag - don't use it)

The -pipe option does not produce faster code. It just speeds up the compilation process, so include it.

3. Even more flags (advanced users)

If you are an advanced user, you can find additional flags with descriptions in the GCC info pages with the following command:

```
info gcc
```

Position the cursor on the text "Invoking GCC::" and press the return key. Then scroll down, position the cursor on "Optimize Options::" and press the return key again. What you get then is a list of C compiler flags (CFLAGS) related to code optimization. If you want to go even further, you can set submodel specific options. Press the L key and then select "Submodel Options::". Choose your submodel there ("i386 and x86-64 Options::" for most users). You can leave the info program with the Q key.

More advanced users (like programmers) can even look for other options in the info pages. Some of them are specific to C++ (those about templates and classes). You will never want to set C++ specific flags globally in your make.conf. However if you really know what you are doing and want to set them anyway, they belong to the CXXFLAGS in your make.conf, not to the CFLAGS. You cannot test C++ flags with this test program as it's a C program. If you wanted to set a C++ flag "-fsome-flag" in your make.conf, it should look like this:

```
CXXFLAGS="$CFLAGS -fsome-flag"
```

However, you should not set any compiler options besides optimization flags and submodel options. If you have found a set of flags that produces fast code on your architecture, take them into your make.conf file.

4. Applying the new CFLAGS

After changing your make.conf, your system will still use the old flags until you recompile it. If you don't want to recompile the entire system, my suggestion is to recompile the following packages (in that order):

 gcc

 glibc

 binutils

 coreutils

 xorg-x11 (if you use it)

 gtk+ (if you use it)

 qt (if you use it)

Then recompile the programs you use most often, like your window manager and the applications you work with. Another way to apply the new CFLAGS to the most important packages:

```
emerge -e system
```

This will take some time. If you are even more patient, you can recompile your entire system:

```
emerge -e world
```

If you have lots of packages installed, you will be waiting a few days, especially if you have lots of KDE/Qt programs installed.

5. Removing the test program

When everything is done, delete the directory of the test program:

```
cd ..

rm cftest/cftest*

rmdir cftest
```

6. Enjoy your new CFLAGS

----------

## adaptr

Okay, I've given it a go, and have some troubling data to report.

First, some info:

Pentium-3 1000MHz / 256MB PC133 / ASUS CUSL2-C

gcc 3.4.3 NPTL / kernel 2.6.10-r6

My portage flags are -O2 -march=pentium-3 -fomit-frame-pointer -pipe; the entire system was built with them (stage 1 on 3).

All tests were compiled with gcc -v to check the flags in use.

Each test was run 3 times to eliminate momentary differences.

I started out with no flags and got 14.5 seconds user time. NOTE: gcc defaults to -mtune=pentiumpro, probably because CHOST is set to i686 and it copies that, but I'm not sure.

Next I put in my make.conf flags and got it down to 10.2 seconds user time, almost 30% less. Amazing, really.

Next I thought, okay, but can I reduce it further ?

So I put in -O2 and ran it 3 times again - still 10.2 seconds...

Okay, nudge it up to -O3 ... nope, still 10.2 seconds, all with minor variations of less than 0.05 seconds.

This was strange, so I decided to test it more thoroughly.

First, strip all flags again, and this time force it to build for i386 - back up to 14.5 seconds, the same as for pentium-3.

Add the -O flag back in, test again - 10.2 seconds again!

Exactly the same test time, without any optimisations above the 386!

Getting weirder now... I started putting in extra flags that, according to the gcc man page, should half break my system and half actually increase the run time: -mfpmath=sse, -fforce-addr, -funroll-all-loops, and some more stuff I forgot...

Shortening a confusing story considerably, my conclusions are worrying to say the least:

- no matter what extra optimisation flags I set, the run time is 30% less with -O than without it.

- increasing the optimisations to -O2 or -O3 has a negligible effect compared to that 30% speed increase.

- I saw no difference at all from any processor-specific optimisations, nor did enabling mmx or sse give any noticable performance gains

This I find most puzzling, especially since a good part of this test code is mathematical.

The only sensible explanation I can find for these results is one you won't like one bit: the test is biased or flawed to such an extent that you cannot actually use it to test optimisations beyond the basic on or off, -O1 or nothing.

Even so, nice to see that just adding that flag gave me a 30% boost - but again, probably only for this highly selective piece of code...

Comments ?

----------

## Rufy

I've had similar results here with similar hardware (Pentium3, 1GHz, 512MB).  In fact, so far my tests have shown the best combination is "-O1 -march=pentium3 -fomit-frame-pointer", and adding additional flags has done no better.

I've run tests like this before, but with real applications like gzip and povray, and have seen very different results.  So, without a deeper analysis of the code above, I'd say it's simply an isolated case of "less is more".

Also, optimizing applications like GCC and X involves more than testing for speedy math code.  With GCC you're doing lexical analysis, table lookups, and other fun stuff.  With X you've got tons of hardware-level calls going on, all dependent on how well supported your driver happens to be (and those of us running nvidia are stuck with their CFLAGS for the most part anyway, so not much to improve there).

A better approach to CFLAGS testing would be to build a test-bed that compiles and runs a predefined set of applications using various CFLAGS settings.  Include apps that you would normally use in any given session for more accurate results.

----------

## adaptr

Shouldn't gcc itself have a benchmark suite - or at least a conformity testing suite ?

ADDENDUM:

After thinking about this some more, this isn't actually all that simple to achieve...

Since to give a true benchmarkable solution one would have to re-flag every single dependency as well - including glibc! - it will be non-trivial to set up and execute.

Even if every library were built with the same set of flags, this would only be representative for that target system, since other systems might be built another way.

So either one would have to build one monolithic suite that is not dependent on any libraries - almost impossible I think - or a certain set of flags would be requisite for the pre-existing system to test against.

This would almost certainly require a distribution built specifically for this purpose, to enable people to fire up a CD and start testing their hardware with a wide range of flags.

Another option would be to incorporate the timing in the test suite itself; the flags the libraries were built with become less of an issue if just the internal portions of the test suite code are benchmarked.

----------

## _never_

Well, that's what I meant. Mostly you won't notice any difference. The code is written so that almost any compiler flag can be tested, but unfortunately most flags make little difference. -O1 and -O2 are just aliases for a whole lot of individual flags, which you could also enable by hand; if you test them individually, the speed gain is very low. The only flag that causes a considerable speed boost is -funroll-loops, but it has the drawback of increasing code size a lot.

As far as I've seen, the -mmmx and -m3dnow flags aren't really used unless you write your code to use them (see "vector variables" in the gcc info pages). -msse is similar, but you can force SSE to be used for floating-point arithmetic if you also set -mfpmath=sse.

Submodel-specific optimization (like the difference between pentium3 and 386) mostly just reduces cache misses, and the speed gain is not really noticeable. You might notice it a bit if you use -fno-inline-functions and then test -march=i386 vs. -march=pentium3. Here on my Duron, it makes a difference of less than 50 ms.

Well, this experiment should make clear that insane optimization does not make much of a difference. For example, in Debian most packages are compiled with just -O2 and -fomit-frame-pointer, and that is about the highest optimization level you can get without binaries twice the size. I used to use -funroll-loops because it really makes code faster, but then I realized that my whole system runs faster without it. Now I enable it only for speed-critical apps, like mplayer or games.

----------

## _never_

 *Rufy wrote:*   

> Also, optimizing applications like GCC and X involves more than testing for speedy math code.  With GCC you're doing lexical analysis, table lookups, and other fun stuff.  With X you've got tons of hardware-level calls going on, all dependent on how well supported your driver happens to be (and those of us running nvidia are stuck with their CFLAGS for the most part anyway, so not much to improve there).

 

You are absolutely right. And I have tried to write code that does most of those things. Table lookups are mostly just loops with string or integer comparison, and so is lexical analysis. Those workloads sure have a lot more overhead than this little program, but they're essentially the same, and you will get no different results. At least I didn't.

Maybe the best solution would be for the maintainer of a package to set individual optimization CFLAGS for every package. The user then wouldn't set them anymore, only submodel options (like -march and the like). If you still feel like having your own global optimization flags, you could still set them via make.conf/CFLAGS and they would override the maintainer's flags.

 *adaptr wrote:*   

> Since to give a true benchmarkable solution one would have to re-flag every single dependency as well - including glibc! - it will be non-trivial to set up and execute.

 

This is not true. If you don't use any external functions (and I didn't - the compiler inlines the math functions), then glibc isn't even touched by this program. You will still need to have glibc (or another libc), but the code itself runs without it.

----------

## adaptr

That's obvious nonsense - if your code does not need the math library then why does the linker want it ?

----------

## _never_

No, it is not nonsense. glibc is used before entering main() and after leaving it (the same goes for exit(); in fact, it's technically the same thing). The math library is used if you do not use -ffast-math. If I hadn't used math functions, the actual test loop would run without the math library at all, but still not without glibc, because it is needed outside the test loop.

----------

## adaptr

Okay.

I'm not disputing what you say (i'm sure it's correct) but take a moment to think about this:

- any meaningful benchmark for CFLAGS must take into account real-world conditions, i.e. consist of the kind of code that people actually run on their machine every day.

- any non-trivial, not-optimised-for-benchmarks program will spend up to 50% of its time in either one or more of the system libraries or in kernel calls - that's what they're for.

The key word in the above is meaningful - people tend to be impressed by raw figures of how fast your system can compute pi to a gazillion decimal places, but as a measure of your system's overall speed it's pretty useless.

Not benchmarking glibc and the kernel with your chosen CFLAGS is therefore unrealistic to say the least (I'm not gonna say silly  :Wink: )

So yes, to effectively benchmark a given, known system against a set of CFLAGS you will have to rebuild the system .

Anything less just isn't a representative benchmark.

(As an aside, I understand that this would be prohibitively costly to set up and run, especially when you want to test out multiple flag combinations in sequence... sort of rules it out for your idea of a quick'n'dirty CFLAG tester, eh ?  :Wink: )

----------

## _never_

 *adaptr wrote:*   

> (As an aside, I understand that this would be prohibitively costly to set up and run, especially when you want to test out multiple flag combinations in sequence... sort of rules it out for your idea of a quick'n'dirty CFLAG tester, eh ? ;-))

 

And that's the purpose. =)

Sure, it can't replace a system benchmark, but this test gives users an idea of how much a specific flag changes things. What gets optimized is algorithms and iterations, so this is the most realistic quick'n'dirty test.

Well, the whole story about CFLAGS is mostly hype, even worse than antivirus software and firewalls under Windows. I had had enough of nonsense like "use -funroll-loops, it makes your system a lot faster", so I wanted to show what CFLAGS actually change. You cannot get much more than 25% speed gain over the unoptimized state, and that is a maximum. In the real world, you won't even get that much.

----------

## adaptr

I'm not so sure of that - or, at least, I would start reading the gcc mailing archives for discussions about this, of which I can assure you there are plenty...

I'm having trouble accepting that the difference between -O0 and -O1 always gives the same improvement - regardless of any other optimisation flags.

I'm even more sceptical when I see from your code that using -O2 or -O3 actually decreases performance, every single time, again regardless of other flags.

They might as well not have bothered then, and yet the gcc code base is - like so many things GNU - jointly developed by hundreds of people.

I have a hard time believing nobody would have noticed this before now.

----------

## NewBlackDak

This is on the Athlon system clocked at default(cleaned my watercooling out, so it's on air until I get it back in this weekend)

CFLAGS=""

1.6531 2.631

real    0m6.832s

user    0m6.823s

sys     0m0.008s

CFLAGS="-march=athlon-xp -O3 -pipe -fomit-frame-pointer -ffast-math -mfpmath=sse,387 -msse -mmmx -m3dnow -fweb"

0.48502 2.631

real    0m5.282s

user    0m5.275s

sys     0m0.006s

CFLAGS="-march=athlon-xp -pipe -fomit-frame-pointer -ffast-math -mfpmath=sse,387 -msse -mmmx -m3dnow -fweb"

0.156335 2.631

real    0m6.842s

user    0m6.806s

sys     0m0.008s

CFLAGS="-march=athlon-xp -O2 -pipe -fomit-frame-pointer -ffast-math -mfpmath=sse,387 -msse -mmmx -m3dnow -fweb"

0.711392 2.631

real    0m5.373s

user    0m5.328s

sys     0m0.010s

----------

## _never_

 *adaptr wrote:*   

> I'm not so sure of that - or, at least, I would start reading the gcc mailing archives for discussions about this, of which I can assure you there are plenty...
> 
> I'm having trouble accepting that the difference between -O0 and -O1 always gives the same improvement - regardless of any other optimisation flags.
> 
> Even more scepsis is had when I see from your code that using -O2 or -O3 actually decreases performance - every single time, again regardless of other flags.
> ...

 

I have never said that -O3 actually increases code speed. I just suggested experimenting with the flags. And yes, depending on your architecture setting (-march) it may even decrease code speed. This might be a reason why most distributions use -O2 instead. One thing I fully agree with you on is that there can't be a single program to test compiler flags with. But this test gives users a brief idea of what influence each flag has on code size and speed. And that is the purpose.

 *NewBlackDak wrote:*   

> CFLAGS="-march=athlon-xp -O3 -pipe -fomit-frame-pointer -ffast-math -mfpmath=sse,387 -msse -mmmx -m3dnow -fweb"

 

You would never want to use -ffast-math globally. It may cause some programs to fail. Read the GCC info pages for more information.

----------

## adaptr

 *Quote:*   

> But this test gives users a brief idea of what influence each flag has on code size and speed

 

Yes, and the one and only conclusion every test seems to show is that -O1 gives 25-30% gains over -O0.

On any platform, regardless of other optimisation.

So you could just have ended your test article by saying: use -O and don't bother with anything else  :Wink: 

----------

## _never_

 *adaptr wrote:*   

>  *Quote:*   But this test gives users a brief idea of what influence each flag has on code size and speed 
> 
> Yes, and the one and only conclusion every test seems to show is that -O1 gives 25-30% gains over -O0.
> 
> On any platform, regardless of other optimisation.
> ...

 

Expressed in other words, actually. =)

Well, in fact I would recommend -O2 instead of -O. It's a good trade-off between code size and speed.

----------

## adaptr

Well, I wouldn't know about that - as I said, my tests indicate that there is a zero difference between -O2 and -O3.

However, the difference between -O1 and -O2 is on the order of -5%.

----------

## _never_

 *adaptr wrote:*   

> Well, I wouldn't know about that - as I said, my tests indicate that there is a zero difference between -O2 and -O3.
> 
> However, the difference between -O1 and -O2 is on the order of -5%.

 

Yes, but -O3 increases code size a lot because of function inlining, so I wouldn't recommend using it. It made code faster on older processors, but current processors handle function calls very well, so there isn't much need for inlining.

----------

## adaptr

You're still missing my point: -O2 is consistently 5% slower than -O1 in every combination.

So why enable it ?

----------

## _never_

 *adaptr wrote:*   

> You're still missing my point : -O2 is consistently 5% slower than -O1 in every combination.
> 
> So why enable it ?

 

Even without your -march setting?

----------

## adaptr

For any arch; I've tested it - as shown above - with pentium3, i686 and i386 - no difference whatsoever.

----------

## thebigslide

Bumping -march won't make a difference. MMX was more of a marketing thing than anything else; SSE isn't used by that code, and 3dnow isn't either. The only processor-specific optimization gcc can make on that code is alignment (minimal). There aren't enough loops, recursive functions, and other things that are optimized by the tweaky flags. Packages work the same way, though. I don't think it's that simple to make a benchmark that will tell you the best flags to use with any particular package, because each package is written a little differently and will respond to different optimizations in different ways. Perhaps a good way would be a script that takes different combinations of CFLAGS, repeatedly compiles a given package's source files, and times a subsequent execution to determine this on a case-by-case basis? Just a thought.

My specs: Athlon-XP 2.4GHz w/ 256k of L2 cache

Iterations = 1024

GCC 3.4.3-r1

```
CFLAGS="-march=pentium-mmx -O1 -pipe";gcc -lm $CFLAGS -o cftest cftest.c;time ./cftest                                                          

0.0250746 2.631

real    1m18.980s

user    1m18.973s

sys     0m0.008s
```

and optimized:

```
CFLAGS="-march=athlon -O3 -s -w -pipe -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=sse"; gcc -lm $CFLAGS -o cftest cftest.c;time ./cftest

0.636471 2.631

real    1m17.489s

user    1m17.480s

sys     0m0.008s

```

now again w/o SSE: 

```
CFLAGS="-march=athlon -O3 -s -w -pipe -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=387" gcc -lm $CFLAGS -o cftest cftest.c;time ./cftest

1.61745 2.631

real    1m17.519s

user    1m17.512s

sys     0m0.006s
```

and the difference there is negligible, but repeateably apparent.  Now to see if Olevel does anything for us: 

```
CFLAGS="-march=athlon -O1 -s -w -pipe -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=sse";gcc -lm $CFLAGS -o cftest cftest.c;time ./cftest

0.619176 2.631

real    1m18.379s

user    1m18.372s

sys     0m0.007s
```

I think those results speak for themselves.

I think it would use the SSE registers more if one did some math with matrices of floats.

In summation, I get about 2% optimization max.

Last edited by thebigslide on Wed Mar 02, 2005 2:05 am; edited 2 times in total

----------

## adaptr

Erm.. according to the gcc manual using -mfpmath=sse forces it to use SSE for anything that floats...

----------

## _never_

 *adaptr wrote:*   

> For any arch; I've tested it - as shown above - with pentium3, i686 and i386 - no difference whatsoever.

 

I own a Duron 1.6 GHz. This isn't true for me: on my system it improves code speed, but just by a few milliseconds. It also reduces code size a bit (not only for this program but for a few others as well), so I set -O2. It might be different on other architectures.

 *thebigslide wrote:*   

> MMX was more of a marketing thing than anything else, SSE isn't used by that code. 3dnow isn't either.

 

MMX doesn't make any difference. Even with a little assembler code I wrote, which makes use of saturated addition (an MMX feature) on large blocks, it didn't increase code speed at all. Intel implemented MMX mostly for advertising purposes, I guess. I have tested this on many architectures, even non-Intel.

SSE is used as soon as you add -mfpmath=sse or -mfpmath=sse,387. However, it doesn't make much difference either. 3dnow isn't used, that's right, but it might be used in future versions of GCC.

Currently MMX and 3dnow are only used if you work with vector variables, which this code intentionally doesn't use, since there are very few packages that use them.

 *thebigslide wrote:*   

> There aren't enough loops and recursive functions and other things that are optimized by the tweaky flags.

 

There are enough loops for the flags to optimize (or deoptimize) noticeably. The code optimizer doesn't handle recursion specially anyway.

And, thebigslide, you made one mistake when performing the test: your commands just run gcc with empty CFLAGS. Test this command sequence:

```
ABC="test1"

ABC="test2" echo $ABC

ABC="test3" echo $ABC
```

You'll get "test1" both times. If I tell you to set the CFLAGS variable separately, I have some good reason to do so.
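The reason, for anyone puzzled: in `VAR=value command`, the assignment only goes into that one command's environment, and any `$VAR` on the same line has already been expanded by the current shell. A small sketch:

```shell
ABC="test1"
# The current shell expands $ABC *before* running echo, so the
# temporary assignment never affects what echo prints:
ABC="test2" echo "$ABC"            # prints: test1
# A child shell expands the variable itself, so it *does* see
# the temporary value:
ABC="test3" sh -c 'echo "$ABC"'    # prints: test3
# And the current shell's own variable is untouched afterwards:
echo "$ABC"                        # prints: test1
```

This is why the article says to set CFLAGS on its own line first and only then run gcc with $CFLAGS.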

----------

## thebigslide

 *adaptr wrote:*   

> Erm.. according to the gcc manual using -mfpmath=sse forces it to use SSE for anything that floats...

 

True, it will use the SSE registers, but the benefit of SSE is doing math on multiple values simultaneously. That won't happen with the above code, as all our results show; otherwise turning on SSE would give a 20-25% boost (my off-the-cuff estimate) in performance. Try recompiling mencoder with and without SSE and benchmarking it and you'll see what I mean. SSE is a HUGE boost in float applications like encoding audio and video. Not sure how lame would react.

----------

## thebigslide

It turns out lame doesn't react to CFLAGS much (note that I axed -ffast-math, which is on by default there to favour more intensive use of the float registers). The configure script actually hard-codes the flags into the makefiles, but this is easily overridden:

CFLAGS="-march=athlon-xp -O2 -s -w -pipe -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=sse"

time lame -mj -t -S -h -cbr -b 64 -Y mt\ -\ i\ don\'t\ wanna\ hear\ it.wav mt\ -\ i\ don\'t\ wanna\ hear\ it.mp3

LAME version 3.96.1 (http://lame.sourceforge.net/)

CPU features: MMX (ASM used), 3DNow! (ASM used), SSE

Resampling:  input 44.1 kHz  output 24 kHz

Using polyphase lowpass filter, transition band: 10935 Hz - 11226 Hz

Encoding mt - i don't wanna hear it.wav to mt - i don't wanna hear it.mp3

Encoding as 24 kHz  64 kbps j-stereo MPEG-2 Layer III (12x) qval=2

real    0m4.151s

user    0m4.100s (+/-.015 over 10 trials)

sys     0m0.030s

CFLAGS="-march=athlon-xp -O2 -s -w -pipe -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=387"

LAME version 3.96.1 (http://lame.sourceforge.net/)

CPU features: MMX (ASM used), 3DNow! (ASM used), SSE

Resampling:  input 44.1 kHz  output 24 kHz

Using polyphase lowpass filter, transition band: 10935 Hz - 11226 Hz

Encoding mt - i don't wanna hear it.wav to mt - i don't wanna hear it.mp3

Encoding as 24 kHz  64 kbps j-stereo MPEG-2 Layer III (12x) qval=2

real    0m4.148s

user    0m4.100s (+/-.025 over 10 trials)

sys     0m0.031s

No difference

CFLAGS="-march=pentium-mmx -O1 -s -w -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=387"

LAME version 3.96.1 (http://lame.sourceforge.net/)

CPU features: MMX (ASM used), 3DNow! (ASM used), SSE

Resampling:  input 44.1 kHz  output 24 kHz

Using polyphase lowpass filter, transition band: 10935 Hz - 11226 Hz

Encoding mt - i don't wanna hear it.wav to mt - i don't wanna hear it.mp3

Encoding as 24 kHz  64 kbps j-stereo MPEG-2 Layer III (12x) qval=2

real    0m4.350s

user    0m4.282s (+/-.020! over 10 trials)

sys     0m0.031s

Not sure why it's still saying 3DNow! ASM is used, but it seems to have been affected by the CFLAGS setting this time.

EDIT: I botched the first run through this by not doing a make clean after the first test  :Embarassed: 

Last edited by thebigslide on Wed Mar 02, 2005 4:01 am; edited 1 time in total

----------

## thebigslide

mencoder, I compiled by hand because of how ebuild wonks the CFLAGS:

CFLAGS="-march=athlon-xp -O2 -s -w -pipe -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=sse" 

time mencoder -nosound -ovc libdv randomporno.mpg -o /incoming/randomporno.dv

real    1m17.181s

user    1m12.578s  (+/-.011) over 10 results

sys     0m1.545s

CFLAGS="-march=pentium -O2 -s -w -pipe -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=387"

time mencoder -nosound -ovc libdv randomporno.mpg -o /incoming/randomporno.dv

real    1m33.064s

user    1m27.494s  (+/-.022) over 10 results

sys     0m1.730s

for an improvement with SSE, MMX, 3Dnow and 3Dnow+ of 20.5%

Now to see if 3dnow or 3dnow+ did anything

CFLAGS="-mtune=athlon-xp -O2 -s -w -pipe -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=sse -march=pentium3"  

time mencoder -nosound -ovc libdv randomporno.mpg -o /incoming/randomporno.dv

real    1m19.201s

user    1m14.099s  (+/-.019) over 10 results

sys     0m1.677s

3Dnow and 3Dnow+ ARE being used and they're worth about 2% in this benchmark

Finally, a test to see if O2 is of benefit over O1

CFLAGS="-march=athlon-xp -O1 -s -w -pipe -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=sse"

time mencoder -nosound -ovc libdv randomporno.mpg -o /incoming/randomporno.dv

real    1m19.287s

user    1m13.990s (+/-.006) over 10 results

sys     0m1.596s

This is kinda dumb, but let's see what O3 does for us

CFLAGS="-march=athlon-xp -O3 -s -w -pipe -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer -mfpmath=sse"

time mencoder -nosound -ovc libdv randomporno.mpg -o /incoming/randomporno.dv

real    1m19.786s

user    1m13.788s (+/-.012) over 10 results

sys     0m1.649s

**Notice that O3 made it SLOWER than O2.**

----------

## nerdbert

 *_never_ wrote:*   

>  -ftracer (doesn't seem to do anything; it's a rather new flag - don't use it)

 

-ftracer was introduced in 3.4. It doesn't optimize code by itself, but it looks at different "paths" through the code separately, which can help the compiler find more optimizations. If you have a condition a which leads either to code b or c, both of which lead to d, it considers a-b-d and a-c-d separately before final code generation, giving the compiler more hints about what could be optimized.

(don't flame me if the terms I use aren't accurate - I'm not really into compilers)

I just read an article comparing various compiler flags for gcc 2.95, 3.2.3, 3.4.3, 3.5 and 4.0 (snapshot from January). They compiled SPECint2000 and SPECfp2000 for several CPUs (including 64 bit). For most of the benchmarks -ftracer provided a minimal speedup, but some actually ran slower.

----------

## _never_

 *nerdbert wrote:*   

>  *_never_ wrote:*    -ftracer (doesn't seem to do anything; it's a rather new flag - don't use it) 
> 
> -ftracer was introduced in 3.4. It doesn't optimize code by itself, but it looks at different "paths" through the code separately, which can help the compiler find more optimizations. If you have a condition a which leads either to code b or c, both of which lead to d, it considers a-b-d and a-c-d separately before final code generation, giving the compiler more hints about what could be optimized.
> 
> (don't flame me if the terms I use aren't accurate - I'm not really into compilers)

 

No flames for inaccurate terms, but the version information is wrong: the -ftracer option is already present in GCC 3.3, which I run myself. I have had bad experiences with it. It sometimes makes code slower, and what's even worse, some programs crash when compiled with this flag.

----------

## koder

I have seen a significant boost using -O3 instead of -O2 or -O1 on my system. I am using an Intel Pentium 4 2.4B (that's the one with a 533 MHz FSB and without HT, on Socket 478).

I was writing a small tool to calculate something for my dad, and it used a lot of nested loops. Compiling with -march=pentium4 -O3 increased the speed by up to 60% for this application. But once again, this is not a real-world situation.

For that, I refer to my system. I installed Gentoo 2004.3 from stage 1 and optimized it from the beginning. And I can tell you that having an optimized gcc really does make a difference! glibc, the kernel... if it all runs at full speed, you get a faster system.

And yes, it may increase file size, but my entire system still fits in 3 GB, and that includes every single thing I use or may need soon!! That may have been pretty large (back in the days when we had hard disks of less than 100 MB), but on an average system with a 200 GB hard drive and 768 MB of RAM... it's not really that much of a problem.

I would like to add one strange thing however:

I recently migrated to kernel 2.6 and emerged glibc (to get the NPTL feature), and strangely enough, the compilation doesn't seem to use $CFLAGS as defined in /etc/make.conf! That surprised me...

greetz

koder

----------

## thebigslide

Some packages filter flags which are known to cause issues.  If you would like to live on the edge, you can make a copy of the particular ebuild in a portage overlay and modify it to your heart's content.  It will probably break, though.

----------

## kimchi_sg

 *koder wrote:*   

> I recently migrated to kernel 2.6 and emerged glibc (to get the NPTL feature), and strangely enough, the compilation doesn't seem to use $CFLAGS as defined in /etc/make.conf! That surprised me...

 

That's because glibc is one of the packages which have `strip-flags` in the ebuild.

This will pare your CFLAGS down to nothing more than -march, -O2, -fomit-frame-pointer and -pipe.

Remove that line from the ebuild at your own peril.
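The effect of strip-flags can be approximated in plain bash. This is a simplified sketch, not the actual portage eclass code — it keeps only the flags kimchi_sg lists and drops everything else:

```shell
# Simplified imitation of portage's strip-flags: keep only -march=*,
# -O*, -fomit-frame-pointer and -pipe; drop all other flags.
strip_flags() {
    local flag kept=""
    for flag in $1; do
        case "$flag" in
            -march=*|-O*|-fomit-frame-pointer|-pipe)
                kept="$kept $flag" ;;
        esac
    done
    echo "${kept# }"
}

strip_flags "-march=athlon-xp -O2 -s -w -pipe -ftracer -mfpmath=sse"
# -> -march=athlon-xp -O2 -pipe
```

So even an aggressive CFLAGS line in /etc/make.conf reaches packages like glibc only in this pared-down form.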

----------

