# Making full use of cpu registers in CFLAGS

## TheCoop

It may surprise you, but -march=<cpu> does not turn on support for 3dnow, mmx, sse or sse2, even if your CPU supports it.

First, to check which extensions your CPU supports, run 'cat /proc/cpuinfo' and look for the 'flags' line. Anything your CPU supports will be listed there, including mmx, 3dnow and sse/sse2.

Next, to alter your CFLAGS to use those registers:

```
-mmmx -m3dnow -msse -msse2
```

(Delete any your CPU doesn't support.)
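If you'd rather not pick the options out by eye, a rough Python sketch like the one below can do the filtering. The mapping table and the `isa_cflags` name are made up for illustration (they are not part of any existing tool), and the sample text is the flags line of a Celeron (Coppermine).

```python
# Hypothetical helper: map /proc/cpuinfo flag names to the gcc options above.
CPUINFO_TO_GCC = {
    "mmx": "-mmmx",
    "3dnow": "-m3dnow",
    "sse": "-msse",
    "sse2": "-msse2",
}


def isa_cflags(cpuinfo_text):
    """Return the gcc -m options for features on the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "flags":
            present = set(value.split())
            return [opt for flag, opt in CPUINFO_TO_GCC.items() if flag in present]
    return []  # no 'flags' line found


# Sample flags line from a Celeron (Coppermine).
SAMPLE = ("flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr "
          "pge mca cmov pat pse36 mmx fxsr sse")

if __name__ == "__main__":
    print(" ".join(isa_cflags(SAMPLE)))
```

On a real machine you would feed it the contents of /proc/cpuinfo instead of `SAMPLE`.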

If you've got sse support you can also add '-mfpmath=sse,387' so that floating-point math uses both the sse and the normal coprocessor (387) registers, which can increase your math throughput.

This should result in faster programs as well as more effective use of the CPU.

----------

## charlieg

Can anybody testify as to the stability of this?

Plus, is this true of all flags that appear under cat /proc/cpuinfo?

You can use mine as an example.  :Very Happy: 

Flags only:

```
# cat /proc/cpuinfo

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
```

Full output:

```
# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Celeron (Coppermine)
stepping        : 6
cpu MHz         : 728.292
cache size      : 128 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 1433.60
```

----------

## charlieg

 *TheCoop wrote:*   

> It may surprise you, but -march=<cpu> does not turn on support for 3dnow, mmx, sse or sse2, even if your CPU supports it.

 

The freehackers.org gccflags faq has more info, although I can't work out whether it agrees or disagrees with TheCoop's statement.

----------

## ZuNBiD

... and where do I alter my CFLAGS ??

----------

## Beekster

You can set your CFLAGS in /etc/make.conf

from my /etc/make.conf

CFLAGS="-march=pentium4 -O3 -pipe -mmmx -msse -msse2 -fomit-frame-pointer"

----------

## TheCoop

You can check for yourself by verbosely compiling a very small empty C program, which lists all the flags used. It has the -march, but it doesn't list -mmmx, -m3dnow or -msse.

I know it's somewhere in the forums; try searching for 'gcc flags' or something. I'll have a more thorough look when I get home.

My CFLAGS are

```
-march=athlon-xp -O3 -pipe -fomit-frame-pointer -ffast-math -mmmx -m3dnow -msse -mfpmath=sse,387
```

and I've had no problems at all: rock-solid stable, even with a Barton 2500+ overclocked to 2800+.

----------

## Beekster

Just a warning that "man gcc" states:

 *Quote:*   

> sse,387
> 
> Attempt to utilize both instruction sets at once.  This effectivly double the amount of available registers and on chips with separate execution units for 387 and SSE the execution resources too.  Use this option with care, as it is still experimental, because gcc register allocator does not model separate functional units well resulting in instable performance.

 

I'm not saying it's unstable, just letting people know it is not claimed to be stable...   :Smile:  Sounds like it could give good gains.  Anyone else running with "-mfpmath=sse,387"?

----------

## chadders

I compile with these CFLAGS and it works great for me:

CFLAGS="-march=pentium4 -O3 -pipe -mmmx -msse -msse2 -mfpmath=sse -pipe -fomit-frame-pointer -fthread-jumps -fforce-addr -frerun-cse-after-loop -frerun-loop-opt -fexpensive-optimizations -falign-functions=4 -falign-jumps=4"

Chad   :Very Happy: 

----------

## lotusvale

does athlon tbird 1.4 support sse, sse2?

(i know it supports 3dnow and mmx.)

thx

----------

## Malakin

 *Quote:*   

> does athlon tbird 1.4 support sse, sse2?

 "cat /proc/cpuinfo" to see what's supported. AMD cpu's don't support sse until Athlon-xp/Duron morgan(1ghz+). No AMD cpu's currently available support sse2.

----------

## kappax

 *Malakin wrote:*   

> *Quote:* does athlon tbird 1.4 support sse, sse2?
>
> "cat /proc/cpuinfo" to see what's supported. AMD cpu's don't support sse until Athlon-xp/Duron morgan(1ghz+). No AMD cpu's currently available support sse2.

 

XP 2400+'s, last I checked, support sse2.

----------

## lotusvale

thx.

 *Quote:*   

> shadowrider@localhost shadowrider $ cat /proc/cpuinfo 
> 
> processor       : 0
> 
> vendor_id       : AuthenticAMD
> ...

 

So for my tbird 1.4, would this setup be the fastest and best?

(I've read that -O3 makes the binaries bigger, and that they end up slower, especially when loading programs.)

```
CFLAGS="-march=athlon-tbird -O2 -pipe -mmmx -m3dnow -fomit-frame-pointer -frerun-cse-after-loop -frerun-loop-opt -fexpensive-optimizations -falign-functions=4 -ffast-math -mfpmath=sse,387"
```

or if anybody has a better setup?

----------

## Malakin

 *Quote:*   

> xp2400's last i check support sse2

 Tomshardware - "SSE code is still accepted - but SSE2 is not."

http://www.tomshardware.com/cpu/20030210/barton-02.html

----------

## kappax

 *Malakin wrote:*   

> *Quote:* xp2400's last i check support sse2
>
> Tomshardware - "SSE code is still accepted - but SSE2 is not."
>
> http://www.tomshardware.com/cpu/20030210/barton-02.html

 

I am not too sure where Mr. Tom gets his info from, but you can get some XP 2400+'s with sse2 support. I know several people that have them enabled, and have the BIOSes that support it on the k7s5a.

check out the sis section and ocworkbench.com

----------

## Malakin

 *Quote:*   

> I am not too sure where Mr. Tom gets his info from, but you can get some XP 2400+'s with sse2 support. I know several people that have them enabled, and have the BIOSes that support it on the k7s5a.

 You're likely confusing this with plain sse support. Do a search for "sse2 barton" on google and you can read lots of reviews etc that mention the Barton doesn't have sse2 support. There have been no significant instruction set changes to the Athlon since the Palomino. sse2 support from AMD won't be seen until hammer.

----------

## kappax

 *Malakin wrote:*   

> *Quote:* I am not too sure where Mr. Tom gets his info from, but you can get some XP 2400+'s with sse2 support. I know several people that have them enabled, and have the BIOSes that support it on the k7s5a.
>
> You're likely confusing this with plain sse support. Do a search for "sse2 barton" on google and you can read lots of reviews etc that mention the Barton doesn't have sse2 support. There have been no significant instruction set changes to the Athlon since the Palomino. sse2 support from AMD won't be seen until hammer.

 

Gaahh, I am getting confused (I was at school so I could not look in my bookmarks). Oh well :/  :Sad:

----------

## Vazagi

 *Quote:*   

> It may surprise you, but -march=<cpu> does not turn on support for 3dnow, mmx, sse or sse2, even if your CPU supports it.

 

I'm getting curious. How can it be that the '-mno-sse2' flag fixes problems with '-march=pentium4' using GCC 3.2.2, if '-march=<cpu-type>' doesn't enable these flags in the first place? The '-mno-sse2' fix seems to indicate that '-march=<cpu-type>' does enable these flags. =/

----------

## jesterspet

 *Vazagi wrote:*   

> I'm getting curious. How can it be that the '-mno-sse2' flag fixes problems with '-march=pentium4' using GCC 3.2.2, if '-march=<cpu-type>' doesn't enable these flags in the first place? The '-mno-sse2' fix seems to indicate that '-march=<cpu-type>' does enable these flags. =/

 

Even though '-march=<cpu-type>' does not enable the sse2 flags, other optimisations can enable them when used in conjunction with '-march=<cpu-type>'.  Exactly what those 'other optimisations' are is beyond me.  I think you would have to play around with them to figure out what they are.

I used these flags and have had no problems* with GCC 3.2.2:

```
CFLAGS="-s -march=pentium4 -mmmx -msse -msse2 -Os -fomit-frame-pointer -pipe -fexpensive-optimizations -fpic -frerun-cse-after-loop -frerun-loop-opt -foptimize-register-move -masm=intel"
```

* This includes the bug addressed here

----------

## vikwiz

 *jesterspet wrote:*   

> I used these flags and have had no problems* with GCC 3.2.2:
> 
> ```
> CFLAGS="-s -march=pentium4 -mmmx -msse -msse2 -Os -fomit-frame-pointer -pipe -fexpensive-optimizations -fpic -frerun-cse-after-loop -frerun-loop-opt -foptimize-register-move -masm=intel"
> ```
> ...

 

Are you *very* sure of this? Did you try that python int conversion stuff also?

I compiled a Gentoo tree with -march=pentium4, and when it finished I saw that thread, tried the 'int' code, and it really fails. So I didn't try booting from this tree. What makes the difference for you? All the other flags? I used only '-march=pentium4 -O2 -pipe'.

----------

## cerri

There are a lot of people asking "which is the best setup for XYZ cpu?". So, why not post the best setup indexed by cpu?

E.g. Pentium III (M) =

CFLAGS="-march=pentium3 -O3 -pipe -fomit-frame-pointer -fforce-addr -falign-functions=4 -fprefetch-loop-arrays"

Anyway, as reported by freehackers.org, -march=pentium3 implies -mmmx -msse...

----------

## tempy

Okay, the source of the confusion seems to be that -march enables some CPU defines (e.g. -D__SSE__ -D__MMX__ -D__3dNOW__ -D__3dNOW_A__) but does not tell GCC to actually generate its own SSE, MMX or 3DNow assembly.

What's odd is that when you specify -mmmx or any of the other CPU features, gcc -v -Q decides you are actually requesting "-mmmx -mno-mmx" at the same time. Neither is present without it, and I don't know which takes precedence. :/

-mfpmath=sse doesn't override itself, but it may require -mno-80387 to actually work.

My CPU is an Athlon XP 1700+ and my default cflags are "-march=athlon-xp -O2 -ggdb -pipe". Short and to the point.

----------

## link97381

Ahhh man now I'm gonna have to recompile EVERYTHING!!!  I have  Dual Athlon XP's so should I set mine to 

```
-march=athlon-xp -O3 -pipe -fomit-frame-pointer -ffast-math -mmmx -m3dnow -msse -mfpmath=sse,387
```

 or do you have any other suggestions?

----------

## TheCoop

At last!

to check which cflags are implied by certain options, create an empty.c:

```
int main(void) { return 0; }
```

and run:

```
gcc -v -Q empty.c <options>
```

All the flags passed are in the output.

with no options you get:

```
*snip*
GNU C version 3.2.2 20030322 (Gentoo Linux 1.4 3.2.2-r2) (i686-pc-linux-gnu)
        compiled by GNU C version 3.2.2 20030322 (Gentoo Linux 1.4 3.2.2-r2).
options passed:  -lang-c -v -D__GNUC__=3 -D__GNUC_MINOR__=2
 -D__GNUC_PATCHLEVEL__=2 -D__GXX_ABI_VERSION=102 -D__ELF__ -Dunix
 -D__gnu_linux__ -Dlinux -D__ELF__ -D__unix__ -D__gnu_linux__ -D__linux__
 -D__unix -D__linux -Asystem=posix -D__NO_INLINE__ -D__STDC_HOSTED__=1
 -Acpu=i386 -Amachine=i386 -Di386 -D__i386 -D__i386__ -D__tune_i686__
 -D__tune_pentiumpro__
options enabled:  -fpeephole -ffunction-cse -fkeep-static-consts
 -fpcc-struct-return -fgcse-lm -fgcse-sm -fsched-interblock -fsched-spec
 -fbranch-count-reg -fcommon -fgnu-linker -fargument-alias -fident
 -fmath-errno -ftrapping-math -m80387 -mhard-float -mno-soft-float
 -mieee-fp -mfp-ret-in-387 -mcpu=pentiumpro -march=i386
*snip*
```

and with -march=athlon-xp added you get:

```
*snip*
GNU C version 3.2.2 20030322 (Gentoo Linux 1.4 3.2.2-r2) (i686-pc-linux-gnu)
        compiled by GNU C version 3.2.2 20030322 (Gentoo Linux 1.4 3.2.2-r2).
options passed:  -lang-c -v -D__GNUC__=3 -D__GNUC_MINOR__=2
 -D__GNUC_PATCHLEVEL__=2 -D__GXX_ABI_VERSION=102 -D__ELF__ -Dunix
 -D__gnu_linux__ -Dlinux -D__ELF__ -D__unix__ -D__gnu_linux__ -D__linux__
 -D__unix -D__linux -Asystem=posix -D__NO_INLINE__ -D__STDC_HOSTED__=1
 -Acpu=i386 -Amachine=i386 -Di386 -D__i386 -D__i386__ -D__athlon
 -D__athlon__ -D__athlon_sse__ -D__tune_athlon__ -D__tune_athlon_sse__
 -D__SSE__ -D__MMX__ -D__3dNOW__ -D__3dNOW_A__ -march=athlon-xp
options enabled:  -fpeephole -ffunction-cse -fkeep-static-consts
 -fpcc-struct-return -fgcse-lm -fgcse-sm -fsched-interblock -fsched-spec
 -fbranch-count-reg -fcommon -fgnu-linker -fargument-alias -fident
 -fmath-errno -ftrapping-math -m80387 -mhard-float -mno-soft-float
 -mieee-fp -mfp-ret-in-387 -mcpu=athlon-xp -march=athlon-xp
*snip*
```

Note the lack of any -mmmx, -m3dnow or -msse
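The two dumps do differ in their -D defines, though. To compare them without squinting, here is a small Python sketch that pulls the feature macros out of the "options passed" text. The `feature_macros` name and the list of macros to look for are my own picks for illustration; the sample strings are trimmed from the two outputs above.

```python
import re

# Macros to look for; this list is an illustrative choice, not exhaustive.
FEATURE_MACROS = ("__MMX__", "__SSE__", "__SSE2__", "__3dNOW__", "__3dNOW_A__")


def feature_macros(gcc_output):
    """Return the feature macros that appear as -D definitions in the text."""
    defined = set(re.findall(r"-D(\w+)", gcc_output))
    return [macro for macro in FEATURE_MACROS if macro in defined]


# Tail of the "options passed" section with -march=athlon-xp.
WITH_MARCH = """\
 -Acpu=i386 -Amachine=i386 -Di386 -D__i386 -D__i386__ -D__athlon
 -D__athlon__ -D__athlon_sse__ -D__tune_athlon__ -D__tune_athlon_sse__
 -D__SSE__ -D__MMX__ -D__3dNOW__ -D__3dNOW_A__ -march=athlon-xp
"""

# Tail of the "options passed" section with no extra options.
NO_MARCH = """\
 -Acpu=i386 -Amachine=i386 -Di386 -D__i386 -D__i386__ -D__tune_i686__
 -D__tune_pentiumpro__
"""

if __name__ == "__main__":
    print("with -march:", feature_macros(WITH_MARCH))
    print("without    :", feature_macros(NO_MARCH))
```

This only reports which macros are defined; it says nothing about which instructions gcc actually emits.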

----------

## wharper

Hmm, it seems that gcc sets the optimizations correctly within the compiler?

This is a snippet from: http://www.freehackers.org/gentoo/gccflags/faq.html

I have not looked at the source but this seems correct. This is also why you did not see the compiler flags on the command line when looking at the compiled binary.

In the gcc source, have a look at the file gcc-3.2/gcc/config/i386/i386.c Here's an excerpt :

Options implied by -march=

```
const processor_alias_table[] =
    {
      {"i386", PROCESSOR_I386, 0},
      {"i486", PROCESSOR_I486, 0},
      {"i586", PROCESSOR_PENTIUM, 0},
      {"pentium", PROCESSOR_PENTIUM, 0},
      {"pentium-mmx", PROCESSOR_PENTIUM, PTA_MMX},
      {"i686", PROCESSOR_PENTIUMPRO, 0},
      {"pentiumpro", PROCESSOR_PENTIUMPRO, 0},
      {"pentium2", PROCESSOR_PENTIUMPRO, PTA_MMX},
      {"pentium3", PROCESSOR_PENTIUMPRO, PTA_MMX | PTA_SSE | PTA_PREFETCH_SSE},
      {"pentium4", PROCESSOR_PENTIUM4, PTA_SSE | PTA_SSE2 |
                                       PTA_MMX | PTA_PREFETCH_SSE},
      {"k6", PROCESSOR_K6, PTA_MMX},
      {"k6-2", PROCESSOR_K6, PTA_MMX | PTA_3DNOW},
      {"k6-3", PROCESSOR_K6, PTA_MMX | PTA_3DNOW},
      {"athlon", PROCESSOR_ATHLON, PTA_MMX | PTA_PREFETCH_SSE | PTA_3DNOW
                                   | PTA_3DNOW_A},
      {"athlon-tbird", PROCESSOR_ATHLON, PTA_MMX | PTA_PREFETCH_SSE
                                         | PTA_3DNOW | PTA_3DNOW_A},
      {"athlon-4", PROCESSOR_ATHLON, PTA_MMX | PTA_PREFETCH_SSE | PTA_3DNOW
                                    | PTA_3DNOW_A | PTA_SSE},
      {"athlon-xp", PROCESSOR_ATHLON, PTA_MMX | PTA_PREFETCH_SSE | PTA_3DNOW
                                      | PTA_3DNOW_A | PTA_SSE},
      {"athlon-mp", PROCESSOR_ATHLON, PTA_MMX | PTA_PREFETCH_SSE | PTA_3DNOW
                                      | PTA_3DNOW_A | PTA_SSE},
    };
```
```
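To make the table easier to query, here it is transcribed into a Python dict as a rough sketch. The `MARCH_FEATURES` and `implied_features` names are invented for this example, the PTA_ prefixes are dropped, and it only reflects the gcc 3.2 excerpt above, not other gcc versions.

```python
# Feature sets transcribed from the processor_alias_table excerpt (gcc 3.2).
_ATHLON = {"MMX", "PREFETCH_SSE", "3DNOW", "3DNOW_A"}
_ATHLON_SSE = _ATHLON | {"SSE"}

MARCH_FEATURES = {
    "i386": set(), "i486": set(), "i586": set(), "pentium": set(),
    "pentium-mmx": {"MMX"},
    "i686": set(), "pentiumpro": set(),
    "pentium2": {"MMX"},
    "pentium3": {"MMX", "SSE", "PREFETCH_SSE"},
    "pentium4": {"SSE", "SSE2", "MMX", "PREFETCH_SSE"},
    "k6": {"MMX"},
    "k6-2": {"MMX", "3DNOW"},
    "k6-3": {"MMX", "3DNOW"},
    "athlon": _ATHLON,
    "athlon-tbird": _ATHLON,
    "athlon-4": _ATHLON_SSE,
    "athlon-xp": _ATHLON_SSE,
    "athlon-mp": _ATHLON_SSE,
}


def implied_features(march):
    """Return the sorted feature names the table lists for a -march value."""
    return sorted(MARCH_FEATURES[march])


if __name__ == "__main__":
    for cpu in ("athlon-tbird", "athlon-xp", "pentium4"):
        print(cpu, implied_features(cpu))
```

Going by this table, athlon-tbird has no SSE entry at all, while athlon-xp has SSE but not SSE2.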

It also seems that -O3 enables all of the other nifty flags as well?

```
  if (optimize >= 1)
    {
      flag_defer_pop = 1;
      flag_thread_jumps = 1;
#ifdef DELAY_SLOTS
      flag_delayed_branch = 1;
#endif
#ifdef CAN_DEBUG_WITHOUT_FP
      flag_omit_frame_pointer = 1;
#endif
      flag_guess_branch_prob = 1;
      flag_cprop_registers = 1;
    }
  if (optimize >= 2)
    {
      flag_optimize_sibling_calls = 1;
      flag_cse_follow_jumps = 1;
      flag_cse_skip_blocks = 1;
      flag_gcse = 1;
      flag_expensive_optimizations = 1;
      flag_strength_reduce = 1;
      flag_rerun_cse_after_loop = 1;
      flag_rerun_loop_opt = 1;
      flag_caller_saves = 1;
      flag_force_mem = 1;
      flag_peephole2 = 1;
#ifdef INSN_SCHEDULING
      flag_schedule_insns = 1;
      flag_schedule_insns_after_reload = 1;
#endif
      flag_regmove = 1;
      flag_strict_aliasing = 1;
      flag_delete_null_pointer_checks = 1;
      flag_reorder_blocks = 1;
    }
  if (optimize >= 3)
    {
      flag_inline_functions = 1;
      flag_rename_registers = 1;
    }
```
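The levels are cumulative, which a few lines of Python can show. This is only a sketch of the excerpt above: the `LEVEL_FLAGS` table and `flags_enabled` function are names I made up, and the #ifdef-guarded flags are left out because whether they apply depends on the target.

```python
# Flags set unconditionally at each level in the excerpt above
# (the "flag_" prefix is dropped; #ifdef-guarded ones are omitted).
LEVEL_FLAGS = {
    1: ["defer_pop", "thread_jumps", "guess_branch_prob", "cprop_registers"],
    2: ["optimize_sibling_calls", "cse_follow_jumps", "cse_skip_blocks",
        "gcse", "expensive_optimizations", "strength_reduce",
        "rerun_cse_after_loop", "rerun_loop_opt", "caller_saves",
        "force_mem", "peephole2", "regmove", "strict_aliasing",
        "delete_null_pointer_checks", "reorder_blocks"],
    3: ["inline_functions", "rename_registers"],
}


def flags_enabled(optimize):
    """Mimic the 'if (optimize >= N)' chain: each level adds to the previous."""
    enabled = []
    for level in sorted(LEVEL_FLAGS):
        if optimize >= level:
            enabled.extend(LEVEL_FLAGS[level])
    return enabled


if __name__ == "__main__":
    for level in (1, 2, 3):
        print(f"-O{level}: {len(flags_enabled(level))} flags")
```

Assuming the usual flag_foo_bar to -ffoo-bar naming, this suggests that options such as -fthread-jumps, -fexpensive-optimizations, -frerun-cse-after-loop and -frerun-loop-opt are already covered once -O2 or -O3 is given.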

my .02

----------

## janderson

 *jesterspet wrote:*   

> 
> 
> I used these flags and have had no problems* with GCC 3.2.2:
> 
> ```
> ...

 

Seems you've gone to some trouble to optimize for speed, but you may be incurring some not-so-good penalties by using -Os. I believe -Os causes certain data types, functions and perhaps some other things to be misaligned to make the binary smaller. The misalignment will mean that you take a significant performance hit when fetching misaligned data. Since you probably have a good chunk of memory and a fast CPU, you probably don't want to use -Os.

Cheers,

jon

----------

## Gnufsh

I know that with -march=athlon-xp, sse, 3dnow, and mmx are turned on; adding -msse, -m3dnow, and -mmmx on top automatically turns on -mno-sse, -mno-mmx, and -mno-3dnow as well. Note the gcc -march=athlon-xp -Q -v output of a test file:

 *Quote:*   

> 
> 
> options passed:  -lang-c -v -D__GNUC__=3 -D__GNUC_MINOR__=2
> 
>  -D__GNUC_PATCHLEVEL__=2 -D__GXX_ABI_VERSION=102 -D__ELF__ -Dunix
> ...

 

see the -D__SSE__ -D__MMX__ -D__3dNOW__ -D__3dNOW_A__? these are the macros that turn sse and such on. Now without the -march:

 *Quote:*   

> 
> 
> options passed:  -lang-c -v -D__GNUC__=3 -D__GNUC_MINOR__=2
> 
>  -D__GNUC_PATCHLEVEL__=2 -D__GXX_ABI_VERSION=102 -D__ELF__ -Dunix
> ...

 

and gcc -Q -v -mmmx -msse -m3dnow:

 *Quote:*   

> 
> 
> options passed:  -lang-c -v -D__GNUC__=3 -D__GNUC_MINOR__=2
> 
>  -D__GNUC_PATCHLEVEL__=2 -D__GXX_ABI_VERSION=102 -D__ELF__ -Dunix
> ...

 

see how the -D stuff changes?

One more with -march and -msse, etc:

 *Quote:*   

> 
> 
> options passed:  -lang-c -v -D__GNUC__=3 -D__GNUC_MINOR__=2
> 
>  -D__GNUC_PATCHLEVEL__=2 -D__GXX_ABI_VERSION=102 -D__ELF__ -Dunix
> ...

 

See the " -mmmx -mno-mmx -m3dnow -mno-3dnow -msse -mno-sse" in options enabled? mmx, sse, 3dnow and such are already enabled by -march=athlon-xp. And no athlon-xp supports sse2. Perhaps the 2400+ supports sse too (as in also...). I'm also not yet sure of the performance gains from using -mfpmath=sse,387. Anybody know of a good way to benchmark this?
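If you want to spot such contradictory pairs in a long "options enabled" dump, a quick Python sketch like this would do. The `conflicting_pairs` name is just something I picked, and it only looks at the -mX / -mno-X spelling.

```python
def conflicting_pairs(options):
    """Return (on, off) pairs where both -mX and -mno-X appear in the text."""
    tokens = options.split()
    seen = set(tokens)
    pairs = []
    for token in tokens:
        if token.startswith("-mno-"):
            on = "-m" + token[len("-mno-"):]
            if on in seen:
                pairs.append((on, token))
    return pairs


# The fragment quoted above from the "options enabled" section.
REPORTED = "-mmmx -mno-mmx -m3dnow -mno-3dnow -msse -mno-sse"

if __name__ == "__main__":
    for on, off in conflicting_pairs(REPORTED):
        print(f"{on} and {off} are both listed")
```

It only tells you that both spellings are listed, not which one gcc ends up honouring.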

----------

## AlterEgo

 *Gnufsh wrote:*   

>  Anybody know of a good way to benchmark this?

 

I've used freebench to test the effect of Cflags. www.freebench.org.

It offers six different benchmarks and a limited online comparison database.

----------

## magnet

I use the -mfpmath=sse,387 thingy.

Let's recompile the whole system; I'll post what happens.

Should I benchmark it before/after? With glxgears maybe?

----------

## magnet

lol this is an answer-before-a-question  :Smile: 

thx.  :Cool: 

----------

## barlad

Yeah, let us know how those p4 optimizations work out please! After having read all those threads about CPU flags I am quite confused, and I am not sure whether I should fully recompile my system or not.

I have been using -march=pentium4 -O3 so far, but I saw there were a lot of other options available. Your benchmarks would help in making a decision!  :Wink: 

----------

## magnet

-march=pentium4 is broken, search for it in the forums.  :Crying or Very sad: 

----------

## barlad

Yeah, I heard about it. I have not had the slightest problem with it though. I have that overflow bug in python/php, but it does not seem to have had any impact so far.

That's why I am waiting on some tests to see if -march=pentium4 is really different from -march=pentium3 -mcpu=pentium4. If everything works fine, I don't want to recompile everything and end up with reduced performance  :Smile: 

----------

## magnet

I'm exactly in the same situation as you.

I've read in some threads that using -march=pentium4 will be slower than using -march=pentium3 -mcpu=pentium4.  :Rolling Eyes: 

----------

## kappax

I am still just a little confused why

 -msse 

would turn on -mno-sse. If I take the time to tell it "-msse", it better damn well be using "-msse".

Anyway, I compiled my kde with

-mmmx -msse -m3dnow and it seems faster.

----------

## kappax

Ooh, one way to test this is to compile mplayer with the flags; if it does not give its spiel about mmx or sse when starting, then it did disable mmx and sse.

----------

## kappax

This is what I get:

```
CPU: Advanced Micro Devices Athlon 4 PM Palomino/Athlon MP Multiprocessor/Athlon XP eXtreme Performance (Family: 6, Stepping: 2)
Detected cache-line size is 64 bytes
SSE supported but disabled
CPUflags:  MMX: 1 MMX2: 1 3DNow: 1 3DNow2: 1 SSE: 0 SSE2: 0
Compiled for x86 CPU with extensions: MMX MMX2 3DNow 3DNowEx
```

with flags

```
CFLAGS="-O3 -march=athlon-xp -pipe -fomit-frame-pointer -ffast-math -mmmx -msse -m3dnow -O3 -mfpmath=sse,387 "
```

Hahah, I have -O3 two times! Ahh!!

----------

## eradicator

According to freehackers.org:

 *Quote:*   

> -mmmx, -msse are implied by -march=pentium3 

 

----------

## Gnufsh

Here are some results from nbench compiled with different flags:

(At the bottom, I did three runs each of two different cflags settings. Note the disparity between the three runs with the same cflags)

-mmmx -mno-mmx -m3dnow -mno-3dnow -msse -mno-sse

 BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          1584.2  :      40.63  :      13.34

STRING SORT         :          106.84  :      47.74  :       7.39

BITFIELD            :      3.9963e+08  :      68.55  :      14.32

FP EMULATION        :          176.08  :      84.49  :      19.50

FOURIER             :           18356  :      20.88  :      11.73

ASSIGNMENT          :          26.736  :     101.74  :      26.39

IDEA                :          3161.9  :      48.36  :      14.36

HUFFMAN             :          1354.6  :      37.56  :      12.00

NEURAL NET          :          33.081  :      53.14  :      22.35

LU DECOMPOSITION    :            1085  :      56.21  :      40.59

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 57.491

FLOATING-POINT INDEX: 39.653

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 14.081

INTEGER INDEX       : 14.549

FLOATING-POINT INDEX: 21.993

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

-O2 -mcpu=i686 -pipe

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          1112.5  :      28.53  :       9.37

STRING SORT         :          115.67  :      51.68  :       8.00

BITFIELD            :        2.95e+08  :      50.60  :      10.57

FP EMULATION        :          69.649  :      33.42  :       7.71

FOURIER             :           18412  :      20.94  :      11.76

ASSIGNMENT          :          18.052  :      68.69  :      17.82

IDEA                :          2114.1  :      32.33  :       9.60

HUFFMAN             :            1126  :      31.23  :       9.97

NEURAL NET          :          28.048  :      45.06  :      18.95

LU DECOMPOSITION    :           919.2  :      47.62  :      34.39

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 40.310

FLOATING-POINT INDEX: 35.549

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 11.464

INTEGER INDEX       : 9.120

FLOATING-POINT INDEX: 19.717

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

-march=athlon-xp -O2 -pipe

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          1557.1  :      39.93  :      13.11

STRING SORT         :          114.28  :      51.06  :       7.90

BITFIELD            :      3.3042e+08  :      56.68  :      11.84

FP EMULATION        :           82.92  :      39.79  :       9.18

FOURIER             :           18349  :      20.87  :      11.72

ASSIGNMENT          :          21.266  :      80.92  :      20.99

IDEA                :          1981.6  :      30.31  :       9.00

HUFFMAN             :          1209.8  :      33.55  :      10.71

NEURAL NET          :          27.523  :      44.21  :      18.60

LU DECOMPOSITION    :          1044.1  :      54.09  :      39.06

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 45.080

FLOATING-POINT INDEX: 36.816

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 12.523

INTEGER INDEX       : 10.380

FLOATING-POINT INDEX: 20.419

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.
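To put numbers on the difference between two of these runs, here is a small Python sketch. The index values are copied from the "ORIGINAL BYTEMARK RESULTS" lines of the -O2 -mcpu=i686 -pipe and -march=athlon-xp -O2 -pipe runs above; the dict layout and helper names are illustrative and not part of nbench. Keep in mind these are single runs, and runs with identical flags can differ noticeably.

```python
# Index values copied from the two nbench runs above.
RUNS = {
    "-O2 -mcpu=i686 -pipe": {"integer": 40.310, "floating_point": 35.549},
    "-march=athlon-xp -O2 -pipe": {"integer": 45.080, "floating_point": 36.816},
}


def percent_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


def compare(baseline_flags, candidate_flags):
    """Percent change per index between two sets of flags in RUNS."""
    base, cand = RUNS[baseline_flags], RUNS[candidate_flags]
    return {name: percent_change(base[name], cand[name]) for name in base}


if __name__ == "__main__":
    deltas = compare("-O2 -mcpu=i686 -pipe", "-march=athlon-xp -O2 -pipe")
    for name, delta in deltas.items():
        print(f"{name}: {delta:+.1f}%")
```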

-march=athlon-xp -O2 -fomit-frame-pointer -pipe

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          1593.6  :      40.87  :      13.42

STRING SORT         :          114.76  :      51.28  :       7.94

BITFIELD            :      3.3532e+08  :      57.52  :      12.01

FP EMULATION        :           89.56  :      42.98  :       9.92

FOURIER             :           18340  :      20.86  :      11.72

ASSIGNMENT          :          20.759  :      78.99  :      20.49

IDEA                :          2061.4  :      31.53  :       9.36

HUFFMAN             :          1210.8  :      33.58  :      10.72

NEURAL NET          :          25.778  :      41.41  :      17.42

LU DECOMPOSITION    :          986.56  :      51.11  :      36.91

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 45.960

FLOATING-POINT INDEX: 35.341

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 12.501

INTEGER INDEX       : 10.751

FLOATING-POINT INDEX: 19.601

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

-march=athlon-xp -O3 -fomit-frame-pointer -pipe

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          1596.5  :      40.94  :      13.45

STRING SORT         :          115.52  :      51.62  :       7.99

BITFIELD            :      3.3783e+08  :      57.95  :      12.10

FP EMULATION        :          134.03  :      64.31  :      14.84

FOURIER             :           18340  :      20.86  :      11.72

ASSIGNMENT          :          20.782  :      79.08  :      20.51

IDEA                :          3173.8  :      48.54  :      14.41

HUFFMAN             :          1196.2  :      33.17  :      10.59

NEURAL NET          :          26.129  :      41.97  :      17.66

LU DECOMPOSITION    :          1019.4  :      52.81  :      38.13

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 51.816

FLOATING-POINT INDEX: 35.890

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 12.565

INTEGER INDEX       : 13.211

FLOATING-POINT INDEX: 19.906

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

-march=athlon-xp -O3 -fomit-frame-pointer -pipe -fprefetch-loop-arrays

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          1581.8  :      40.57  :      13.32

STRING SORT         :          114.72  :      51.26  :       7.93

BITFIELD            :      3.2762e+08  :      56.20  :      11.74

FP EMULATION        :          133.44  :      64.03  :      14.78

FOURIER             :           18269  :      20.78  :      11.67

ASSIGNMENT          :          20.813  :      79.20  :      20.54

IDEA                :          3183.1  :      48.68  :      14.45

HUFFMAN             :          1284.1  :      35.61  :      11.37

NEURAL NET          :          26.179  :      42.05  :      17.69

LU DECOMPOSITION    :          1069.7  :      55.41  :      40.01

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 51.994

FLOATING-POINT INDEX: 36.447

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 12.414

INTEGER INDEX       : 13.411

FLOATING-POINT INDEX: 20.215

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

-march=athlon-xp -O3 -fomit-frame-pointer -pipe -funroll-loops

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :            1605  :      41.16  :      13.52

STRING SORT         :           107.4  :      47.99  :       7.43

BITFIELD            :      3.9864e+08  :      68.38  :      14.28

FP EMULATION        :          182.36  :      87.50  :      20.19

FOURIER             :           18411  :      20.94  :      11.76

ASSIGNMENT          :          26.576  :     101.13  :      26.23

IDEA                :          3182.9  :      48.68  :      14.45

HUFFMAN             :          1366.9  :      37.90  :      12.10

NEURAL NET          :          33.054  :      53.10  :      22.34

LU DECOMPOSITION    :          1006.6  :      52.15  :      37.66

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 57.990

FLOATING-POINT INDEX: 38.703

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 14.066

INTEGER INDEX       : 14.782

FLOATING-POINT INDEX: 21.466

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

-march=athlon-xp -O3 -fomit-frame-pointer -pipe -funroll-loops -finline-functions

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          1601.9  :      41.08  :      13.49

STRING SORT         :           107.2  :      47.90  :       7.41

BITFIELD            :      4.0036e+08  :      68.68  :      14.34

FP EMULATION        :          182.28  :      87.47  :      20.18

FOURIER             :           18364  :      20.88  :      11.73

ASSIGNMENT          :          26.587  :     101.17  :      26.24

IDEA                :          3184.5  :      48.71  :      14.46

HUFFMAN             :          1365.8  :      37.87  :      12.09

NEURAL NET          :          33.027  :      53.06  :      22.32

LU DECOMPOSITION    :          1091.2  :      56.53  :      40.82

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 57.992

FLOATING-POINT INDEX: 39.713

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 14.079

INTEGER INDEX       : 14.773

FLOATING-POINT INDEX: 22.026

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

-march=athlon-xp -O3 -fomit-frame-pointer -pipe -funroll-loops -finline-functions -mfpmath=sse

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          1607.4  :      41.22  :      13.54

STRING SORT         :          107.28  :      47.94  :       7.42

BITFIELD            :      3.9954e+08  :      68.54  :      14.32

FP EMULATION        :          182.48  :      87.56  :      20.21

FOURIER             :           18364  :      20.88  :      11.73

ASSIGNMENT          :          26.558  :     101.06  :      26.21

IDEA                :          3182.9  :      48.68  :      14.45

HUFFMAN             :          1366.3  :      37.89  :      12.10

NEURAL NET          :          33.134  :      53.23  :      22.39

LU DECOMPOSITION    :          1032.7  :      53.50  :      38.63

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 58.009

FLOATING-POINT INDEX: 39.032

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 14.068

INTEGER INDEX       : 14.789

FLOATING-POINT INDEX: 21.649

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

-march=athlon-xp -O3 -fomit-frame-pointer -pipe -funroll-loops -finline-functions -mfpmath=sse,387

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          1612.6  :      41.36  :      13.58

STRING SORT         :          106.92  :      47.77  :       7.39

BITFIELD            :      3.9792e+08  :      68.26  :      14.26

FP EMULATION        :          182.28  :      87.47  :      20.18

FOURIER             :           18380  :      20.90  :      11.74

ASSIGNMENT          :          26.547  :     101.02  :      26.20

IDEA                :          3184.4  :      48.70  :      14.46

HUFFMAN             :          1366.3  :      37.89  :      12.10

NEURAL NET          :           33.16  :      53.27  :      22.41

LU DECOMPOSITION    :          1045.7  :      54.17  :      39.12

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 57.966

FLOATING-POINT INDEX: 39.217

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 14.031

INTEGER INDEX       : 14.799

FLOATING-POINT INDEX: 21.751

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

-march=athlon-xp -O3 -fomit-frame-pointer -pipe -funroll-loops -finline-functions -falign-functions

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          1601.4  :      41.07  :      13.49

STRING SORT         :          107.48  :      48.03  :       7.43

BITFIELD            :      3.9724e+08  :      68.14  :      14.23

FP EMULATION        :          182.28  :      87.47  :      20.18

FOURIER             :           18396  :      20.92  :      11.75

ASSIGNMENT          :          26.515  :     100.90  :      26.17

IDEA                :          3175.1  :      48.56  :      14.42

HUFFMAN             :          1360.4  :      37.72  :      12.05

NEURAL NET          :          32.869  :      52.80  :      22.21

LU DECOMPOSITION    :          1013.3  :      52.49  :      37.91

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 57.867

FLOATING-POINT INDEX: 38.704

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 14.042

INTEGER INDEX       : 14.746

FLOATING-POINT INDEX: 21.467

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

3 runs niced to -19

-march=athlon-xp -O3 -fomit-frame-pointer -pipe -funroll-loops -finline-functions

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          1606.8  :      41.21  :      13.53

STRING SORT         :          107.24  :      47.92  :       7.42

BITFIELD            :      3.9925e+08  :      68.48  :      14.30

FP EMULATION        :          182.88  :      87.75  :      20.25

FOURIER             :           18404  :      20.93  :      11.76

ASSIGNMENT          :          26.624  :     101.31  :      26.28

IDEA                :          3191.1  :      48.81  :      14.49

HUFFMAN             :            1368  :      37.93  :      12.11

NEURAL NET          :          33.094  :      53.16  :      22.36

LU DECOMPOSITION    :          1024.6  :      53.08  :      38.33

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 58.066

FLOATING-POINT INDEX: 38.943

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 14.074

INTEGER INDEX       : 14.810

FLOATING-POINT INDEX: 21.599

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          1606.4  :      41.20  :      13.53

STRING SORT         :           107.8  :      48.17  :       7.46

BITFIELD            :       3.983e+08  :      68.32  :      14.27

FP EMULATION        :          183.04  :      87.83  :      20.27

FOURIER             :           18411  :      20.94  :      11.76

ASSIGNMENT          :          26.608  :     101.25  :      26.26

IDEA                :          3191.1  :      48.81  :      14.49

HUFFMAN             :          1368.5  :      37.95  :      12.12

NEURAL NET          :          33.107  :      53.18  :      22.37

LU DECOMPOSITION    :          1028.2  :      53.26  :      38.46

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 58.093

FLOATING-POINT INDEX: 38.998

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 14.085

INTEGER INDEX       : 14.813

FLOATING-POINT INDEX: 21.630

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          1597.6  :      40.97  :      13.46

STRING SORT         :           107.4  :      47.99  :       7.43

BITFIELD            :      3.9998e+08  :      68.61  :      14.33

FP EMULATION        :          182.92  :      87.77  :      20.25

FOURIER             :           18405  :      20.93  :      11.76

ASSIGNMENT          :          26.603  :     101.23  :      26.26

IDEA                :          3189.8  :      48.79  :      14.49

HUFFMAN             :          1367.4  :      37.92  :      12.11

NEURAL NET          :          33.107  :      53.18  :      22.37

LU DECOMPOSITION    :          1021.6  :      52.92  :      38.22

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 58.035

FLOATING-POINT INDEX: 38.910

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 14.086

INTEGER INDEX       : 14.786

FLOATING-POINT INDEX: 21.581

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

3 runs niced to -19

-march=athlon-xp -O3 -fomit-frame-pointer -pipe -funroll-loops -finline-functions -mfpmath=sse,387

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          1604.6  :      41.15  :      13.52

STRING SORT         :          108.71  :      48.58  :       7.52

BITFIELD            :       4.009e+08  :      68.77  :      14.36

FP EMULATION        :          182.96  :      87.79  :      20.26

FOURIER             :           18324  :      20.84  :      11.70

ASSIGNMENT          :          26.685  :     101.54  :      26.34

IDEA                :          3188.5  :      48.77  :      14.48

HUFFMAN             :            1368  :      37.93  :      12.11

NEURAL NET          :          33.227  :      53.38  :      22.45

LU DECOMPOSITION    :          1030.6  :      53.39  :      38.55

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 58.219

FLOATING-POINT INDEX: 39.013

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 14.169

INTEGER INDEX       : 14.803

FLOATING-POINT INDEX: 21.638

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :          1614.9  :      41.41  :      13.60

STRING SORT         :          108.04  :      48.28  :       7.47

BITFIELD            :      3.9851e+08  :      68.36  :      14.28

FP EMULATION        :          182.96  :      87.79  :      20.26

FOURIER             :           18364  :      20.88  :      11.73

ASSIGNMENT          :          26.685  :     101.54  :      26.34

IDEA                :          3189.8  :      48.79  :      14.49

HUFFMAN             :          1368.5  :      37.95  :      12.12

NEURAL NET          :           33.24  :      53.40  :      22.46

LU DECOMPOSITION    :          1046.8  :      54.23  :      39.16

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 58.177

FLOATING-POINT INDEX: 39.250

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 14.111

INTEGER INDEX       : 14.830

FLOATING-POINT INDEX: 21.770

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Index-split by Andrew D. Balsa (11/97)

Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index

                    :                  : Pentium 90* : AMD K6/233*

--------------------:------------------:-------------:------------

NUMERIC SORT        :            1613  :      41.37  :      13.59

STRING SORT         :          107.44  :      48.01  :       7.43

BITFIELD            :      4.0225e+08  :      69.00  :      14.41

FP EMULATION        :          182.92  :      87.77  :      20.25

FOURIER             :           18411  :      20.94  :      11.76

ASSIGNMENT          :          26.635  :     101.35  :      26.29

IDEA                :          3191.1  :      48.81  :      14.49

HUFFMAN             :          1368.5  :      37.95  :      12.12

NEURAL NET          :          33.227  :      53.38  :      22.45

LU DECOMPOSITION    :          1092.4  :      56.59  :      40.86

==========================ORIGINAL BYTEMARK RESULTS==========================

INTEGER INDEX       : 58.185

FLOATING-POINT INDEX: 39.842

Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0

==============================LINUX DATA BELOW===============================

C compiler          : 3.2.2

libc                : unknown version

MEMORY INDEX        : 14.120

INTEGER INDEX       : 14.826

FLOATING-POINT INDEX: 22.098

Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

* Trademarks are property of their respective holder.
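Single runs vary by about as much as the flags do, so averaging the niced runs before comparing may give a fairer picture. Below is a minimal Python sketch using the three FLOATING-POINT INDEX values (new Linux index) from each niced set above. It assumes each summary block belongs to the flags line printed above its table, and the helper name `percent_change` is only for illustration.

```python
from statistics import mean

# FLOATING-POINT INDEX (Linux data) from the three niced runs of each flag set
fp_inline = [21.599, 21.630, 21.581]   # ... -funroll-loops -finline-functions
fp_sse387 = [21.638, 21.770, 22.098]   # ... -finline-functions -mfpmath=sse,387

def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

base = mean(fp_inline)
test = mean(fp_sse387)
print(f"baseline {base:.3f}, sse,387 {test:.3f}, "
      f"change {percent_change(base, test):+.2f}%")
```

On these numbers the averaged gain is roughly 1%, which is smaller than the spread between the individual sse,387 runs (21.638 to 22.098), so three runs are probably not enough to call it either way.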

----------

## Malakin

Here is a simple test that proves -march=athlon-xp enables sse, mmx and 3dnow support.

What I've done is emerge "jpeg" with a bunch of different CFLAGS, checking libjpeg.so.62.0.0 each time to see if its md5sum has changed. If it hasn't changed, then the added flag made no difference to the code gcc generated.

I'm using a base of "-march=athlon-xp -O2 -pipe" and adding to it.

Here are the results (YES means the md5sum changed, NO means it stayed the same):

NO -msse

NO -mmmx

NO -m3dnow

YES -mfpmath=sse,387

I then tried all the flags at the same time; the md5 was the same as with just -mfpmath=sse,387 (as expected).

This proves that sse, mmx and 3dnow support are all enabled with -march=athlon-xp.

So in the end, if you're using -march=athlon-xp, don't worry about all this other stuff because it doesn't make any difference.
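The comparison itself is easy to script. Here is a rough Python sketch of the idea, assuming a copy of the library is saved after each build; the function names are made up for illustration. Note that an unchanged md5 only says the flag didn't alter this particular library.

```python
import hashlib

def md5sum(path):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def flag_changed_output(baseline_lib, rebuilt_lib):
    """True if the rebuilt library differs from the baseline build."""
    return md5sum(baseline_lib) != md5sum(rebuilt_lib)
```

For example, `flag_changed_output("libjpeg.base.so", "libjpeg.msse.so")` would compare two saved copies of libjpeg.so.62.0.0 (both file names are hypothetical).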

Comments on Gnufsh's testing:

-fprefetch and -falign-functions are enabled with -O2.

-finline-functions is enabled with -O3.

I doubt using -mfpmath=sse,387 makes any actual performance difference with anything, someone please prove me wrong.

----------

## TheCoop

...

----------

## Gnufsh

 *Malakin wrote:*   

> 
> 
> Comments on Gnufsh's testing:
> 
> -fprefetch and -falign-functions are enabled with -O2.
> ...

 

Here's my gcc -Q -v output for -march=athlon-xp -O3 -fomit-frame-pointer on a test file:

options passed:  -lang-c -v -D__GNUC__=3 -D__GNUC_MINOR__=2

 -D__GNUC_PATCHLEVEL__=2 -D__GXX_ABI_VERSION=102 -D__ELF__ -Dunix

 -D__gnu_linux__ -Dlinux -D__ELF__ -D__unix__ -D__gnu_linux__ -D__linux__

 -D__unix -D__linux -Asystem=posix -D__OPTIMIZE__ -D__STDC_HOSTED__=1

 -Acpu=i386 -Amachine=i386 -Di386 -D__i386 -D__i386__ -D__athlon

 -D__athlon__ -D__athlon_sse__ -D__tune_athlon__ -D__tune_athlon_sse__

 -D__SSE__ -D__MMX__ -D__3dNOW__ -D__3dNOW_A__ -march=athlon-xp -O3

 -fomit-frame-pointer

options enabled:  -fdefer-pop -fomit-frame-pointer -foptimize-sibling-calls

 -fcse-follow-jumps -fcse-skip-blocks -fexpensive-optimizations

 -fthread-jumps -fstrength-reduce -fpeephole -fforce-mem -ffunction-cse

 -fkeep-static-consts -fcaller-saves -fpcc-struct-return -fgcse -fgcse-lm

 -fgcse-sm -frerun-cse-after-loop -frerun-loop-opt

 -fdelete-null-pointer-checks -fschedule-insns2 -fsched-interblock

 -fsched-spec -fbranch-count-reg -freorder-blocks -frename-registers

 -fcprop-registers -fcommon -fgnu-linker -fregmove -foptimize-register-move

 -fargument-alias -fstrict-aliasing -fmerge-constants -fident -fpeephole2

 -fguess-branch-probability -fmath-errno -ftrapping-math -m80387

 -mhard-float -mno-soft-float -mieee-fp -mfp-ret-in-387 -mcpu=athlon-xp

 -march=athlon-xp

It doesn't show -finline-functions, -fprefetch-loop-arrays, or -falign-functions. Also, so far using freebench I see different results by adding -fprefetch-loop-arrays on some of the tests.
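To pick the instruction-set macros out of an "options passed" dump like the one above without reading it by eye, something like this Python sketch could be used. The function name `defined_macros` is just for illustration, and the sample string is an excerpt of the output above.

```python
def defined_macros(options_passed):
    """Collect macro names from -D options, dropping any =value part."""
    macros = set()
    for token in options_passed.split():
        if token.startswith("-D"):
            macros.add(token[2:].split("=", 1)[0])
    return macros

# Excerpt of the "options passed" text for -march=athlon-xp
sample = ("-D__athlon__ -D__athlon_sse__ -D__SSE__ -D__MMX__ "
          "-D__3dNOW__ -D__3dNOW_A__ -march=athlon-xp -O3")

wanted = {"__MMX__", "__SSE__", "__SSE2__", "__3dNOW__"}
simd = sorted(m for m in defined_macros(sample) if m in wanted)
print(simd)   # ['__3dNOW__', '__MMX__', '__SSE__']
```

`__SSE2__` is absent from the result, which matches the dump: -march=athlon-xp defines the MMX, SSE and 3DNow! macros but no SSE2 one.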

----------

## Malakin

 *Quote:*   

>  It doesn't show -finline-functions, -fprefetch-loop-arrays, or -falign-functions. Also, so far using freebench I see different results by adding -fprefetch-loop-arrays on some of the tests.

 It's possible the manual is wrong.

http://gcc.gnu.org/onlinedocs/gcc-3.2.2/gcc/Optimize-Options.html#Optimize%20Options

Near the top there's this: *Quote:*   

> -O3
> 
>     Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions and -frename-registers options.

 

Scroll down the page a bit and you'll see this: *Quote:*   

> The following options control specific optimizations. The -O2 option turns on all of these optimizations except -funroll-loops and -funroll-all-loops. On most machines, the -O option turns on the -fthread-jumps and -fdelayed-branch options, but specific machines may handle it differently

 

Ok I decided to test them out before posting this.

-falign-functions is included in -O2 and -finline-functions is included in -O3, but -fprefetch-loop-arrays isn't included in -O2, so the manual is wrong on that one. I used the same test as I did for the other stuff.

----------

## Gnufsh

Thanks for checking that one. I believe functions are aligned to 4 by default on x86. Right now, I've got -falign-functions=5. One of the freebench benchmarks does best with functions aligned to 64.
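A value like 5 looks odd, but going by the gcc manual's description of -falign-functions=n (align to the next power of two at or above n, but only when fewer than n bytes have to be skipped), it works out as below. This is a Python sketch of that reading of the manual, not something checked against gcc's source; `align_boundary` is a made-up name.

```python
def align_boundary(n):
    """Smallest power of two >= n, and the most bytes that may be skipped (n - 1)."""
    boundary = 1
    while boundary < n:
        boundary *= 2
    return boundary, n - 1

for n in (4, 5, 64):
    boundary, max_skip = align_boundary(n)
    print(f"-falign-functions={n}: {boundary}-byte boundary, "
          f"skipping at most {max_skip} bytes")
```

Under that reading, =5 means an 8-byte boundary that is only used when it can be reached by skipping 4 bytes or fewer.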

----------

## wrc1944

Hmmmm.... This gets more and more confusing.

According to man gcc, the -mcpu=athlon-xp flag is not redundant, and without specifically using it, the compiler will not generate code which will not run on the i386, even with the -march=athlon-xp flag included. Apparently, this means that with only the -march=athlon-xp flag specified, gcc omits some features specific to the athlon-xp cpu. I assume this would hold true for any specific cpu.

I've also read other places that unless you specifically add the -msse, -mmmx, -m3dnow flags, they won't be included, which in a way seems to reflect what the man gcc info says. Apparently, even though we all have thought  -march=athlon-xp implies all the other flags, without specifying them and the cpu individually, it doesn't generate code that won't run on the i386. At least that's my understanding.

I certainly don't know firsthand myself- I just try and understand what I'm reading, and draw logical conclusions.

wrc1944

----------

## TheCoop

this is very confusing...

typical oss/gpl documentation...

I wonder if I should poke around the code and gcc irc channels and ask a few ppl?

----------

## Gnufsh

-march=athlon-xp generates code that uses mmx, sse, and 3dnow without any additional flags needing to be specified. Code with -march=athlon-xp will not run correctly on a machine without these instructions. (I tried on a machine that didn't support sse, and anything that used sse failed to work properly. Apparently my laptop doesn't support sse even though the processor (athlon-xp thoroughbred) does. I think the chipset or bios is responsible.) -mcpu=athlon-xp generates code that will run on an i386.

----------

## kappax

so in the end how do I get sse, 3dnow and mmx on my XP1600+?

----------

## Gnufsh

-march=athlon-xp should be all you need. It enables the macros that generate sse, mmx, and 3dnow code.

----------

## kappax

 *Gnufsh wrote:*   

> -march=athlon-xp should be all you need. It enables the macros that generate sse, mmx, and 3dnow code.

 

so what is this going to do for me ?

```

CFLAGS="-march=athlon-xp -O3 -fomit-frame-pointer -pipe -ffast-math -fprefetch-loop-arrays -funroll-loops -finline-functions -falign-jumps=4 -falign-loops=4 -falign-functions=64  -fforce-addr -mmmx -msse -m3dnow -mfpmath=sse,387"

```

----------

## pagal

Hi,

I'm going to do an emerge -e world and before that I thought I should optimize my CFLAGS as well as USE FLAGS...can anyone help?

--------------------------------------------------------------------------------------

cat /proc/cpuinfo

processor       : 0

vendor_id       : GenuineIntel

cpu family      : 6

model           : 8

model name      : Pentium III (Coppermine)

stepping        : 6

cpu MHz         : 866.708

cache size      : 256 KB

fdiv_bug        : no

hlt_bug         : no

f00f_bug        : no

coma_bug        : no

fpu             : yes

fpu_exception   : yes

cpuid level     : 2

wp              : yes

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse

bogomips        : 1730.15

--------------------------------------------------------------------------------------

USE="#USE="X gtk gnome -alsa"

USE="gnome -kde -qt arts -nls python perl oggvorbis opengl sdl -postgres jpeg png truetype xml xml2 dvd avi aalib mpeg encode fbcon mmx"

--------------------------------------------------------------------------------------

I use gnome and also use arts instead of alsa.

any help would be appreciated.

Thanks.

----------

## TheCoop

add 'sse' to the cflags

----------

## Gnufsh

 *Quote:*   

> 
> 
> CFLAGS="-march=athlon-xp -O3 -fomit-frame-pointer -pipe -ffast-math -fprefetch-loop-arrays -funroll-loops -finline-functions -falign-jumps=4 -falign-loops=4 -falign-functions=64  -fforce-addr -mmmx -msse -m3dnow -mfpmath=sse,387"

should be rather fast. As we've pointed out, -march=athlon-xp enables mmx, sse, and 3dnow, so there is no point in specifying -mmmx -msse and -m3dnow. I don't think they do any harm, tho. -ffast-math might cause problems for anything that needs accurate math. I just recompiled with: CFLAGS="-march=athlon-xp -O3 -fomit-frame-pointer -pipe -fprefetch-loop-arrays -funroll-loops -falign-jumps=4 -falign-loops=4 -falign-functions=5 -fforce-addr" and everything seems fine so far.
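For anyone tidying up a long CFLAGS line, dropping the flags that -march=athlon-xp already implies can be sketched in a few lines of Python. The set below contains only the three flags this thread tested, applies to athlon-xp only, and the function name is invented for illustration.

```python
# Flags this thread found to be implied by -march=athlon-xp
# (assumption: applies to athlon-xp only; other -march values differ).
IMPLIED_BY_ATHLON_XP = {"-mmmx", "-msse", "-m3dnow"}

def drop_redundant(cflags):
    """Remove flags implied by -march=athlon-xp, keeping the original order."""
    flags = cflags.split()
    if "-march=athlon-xp" not in flags:
        return cflags  # leave anything else untouched
    return " ".join(f for f in flags if f not in IMPLIED_BY_ATHLON_XP)

print(drop_redundant(
    "-march=athlon-xp -O3 -pipe -mmmx -msse -m3dnow -mfpmath=sse,387"))
# -march=athlon-xp -O3 -pipe -mfpmath=sse,387
```

-mfpmath=sse,387 is deliberately left alone, since that one did change the generated code in the md5 test.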

pagal: -march=pentium3 -O3 -fomit-frame-pointer -pipe is a start, I don't know how your processor will fare with the other settings. You may even drop back to -O2, because of the smaller L1 cache (as compared to the athlon, which benefits more from function inlining). -fprefetch-loop-arrays will probably help (possibly more than it does on the athlon). I just checked, and -march=pentium3 enables -D__SSE__ and -D__MMX__, so your sse and mmx instructions should get used without any extra flags (other than -march=pentium3).

wrc1944: I think you have it backwards. -mcpu=athlon-xp will generate code optimized for an athlon-xp, but still able to run on an i386. -march=athlon-xp implies -mcpu (according to both the docs and my testing), while also enabling features that break support for other cpus (mmx, sse, 3dnow, etc.) 

edit: for some reason [/quote] magically appeared at the end of my message. Why? My quote is closed? Where did it come from? What does it want?

----------

## kappax

 *Gnufsh wrote:*   

>  *Quote:*   
> 
> CFLAGS="-march=athlon-xp -O3 -fomit-frame-pointer -pipe -ffast-math -fprefetch-loop-arrays -funroll-loops -finline-functions -falign-jumps=4 -falign-loops=4 -falign-functions=64  -fforce-addr -mmmx -msse -m3dnow -mfpmath=sse,387"  should be rather fast. As we've pointed out, -march=athlon-xp enables mmx, sse, and 3dnow, so there is no point in specifying -mmmx -msse and -m3dnow. I don't think they do any harm, tho. -ffast-math might cause problems for anything that needs accurate math. I just recompiled with: CFLAGS="-march=athlon-xp -O3 -fomit-frame-pointer -pipe -fprefetch-loop-arrays -funroll-loops -falign-jumps=4 -falign-loops=4 -falign-functions=5 -fforce-addr" and everything seems fine so far.
> 
> pagal: -march=pentium3 -O3 -fomit-frame-pointer -pipe is a start, I don't know how your processor will fare with the other settings. You may even drop back to -O2, because of the smaller L1 cache (as compared to the athlon, which benefits more from function inlining). -fprefetch-loop-arrays will probably help (possibly more than it does on the athlon). I just checked, and -march=pentium3 enables -D__SSE__ and -D__MMX__, so your sse and mmx instructions should get used without any extra flags (other than -march=pentium3).
> ...

 

wee, I dropped the flags so now I have:

```

CFLAGS="-march=athlon-xp -O3 -fomit-frame-pointer -pipe -ffast-math -fprefetch-loop-arrays -funroll-loops -finline-functions -falign-jumps=4 -falign-loops=4 -falign-functions=64  -fforce-addr -mfpmath=sse,387"

```

oh and I was reading up on USE, it seemed that X was not using sse or mmx, but now it does

```

USE="-3dfx 3dnow mmx sse alsa cups kde gnome opengl samba"

```

----------

## Gnufsh

Yeah, you should put sse, mmx, and 3dnow in your USE="..."; for some reason xfree only builds with those if they're in the USE variable.
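Which of those three actually apply to a given box can be read off the `flags` line in /proc/cpuinfo. A small Python sketch, assuming the USE flag names match the cpuinfo names (mmx, sse, 3dnow); the function name is made up, and the sample line is the Celeron one posted earlier in the thread.

```python
SIMD_USE_FLAGS = ("mmx", "sse", "3dnow")  # the USE flags named in this thread

def simd_use_flags(cpuinfo_text):
    """Return the SIMD USE flags that the CPU's flags line advertises."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            cpu_flags = set(line.split(":", 1)[1].split())
            return [f for f in SIMD_USE_FLAGS if f in cpu_flags]
    return []  # no flags line found

sample = ("flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr "
          "pge mca cmov pat pse36 mmx fxsr sse")
print(simd_use_flags(sample))   # ['mmx', 'sse']
```

On a real system the input would be the contents of /proc/cpuinfo, e.g. `simd_use_flags(open("/proc/cpuinfo").read())`.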

----------

## xaviorm

Since I'm now completely confused. What should my CFLAGS be? My cpu info flags are:

flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm

Also, should I be enabling acpi in the kernel? Any place I can go to read up on what all these flags are and how to utilize them?

----------

## defconfoo

Actually, that's incorrect.  All you need is -march=<proc> OR -msse -mmmx, etc., for the compiler to "allow" for those types of instructions.  Also, you need to set -mfpmath=sse for those instructions to be created automatically.

If you want to test me, go for it.  :)

Compile a program with some C code and just use the following flags:

-march=pentium3 OR pentium4 OR athlon*

-mfpmath=sse

-O2

You don't need to specify -msse or -mmmx.
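One way to run that test is to have gcc emit assembly with `-S` (for example `gcc -march=pentium3 -mfpmath=sse -O2 -S test.c`, where test.c is any small floating point program) and look for SSE registers in the result. The Python sketch below counts such lines; the function name is invented, and the sample is hand-written AT&T-style assembly for illustration, not real compiler output.

```python
import re

# 32-bit x86 has eight SSE registers, xmm0 through xmm7
XMM_REGISTER = re.compile(r"%xmm[0-7]\b")

def count_sse_lines(asm_text):
    """Count assembly lines that reference an SSE register (AT&T syntax)."""
    return sum(1 for line in asm_text.splitlines()
               if XMM_REGISTER.search(line))

# Hand-written illustration: three SSE instructions and one x87 load
sample = """\
        movss   8(%ebp), %xmm0
        mulss   12(%ebp), %xmm0
        movss   %xmm0, -4(%ebp)
        flds    -4(%ebp)
"""
print(count_sse_lines(sample))   # 3
```

A count of zero on real output would suggest the floating point code is still going through the 387 unit.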

----------

## defconfoo

Also, a lot of people put optimizations in their cflags that aren't necessary at all.

-march=<proc> (BTW, this is the minimum processor the compiled code will run on) This would activate the flags your processor supports INCLUDING -mmmx, -msse, -msse2, -maltivec, etc.  It may not show them with gcc -v -Q tests, or it may even show something like -mmmx -mno-mmx, but they are nevertheless still active and code is still generated for those instructions.

-mcpu=<proc> (BTW, this option specifies to optimize for this processor but still support execution on other processors) This would most likely NOT activate flags such as -mmmx, -msse, -msse2, -maltivec, etc., unless the processor given to -march specifies so...

-mfpmath=sse or sse,387

The former generates sse code (note that either march OR msse is also required) for floating point code.  The latter generates instructions for both functional units, but I've analyzed the code thoroughly, and it sometimes behaves funny.  Trust me, if you are using a pentium3 or 4, you want to stay as far away from the ordinary 387 fpu as possible.  Leave it for specialty instructions, because it just doesn't compare.  For athlons, I'd use sse,387 instead.

-malign-double or -m128-bit-something-align :p

Stay away from these options please! They break code left and right. Ever get errors such as every file being the same size in TERABYTES!? It's most likely due to this flag...

-mno-push-args

You could specify this. It's not going to make that much of a difference in speed. It reduces dependencies as opposed to a series of push instructions, BUT it greatly decreases decoding bandwidth. Personally, I'd stick with the pushes. Why? Because on the pentium3, the 4-1-1 decoding rule makes a series of complex mov instructions prohibitive.  It'll only decode one per clock cycle, so that shoots that right there.  Second, on the pentium4, the lack of specialized address generation units means that all such instructions are decomposed into micro-ops anyway, and the way in which the p4 breaks down such a mov as opposed to a push makes the push more efficient.

-maccumulate-outgoing-args

On all the x86 systems I've tested, this doesn't do a damn thing. :p It doesn't matter, -fdefer-pop will accomplish a similar thing, but it's automatically enabled with -O2, -O3, and -Os...soooo.... ;)

-mpreferred-stack-boundary

Leave this option alone. The -O series of flags take care of it.

I'm gonna do another post on the -f series of options...

I hope this helps. :)

----------

## defconfoo

First and foremost, avoid SSA optimizations.  They're not ready for system building yet.  Also, you don't need to specify -fmove-all-movables or -freduce-all-givs.  These options perform a similar optimization as -fstrength-reduce, but the latter is much better and is automatically enabled at -O2, -O3, and -Os (for a reason ;D).

As far as the other -f flags go, only a few are not activated by the -O series of optimizations.

-Os enables all important optimizations, plus performs an extra pass to replace certain groups of instructions with smaller instructions that perform the same task. This is actually a very good flag. It optimizes well, and I'd use it, especially on very large libraries/kernel.

-O2 enables the same optimizations as -Os, but does not perform the extra instruction-compact pass. Most importantly, -O2 enables alignment.  This includes stack alignment, function alignment, jump alignment, loop alignment, and label alignment (Thanks Bedeox).  That means specifying the alignment for these flags is unnecessary. -O2 and -O3 will automatically align the aforementioned to their defaults (which are very good, and tuned to the cache-line-length of the processor, among other smaller details, so do NOT change these unless you know what you're doing).

-O3 enables everything in -O2 in addition to -frename-registers and -finline-functions.  This inlines ALL functions that reach certain heuristically defined criteria.  (Note that -O2 also inlines functions, but only those that have the inline keyword in their prototype) Use -finline-limit to control the amount of inlining (I'd stick with default, it's there for a reason). :)

-fomit-frame-pointer

For x86, you need this, because -Ox doesn't enable it by default.

-ffast-math

For x86, you *might* want this, because -Ox doesn't enable it by default.  For non-critical applications, go for it. :) BTW, this option enables 3 other -f optimizations. If you use -ffast-math, you don't need 'em.
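To make the trade-off concrete, here is a minimal C sketch of the kind of result -ffast-math is allowed to change. Floating-point addition is not associative, and the flag lets the compiler reorder sums as if it were. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: adds the values first to last. */
double sum_forward(const double *v, size_t n)
{
    assert(v != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Hypothetical helper: adds the same values last to first. */
double sum_backward(const double *v, size_t n)
{
    assert(v != NULL);
    double total = 0.0;
    for (size_t i = n; i > 0; i--)
        total += v[i - 1];
    return total;
}
```

With the values {1e20, -1e20, 1.0} and strict IEEE double arithmetic, sum_forward gives 1.0 and sum_backward gives 0.0, because the 1.0 is lost when it is added to -1e20 first. Under -ffast-math the compiler may treat both orders as interchangeable, so code that depends on one particular order can change its answer.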

-fprefetch-loop-arrays

This speeds up execution somewhat for large arrays on platforms that support it.  I'm not entirely sure, but I'm almost positive it only works on machines with SSE support. (P3/4, Athlon4/XP)
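As a rough picture of what the flag asks the compiler to do on its own, here is a hand-written sketch using GCC's `__builtin_prefetch` extension. The helper name and the prefetch distance are assumptions chosen for illustration, not values taken from GCC.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value for illustration: how many elements ahead to prefetch. */
#define PREFETCH_AHEAD 16

/* Hypothetical helper: sums an int array while hinting upcoming
   elements into the cache. */
long sum_with_prefetch(const int *array, size_t count)
{
    assert(array != NULL || count == 0);
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        /* Only form addresses that lie inside the array. */
        if (i + PREFETCH_AHEAD < count)
            __builtin_prefetch(&array[i + PREFETCH_AHEAD]);
        total += array[i];
    }
    return total;
}
```

The prefetch is only a hint, so the result is the same as a plain loop. Whether it is any faster depends on the CPU and on how large the array is.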

-fmerge-all-constants

This reduces the size of your data and text segments by a very small amount, but it helps, so why not? It eliminates redundancy, but it is non-ANSI-C compliant. Don't worry about that, turn it on if you want it. :p
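The non-compliance comes from the fact that C requires distinct objects to have distinct addresses. A small sketch of code that could notice the difference, with made-up table and function names:

```c
#include <assert.h>
#include <stddef.h>

/* Two separate read-only tables with identical contents (example data). */
const int table_a[4] = {1, 2, 3, 4};
const int table_b[4] = {1, 2, 3, 4};

/* Hypothetical helper: returns 1 if both pointers refer to the same storage. */
int tables_share_storage(const int *a, const int *b)
{
    assert(a != NULL && b != NULL);
    return a == b;
}
```

In a conforming build, tables_share_storage(table_a, table_b) is 0. With -fmerge-all-constants the two tables may be folded into one, so a program that relies on them being different objects could misbehave. The compiler may also evaluate such a comparison at compile time, so treat this as an illustration rather than a reliable detector.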

As far as all the alignment options go, let the -Ox flags control it.  They know the best values to use. If you're curious about other optimizations, trust me, the chances are higher that they'll break some package in your system than that they'll increase overall system speed by more than 0.5-1%.

Last edited by defconfoo on Mon Apr 14, 2003 3:01 am; edited 1 time in total

----------

## defconfoo

Oh yeah, what the guy above said about putting sse, mmx, and/or 3dnow in your use flags, you should.  Some packages have specific, highly optimized routines which utilize these instructions that are only enabled at compile time but are not generated automatically by the compiler.

But I rreeaaally  got to run now...I'm late for class. :-p

----------

## Lovechild

defconfoo... that was an awesome walkthrough..

by any chance do your studies have anything to do with compiler design  :Wink: 

----------

## defconfoo

Hehe...thanks. :)

Actually, I'm into computer architecture.

Oh, I forgot to write about an important flag because I was in a rush:

-funroll-loops or -funroll-all-loops

These flags are overrated.  I've studied what the gcc compiler does, and it in no way unrolls loops in an efficient manner.  For instance, if you take the following loop:

int total = 0, *array....

for (int i = 0; i < yadayada; i++) {

  total += array[i];

}

GCC will strength reduce this and result in a very well optimized, tight loop, but if you enable unrolling, all it will do is the equivalent of this in assembly (if yadayada is not known :p):

for (int i = 0; i < yadayada; i++) {

  total += array[i++];

  if (i >= yadayada) break;

  total += array[i];

}

This offers absolutely NO speed up. In fact, it would slow things down because the loop would take up more space in the cache. Sometimes, unrolling loops is beneficial...like if the number of iterations is known AND is small (which most of the time, this isn't the case). GCC has a tendency to unroll the entire loop and not change it into a larger loop without as many iterations...for instance, this would yield a higher degree of parallelism (assuming yadayada is even):

for (int i = 0; i < yadayada; i++) {

  total1 += array[i++];

  total2 += array[i];

}

total = total1 + total2;

It doesn't do that, and as far as I know, it isn't capable of doing that.  This is a really silly example, but I think it makes the point. :\ I think that's why they put the warning in the manual about it slowing down code.
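For reference, a hand-unrolled version of the two-accumulator idea that also copes with an odd element count could look like the sketch below. The function name is hypothetical, and this is what a programmer would write by hand, not what GCC emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums an array two elements per iteration using
   independent accumulators, then handles a possible leftover element. */
int sum_two_accumulators(const int *array, size_t count)
{
    assert(array != NULL || count == 0);
    int total1 = 0, total2 = 0;
    size_t i = 0;

    /* Main loop: two independent additions per pass. */
    for (; i + 1 < count; i += 2) {
        total1 += array[i];
        total2 += array[i + 1];
    }

    /* Odd count: one element is left over. */
    if (i < count)
        total1 += array[i];

    return total1 + total2;
}
```

The two running totals do not depend on each other, so the CPU can overlap the additions, which is the parallelism described above.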

For a stupid story, one time I built a system using -O3 -finline-limit=1200 -funroll-all-loops.  Just booting into X, running GAIM and Mozilla took up ~180 megabytes of memory :p.  It wasn't pretty...

Alrighty, I'm tired...too many all-nighters in a row are killing me. :p

Got to nap... :D

----------

## Bedeox

Welcome all!

@defconfoo: -O2 enables label alignment - from GCC manpage: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

-Os isn't so good for libs if you have >=256 MB ram and reasonably fast HDD

even if you run lots of apps simultaneously (most of the code will be shared)

@kappax: why do you specify -falign* flags? They are properly set (read fastest) using -mcpu and/or -march defaults.

----------

## defconfoo

Oh...ok.

On my system, when I compile with -O2 or above, functions and loops are aligned to 16 bytes (which is the best for x86), but in the assembler output, all jumps related to if's and gotos are not aligned.  Maybe -O2 enables alignment, but for x86, the alignment default is 1?

----------

## floam

defconfoo: That walkthrough was excellent, you should post it up somewhere on the web where more people can read it.

----------

## ERW1N

great info defconfoo  :Wink: 

so, which one do you think is better? -O2 or -O3 ?

since -O3 only turns on 2 flags: -finline-functions and -frename-registers, and from above post somebody said that -finline-functions is good and -frename-registers is not suitable for x86 arch....

and how bout

-falign-jumps

-falign-loops

-falign-functions ?

----------

## Bedeox

-O3 makes compilation much more memory intensive

and it might create a problem with some apps

(due to inlining, counter it with -fno-inline-functions)

It is faster on CPUs with large cache,

but might be slower on the ones with smaller cache

-frename-registers won't do much on x86, but it will help performance

@defconfoo: Which march/mcpu are you using?

----------

## defconfoo

-march=pentium3 for my home computer

-march=athlon-xp (it's actually an Athlon 4) for my laptop

But I didn't compile everything from scratch on my laptop.  I gave up after 4 days. =)

----------

## Gnufsh

So functions are aligned to 16 by default on x86, is that the optimum for the athlonxp, with its big L1 cache? Should they be aligned to 64byte boundaries?

----------

## taskara

 *Gnufsh wrote:*   

> So functions are aligned to 16 by default on x86, is that the optimum for the athlonxp, with its big L1 cache? Should they be aligned to 64byte boundaries?

 

if that is the case, then surely it should be set to 64 bytes!

so therefore amd users should add the CFLAG 

```
 -falign-functions=64
```

 to make.conf

agreed ?
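One thing worth keeping in mind is that the number given to -falign-functions is a byte boundary for the start of each function, not a cache size in KB. A minimal sketch for checking where a function actually lands, using GCC's `aligned` attribute on a single function (the helper and function names are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when address is a multiple of boundary.
   boundary must be a power of two. */
int is_aligned(uintptr_t address, size_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (address & (boundary - 1)) == 0;
}

/* GCC attribute asking for 64-byte alignment of this one function only. */
__attribute__((aligned(64))) int example_function(int x)
{
    return x + 1;
}
```

Calling is_aligned((uintptr_t)&example_function, 64) shows whether the request was honoured. The same check on ordinary functions shows what a given set of CFLAGS really produced.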

----------

## ghetto

 *taskara wrote:*   

> if that is the case, then surely it should be set to 64 bytes!
> 
> so therefore amd users should add the CFLAG 
> 
> ```
> ...

 

Does that go for all AMD users? Or just amd-xp.

I have just a plain amd athlon (not thunderbird) ..what would I set it to?

cat /proc/cpuinfo

vendor_id       : AuthenticAMD

cpu family      : 6

model           : 4

model name      : AMD Athlon(tm) Processor

stepping        : 2

cpu MHz         : 1009.000

cache size      : 256 KB

----------

## taskara

that should go for all athlons because they all have 64kb level 1 cache  :Smile: 

I think even durons have that

----------

## ghetto

Ok thanks, but I have one more question.. I know that at the beginning of this thread the idea of adding flags like -mmmx -m3dnow etc etc was HIGHLY encouraged. 

Is that still the case? Or has it been established that doing so is not really necessary.

Here are my current: CFLAGS="-march=athlon -O2 -mmmx -m3dnow  -falign-functions=64 -pipe"

----------

## taskara

I'm not sure.. I was under the impression that -march=athlon-xp automatically entered all those flags.

however someone posted that it doesn't.

but then someone said that putting -march=athlon-xp -mmmx -m3dnow -msse actually disabled them because it was already enabled in -march=athlon-xp ...

so the short answer?

I'm still confused.

I leave them out, but put them in my USE flagset

it would be GREAT if a dev could confirm this. ..  :Wink: 

----------

## ghetto

Whoa.. ok so EITHER they are already in and it doesnt do anything OR putting those flags in actually disables those registers?!?! Eeep!  :Shocked: 

Ok Im removing those flags now.. dang that sucks.

----------

## Gnufsh

If I compile with -march=athlon-xp, sse, 3dnow, and mmx are enabled (through the -D__athlon_sse__ -D__tune_athlon__ -D__tune_athlon_sse__ -D__SSE__ -D__MMX__ -D__3dNOW__ -D__3dNOW_A__ macros). When I add, for example -mmmx, -mno-mmx appears after -mmmx in the "options enabled" list in the output of gcc -Q -v -march=athlon-xp -mmmx. However, -D__MMX__ doesn't go away, so MMX is still used. In short -mmmx, -msse, and -m3dnow are unnecessary, but they don't hurt.
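The macros mentioned above can also be checked from inside a program. A small sketch, where the function name and bit values are arbitrary choices for illustration:

```c
#include <assert.h>

/* Bit values are arbitrary labels chosen for this sketch. */
enum { HAS_MMX = 1, HAS_SSE = 2, HAS_SSE2 = 4, HAS_3DNOW = 8 };

/* Hypothetical helper: reports which SIMD macros the compiler
   predefined for this particular build. */
unsigned simd_macros_defined(void)
{
    unsigned mask = 0;
#ifdef __MMX__
    mask |= HAS_MMX;
#endif
#ifdef __SSE__
    mask |= HAS_SSE;
#endif
#ifdef __SSE2__
    mask |= HAS_SSE2;
#endif
#ifdef __3dNOW__
    mask |= HAS_3DNOW;
#endif
    assert(mask <= 15u); /* only the four bits above can be set */
    return mask;
}
```

Building this once with -march=athlon-xp alone and once with the extra -m flags, then comparing the results, shows whether the extra flags change anything. Running `gcc -march=athlon-xp -dM -E - < /dev/null` lists the same predefined macros without compiling a program.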

----------

## ghetto

Ok thanks, thats good to know. Its pretty painful to recompile an entire system on a 1ghz cpu.   :Evil or Very Mad: 

----------

## taskara

thanks gnufsh  :Smile: 

I think I read on the other post that someone showed how adding -mmmx and -msse etc actually caused them to become -mno-mmx and -mno-sse because they were already enabled in -march=athlon-xp.

so leaving them in won't hurt.. sweet as. ta..

----------

## Gnufsh

-mno-mmx, -mno-sse, and -mno-3dnow do show up on the "options enabled" part, but the macros that actually implement the mmx, sse, and 3dnow support are still enabled. And, sse, mmx, and 3dnow are still enabled.

----------

## taskara

hmm ok..

so adding -mmmx, etc to -march=athlon-xp makes them look disabled, but they are in fact enabled still ?

crazy

we need to get someone from gnu to clarify optimum settings!  :Wink: 

----------

## wrc1944

I agree- if some clarification from THE definitive source was forthcoming, that would be great, although it's hard to recognize what that source is. I guess this is what makes Linux interesting, but it would be nice if what seemingly was a straightforward procedure wasn't so ambiguous. Realizing different hardware is involved in each case, it would seem one could find out exactly what gcc optimizations actually do, or how they interact- but that knowledge is so far, elusive. Maybe it's just that nobody really knows, or  the gcc manual has a few errors itself, and leads us astray? We can get conflicting reports, all from sources who obviously know more than we do, but what are we to make of it?

As one who has spent much time trying to optimize to the max, I now realize I don't know what to take as gospel, though I surely appreciate all the help I've gotten.  At this point, I'm realizing I simply have to educate myself, and then do it myself, and see what happens. Apparently, when you're on the edge with your own hardware, there is no other way.

wrc1944

----------

## drdabbles

Optimizations are like...well...I can't think of anything they are like.  I can only say that optimizations are like recursion.  "To understand recursion, you must first understand recursion".

One suggestion I would give to anyone with about 10 minutes of free time, would be to grab the GCC source from GNU, unpack it, and grep the resulting directory for "i586", or some other optimization flag.  Heck, you could even grep for "march" if you wanted to.

This will allow you to find out what file contains the optimization interpretation code.  You can then open it in your favorite viewer/editor, and peruse to your heart's content.  It is pretty straight-forward.  You will be able to see what flags are enabled for which CPUs, etc.

With regard to getting anything from developers...most of them are just as confused as we are.  There is no clear, concise documentation that discusses the effects of optimization flags...anywhere.  There are people's opinions, and people that look at the macros, etc.  I choose to believe the latter of the two, but again, they are just opinions as to the effects.

Also, I would like to say that it might be difficult to get an official response from the developers that work(ed) on the GCC compiler.  They are an open source community, and many times it is difficult to pin down the specific person or group responsible for a particular segment of code.  However, perhaps someone could try anyway.

Good luck all,

Thomas Cameron

CEI Systems, Inc.

----------

## taskara

this is a pretty simple issue tho, surely someone who knows programming can test it and work it out ?

we aren't asking what optimisations do, we simply want to know if "-march=athlon-xp" enables "-mmmx, -m3dnow, and -msse" by default.

if it doesn't then we need to put them in our CFLAGS.

if it does, then we don't need to put them in our CFLAGS.

some people have said, put them in anyway. but others have said if you do that and -march DOES automatically put them in, then you will actually DISable them ...

it's all too confusing a situation, but surely there is a clear answer, either -march=athlon-xp DOES automatically incorporate those flags or it doesn't.

surely someone knows that?

----------

## ghetto

/me goes and writes stallman a letter.. 

Dear Mr Stallman,

I know you're a pretty busy guy with your.. uh.. hmm.. well whatever it is that you're doing, but could you please explain to us how gcc works?

Particularly I would like to know how I can make gcc read my email.. oh wait.. that's Emacs, well then, uh.. could you please explain how to make gcc do proper optimizations? It's very important to me that I run the absolutely fastest binary code imaginable and if I have even the slightest doubt that my code is not completely optimized I break out in a horrid rash. I'm sure you understand where I'm coming from.

Any help would be appreciated.

Sincerely

Gnu/Ghetto

----------

## taskara

 *ghetto wrote:*   

> /me goes and writes stallman a letter.. 
> 
> Dear Mr Stallman,
> 
> I know you're a pretty busy guy with your.. uh.. hmm.. well whatever it is that
> ...

 

Dear Mr Clever,

yeeeees... 

funny thing is he probably _would_ understand   :Razz: 

on a serious note, I think this is a handy thing to know.

You obviously don't, but that's ok.

Sincerely,

a guy trying to help other linux users who asked the question.

----------

## ghetto

ok so maybe i was a touch on the sarcastic side with that last comment.

i was only having fun, I honestly think this is pretty important as well.. but I just couldnt help but try to joke because in a way its kind of funny and my little imaginary email is true, some people really do break out in rashes if they think their code might not be optimized to the absolute degree..

 :Smile: 

----------

## taskara

gr00vy man... I didn't wanna come down on you harsh  :Wink: 

seriously tho.. an answer to the problem would be cool!  :Wink:  hehe

----------

## ghetto

Why dont we send some real email then?

----------

## taskara

cause don't wanna bother them  :Wink:  they have more important things to do ehe  :Smile: 

----------

## nico--

 *taskara wrote:*   

> cause don't wanna bother them  they have more important things to do ehe 

 

Obviously, good documentation isn't important.

Look at gentoo... advanced command line installation but the good documentation makes it _much_ easier.

----------

## m00dawg

I tried -mfpmath=sse,387 and then ran Pov-Ray. The results showed no real change - two of the three results were even slightly slower. I don't know how this might impact other benchmarks, but for Pov-Ray it seems to make no difference. This is unfortunate since you would think the additional registers would help - if nothing else, temporary data could be placed there.

 *Malakin wrote:*   

> 
> 
> I doubt using -mfpmath=sse,387 makes any actual performance difference with anything, someone please prove me wrong.

 

----------

## m00dawg

Perhaps it would be a good idea if someone were to post up some benchmarks using different flags? I have already commented on Pov-Ray benchmarks vs using mfpmath=sse,387, but it would be interesting to see others.

An easy benchmark to run is timing a kernel build. It's easy, reasonably fast, and the kernel is already there for you to play with  :Smile: 

----------

## _Edulix

Hi all!

I've read the whole thread, and I have some things to say.

There are some people who use very long CFLAGS but don't really know what their options do. I was one of them yesterday when I compiled my new gentoo system, but now I'm going to change my flags for several reasons.

Thank you defconfoo, you have been very helpful to me! I've read in http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html about the options which are not included by -Ox, and there are some of them we must read about before adding them to our CFLAGS:

 *Quote:*   

> 
> 
> -fomit-frame-pointer
> 
>     Don't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions. It also makes debugging impossible on some machines.
> ...

 

Finally, these are the CFLAGS I've selected for my two gentoo machines:

Machine 1 with an Athlon XP 2000+, specifically:

 *Quote:*   

> 
> 
> # cat /proc/cpuinfo
> 
> processor       : 0
> ...

 

CFLAGS="-march=athlon-xp -O3  -fomit-frame-pointer -pipe -falign-functions=64 -mfpmath=sse,387 -m3dnow -msse -mmmx -ffast-math"

Machine 2, with a Celeron (Coppermine) 900 Mhz,

 *Quote:*   

> 
> 
> processor       : 0
> 
> vendor_id      : GenuineIntel
> ...

 

CFLAGS="-march=pentium3 -O2 -pipe -fomit-frame-pointer -falign-functions=16 -ffast-math -m3dnow -msse  -mmmx -mfpmath=sse,387"

What do you think about adding -mfpmath=sse,387 for a celeron coppermine? Do we all actually need -fomit-frame-pointer if we already have -Ox? Would you change (add, delete...) any option in my CFLAGS?

Thank you all,

                      Edulix.

----------

## wrc1944

After reading man gcc many times, and going over all the info on the Gentoo forum's Cflags Central thread (about 20 pages, and very informative), and every other forum and info source I could find for months, I finally settled on what I thought was the best set of optflags for athlon-xp platforms. Here they are: 

optflags: athlon -O3 -fomit-frame-pointer -pipe -march=athlon-xp -mmmx -msse -m3dnow -falign-functions=16 -falign-labels=1 -falign-loops=16 -falign-jumps=16 -fprefetch-loop-arrays -mfpmath=sse,387 -ffast-math -fforce-addr 

I won't go into details about why I included, or removed specific flags here. I did try these on XFree86, and they built and installed fine, with only a few warnings and no errors, and after five days, no apparent problems have surfaced, and fine performance. Curiously, the XFree86 compile dropped the -ffast-math flag, but other packages keep it.

For what it's worth, this was done on Mandrake 9.1, as it's almost impossible to really utilize Gentoo's advantages on a 56k dialup connection without at least using the wvdial "resuming downloads" function (I must share the one phone line with others, so I never get more than 1-2 hours at a time). By the time I downloaded the equivalent of emerge world on dialup without the option of leaving my system on overnight, it would be obsolete.

wrc1944

----------

## xedx

i read it is advised to add -falign-functions=64 to your CFLAGS if you have an athlon/duron, does it make any difference on a pentium4? 

btw how 'bout hyperthreading. I have a pentium4 with [ht] and another one without. ICC does have some optimizations on [ht] IIRC 

Any thoughts

----------

## TheCoop

 *wrc1944 wrote:*   

> optflags: athlon -O3 -fomit-frame-pointer -pipe -march=athlon-xp -mmmx -msse -m3dnow -falign-functions=16 -falign-labels=1 -falign-loops=16 -falign-jumps=16 -fprefetch-loop-arrays -mfpmath=sse,387 -ffast-math -fforce-addr

 Just recompiled the world with these CFLAGS, nothing broke and it seems slightly faster... (except you dont need -m3dnow, -mmmx or -msse as -march=athlon-xp enables those anyway)

----------

## wrc1944

TheCoop,

Glad to hear those flags work on Gentoo also. In all my reading, I ran across many conflicting opinions about whether to include -mmmx, -m3dnow, and -msse. Some implied -march=athlon-xp did not in all cases automatically activate those opts. I decided to err on the side of caution, and add them, even if they aren't really needed with -march=athlon-xp.

I'd sure like to know one way or the other, but many persons who obviously knew more than I did felt you should include them.

wrc1944

----------

## bkeating

I don't quite understand the construction of this line... Im running a Pentium 4 (3.06Ghz) and these are the flags it gives me;

```
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
```

I see a lot of guys here running P4's as well, where do they come up with "-fomit-frame-pointer" n such?  Am I missing the format structure?

Would this be correct for me;

```
 

-march=pentium4 -03 -pipe -fpu -vme -de -pse -tsc -msr -pae -mce -cx8 -apic -sep -mtrr -pge -mca -cmov -pat -pse36 -clflush -dts -acpi -mmx -fxsr -sse -sse2 -ss -ht -tm

```

or can only a few be used?

----------

## puddpunk

 *bkeating wrote:*   

> I don't quite understand the construction of this line... Im running a Pentium 4 (3.06Ghz) and these are the flags it gives me;
> 
> ```
> fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
> ```
> ...

 

Hi there bkeating.

You don't really have the grasp of it, but I'll be glad to point you in the right direction  :Wink: 

The first set of "flags" you pasted were the CPU flags. It is a list of all the features your CPU supports, performance enhancing or not; it's basically everything that exists on your processor (i.e. mmx, sse, sse2, ht (hyper-threading)).

The second lot of "flags" are Compiler flags. They are instructions to the compiler to tell it to compile code a certain way. Most flags start with a letter (i.e. -m, -f or -O).

-O means optimise (1,2,3 or s (small)). These flags turn on a whole lot of other flags, such as -fomit-frame-pointer (telling the compiler to omit the frame pointer, used for debugging only, so it frees up an extra CPU register).

-f options, such as -fomit-frame-pointer, or -ffast-math, tells the compiler to compile code a certain way. Most of the time it is to optimise.

-m options can be thought of as "feature options". Such as -mmmx turns on mmx support, -msse or -msse2 turns on sse(2) support in the compiled code.

You have to take the flags that /proc/cpuinfo gives you and "translate" them into flags that gcc understands, so it can tailor its code to your CPU. That's what this thread is about. Also, there is another thread in the Portage & Programming forum about CFLAGS, called CFLAGS central, which would probably help you a lot.
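As a rough illustration of that "translation", here is a small C sketch that picks the SIMD entries out of a cpuinfo flags line and maps them to the -m options discussed in this thread. The function name and the mapping table are assumptions for this example and cover only the four flags the thread talks about.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the GCC -m options matching the SIMD entries
   of a /proc/cpuinfo "flags" line into out. Returns how many were found. */
int cpu_flags_to_gcc(const char *cpu_flags, char *out, size_t out_len)
{
    static const struct { const char *cpu; const char *gcc; } map[] = {
        {"mmx", "-mmmx"}, {"3dnow", "-m3dnow"},
        {"sse", "-msse"}, {"sse2", "-msse2"},
    };
    char copy[1024];
    int count = 0;

    assert(cpu_flags != NULL && out != NULL && out_len > 0);
    out[0] = '\0';

    /* strtok modifies its input, so work on a private copy. */
    strncpy(copy, cpu_flags, sizeof copy - 1);
    copy[sizeof copy - 1] = '\0';

    for (char *tok = strtok(copy, " \t\n"); tok; tok = strtok(NULL, " \t\n")) {
        for (size_t i = 0; i < sizeof map / sizeof map[0]; i++) {
            if (strcmp(tok, map[i].cpu) == 0) {
                size_t used = strlen(out);
                snprintf(out + used, out_len - used, "%s%s",
                         count ? " " : "", map[i].gcc);
                count++;
            }
        }
    }
    return count;
}
```

Everything else on the flags line (fpu, vme, pse and so on) has no matching compiler option and is simply skipped. Per the discussion above, a suitable -march setting may already imply these options.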

Hope this helps,

Chris.

----------

## bsolar

 *puddpunk wrote:*   

> -O means optimise (1,2,3 or s (small)). These flags turn on a whole lot of other flags, such as -fomit-frame-pointer (telling the compiler to omit the frame pointer, used for debugging only, so it frees up an extra CPU register).

 

It's enabled only if the arch supports debugging without it, so on x86 -fomit-frame-pointer is not enabled with any -O (you must specify it if you want it).

----------

## dgrant

from freehackers.org:

 *Quote:*   

> 
> 
>  Athlon-tbird, aka K7 (AMD)
> 
> CFLAGS="-march=athlon-tbird -O3 -pipe -fforce-addr -fomit-frame-pointer
> ...

 

It says that 3dnow and mmx optimization are implied by athlon-tbird?

----------

## xedx

 *dgrant wrote:*   

> from freehackers.org:
> 
>  *Quote:*   
> 
>  Athlon-tbird, aka K7 (AMD)
> ...

 

that's why u dont need to put them (eg. -m3dnow) when you have -march={athlon,etc}

----------

## Radea

Athlons have 128k L1 cache (I think it is Durons with 64), therefore would it not be better to use '-falign-functions=128' instead of '-falign-functions=64'

----------

## taskara

 *Radea wrote:*   

> Athlons have 128k L1 cache (I think it is Durons with 64), therefore would it not be better to use '-falign-functions=128' instead of '-falign-functions=64'

 

athlons have 64k level1 cache, and 256k level 2 cache (barton athlons have 512k), whereas durons have 64k level 1 cache, and 128k level 2 cache.

they both have the same level 1 cache size, so '-falign-functions=64' is correct for both cpus.

http://www.penguinitis.com/images/athloncache.jpg

----------

## Radea

 *taskara wrote:*   

>  *Radea wrote:*   Athlons have 128k L1 cache (I think it is Durons with 64), therefore would it not be better to use '-falign-functions=128' instead of '-falign-functions=64' 
> 
> athlons have 64k level1 cache, and 256k level 2 cache (barton athlons have 512k), whereas durons have 64k level 1 cache, and 128k level 2 cache.
> 
> they both have the same level 1 cache size, so '-falign-functions=64' is correct for both cpus.
> ...

 

Why am I thinking 128 then?  :Sad:  I'm also thinking 384K total cache for non-Bartons; maybe AMD was adding the I-Cache and D-Cache together as a marketing number?   :Confused:  Either that or I'm just going completely crazy   :Laughing: 

Edit

"All Athlon XP processors (including Barton) contain a 128K L1 cache, 64K for data, and 64K for instructions." |LINK|

That must be whats getting me.   :Razz:  So it is 128K L1 cache but there are two types, or?   :Embarassed: 

----------

## taskara

hehe what's probably getting you is that athlons can COMBINE their level 1 and level 2 cache, whereas pentiums can't, they remain separate  :Smile: 

----------

## pi-cubic

first: wow, there is much info in this thread and i feel quite overwhelmed by it...and i'm not even a native english speaker  :Wink: 

second: reading this thread and other sources, i finally have the following CFLAGS for my Intel Pentium 4M (laptop) machine:

```
CFLAGS="-march=pentium3 -mcpu=pentium4 -O3 -finline-functions -falign-jumps=5 -falign-loops=5 -falign-functions=64 -pipe"
```

it would be a great help for me, if anyone could tell me if i included a very stupid bug. thank you guys...

pi-cubiq

----------

## taskara

 *pi-cubiq wrote:*   

> first: wow, there is much info in this thread and i feel quite overwhelmed by it...and i'm not even a native english speaker 
> 
> second: reading this thread and other sources, i finally have the following CFLAGS for my Intel Pentium 4M (laptop) machine:
> 
> ```
> ...

 

the only problem I can see is that it should read  *Quote:*   

>  -march=athlon-xp

 

 :Twisted Evil: 

----------

## pi-cubic

 *taskara wrote:*   

> the only problem I can see is that it should read  *Quote:*    -march=athlon-xp 
> 
> 

 

i'm sorry, but i can't follow you   :Sad: . do you mean, that my cflags-settings would be for an athlon-xp? what do you mean by 'it should read'?

----------

## MOS-FET

ok so i've read almost every post in this topic, and i finally chose these cflags/cxxflags for me (i've got an athlon-xp):

"-march=athlon-xp -O3 -pipe -m3dnow -mmmx -msse -mfpmath=sse,387 -finline-functions -fmerge-all-constants -fthread-jumps -fomit-frame-pointer -fexpensive-optimizations -ffast-math -fforce-addr -falign-functions=64 -falign-jumps=4 -falign-loops=4 -frerun-cse-after-loop -frerun-loop-opt -fprefetch-loop-arrays -maccumulate-outgoing-args"

i've just compiled a few packages with that, and everything seems to work fine. i'll do an emerge -e world this night and see what happens tomorrow :-) do i remember right that those cflags do NOT apply when making a new kernel?

do you have any suggestions about these cflags? did i miss something or should i remove something? this whole cflags thing is damn confusing, i mean, there must be someone out there who knows what they all do and if it's a good idea to use them or not ...

tom

----------

## esapersona

I have an athlon-xp and I had a few problems with fast-math...Mainly on the emerge system, so you may need to muddle around with those packages if you want to use fast-math...

----------

## MOS-FET

well my emerge -e world just finished, and everything is just working fine. no problems at all yet, neither at compilation nor when using the system.

----------

## esapersona

Great!  Perhaps I'll have to look into using those CFLAGS....Perhaps it was some combination that I used

----------

## MOS-FET

well -ffast-math and -mfpmath= seem like they have to do with each other, i don't know.

----------

## MOS-FET

ok, so i finished compiling all packages with the CFLAGS i posted earlier and i must say - my system is NOTICEABLY faster. everything runs really smooth, much smoother than before.

ok here's my hardware data:

athlon xp 2200+ (1800 mhz)

msi f41 mainboard

nvidia nforce2 chipset

my CFLAGS and CXXFLAGS are:

"-march=athlon-xp -O3 -pipe -m3dnow -mmmx -msse -mfpmath=sse,387 -finline-functions -fmerge-all-constants -fthread-jumps -fomit-frame-pointer -fexpensive-optimizations -ffast-math -fforce-addr -falign-functions=64 -falign-jumps=4 -falign-loops=4 -frerun-cse-after-loop -frerun-loop-opt -fprefetch-loop-arrays -maccumulate-outgoing-args" 

as i said, i did an emerge -e world, and i had no problems compiling/running the system, kde, mozilla, k3b, mplayer, xmms, gaim, lmule and a few other apps so far - and it's stunningly fast! i wish i had made a benchmark, but it really feels much faster.

tom

----------

## esapersona

Okay - I've changed my CFLAGS to what you have.  I have to try this   :Surprised: 

Seems to be going alright - *yay*

----------

## MOS-FET

hey i even just compiled openoffice 1.1beta2 with the above cflags. that really surprises me because the ebuild tells you that openoffice is very fragile about aggressive cflags ... but openoffice is so stunningly fast now!

----------

## drake51

I am doing an emerge -eUD world right now. I have modified my use/cflags as stated below.  

If this were to fail for some non-critical package, what is the best way to handle it?  Will adding --resume  --skipfirst get past it without recompiling all the prior packages again?

As for the flags...I based them on the awesome details provided by defconfoo.  Do they look complete?  I have been compiling with them for the past 5 hrs (after I recompiled the kernel and rebooted).

```
snip from cpuinfo....

model name      : Intel(R) Pentium(R) 4 CPU 3.06GHz

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm

```

```
USE="aaob acpi acpi4linux dvd emacs fbcon gtk2 jikes ofx pda radeon samba sse usb"

CFLAGS="-march=pentium4 -O3 -pipe -mfpmath=sse -fomit-frame-pointer -ffast-math -fprefetch-loop-arrays -fmerge-all-constants -mmmx -msse -msse2"

```

----------

## esapersona

 *drake51 wrote:*   

> If this were to fail for some non-critical package, what is the best way to handle it?  Will adding --resume  --skipfirst get past it without recompiling all the prior packages again?

 

I had a package fail during my current emerge -e....How annoying.  What I did:

```
emerge -ep world > foo

vi foo
```

Edit out the first few lines and the packages already emerged so that you're only left with the list of packages not yet updated (it's in order, so just remember the last one and delete up to that). Then type

```
:1,$s/\[ebuild\ \ N\ \ \ \]//g
```

That'll get rid of all the ebuild stuff. Then I went through deleting all the version numbers, so that I was left with only the package names.  There is probably a way to do that automatically (perhaps something like this?):

EDIT:  I just tried this line - Doesn't work...You need to replace the * with something that means any number of characters....

```
:1,$s/-[0987654321]*\n//g
```

Then write and quit with :wq, and on the command line type:

```
emerge -p `cat foo`
```

to make sure it's all good... Then do that line again without the -p.  There is probably a better way, but I took the opportunity to familiarize myself with vim instead of finding it out   :Wink: 
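For the version-stripping step, the rule "cut at the first dash that is followed by a digit" can be sketched in C as below. The function name is hypothetical, and the rule is an assumption about how package names and versions are joined. It would misfire on a package whose own name contains a dash followed by a digit.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: truncates a "category/name-version" string in place
   at the first '-' that is directly followed by a digit. */
void strip_version(char *atom)
{
    assert(atom != NULL);
    for (char *p = atom; *p != '\0'; p++) {
        if (p[0] == '-' && isdigit((unsigned char)p[1])) {
            *p = '\0';
            return;
        }
    }
}
```

Given "sys-devel/gcc-3.2.3-r1" this leaves "sys-devel/gcc". The dash in "sys-devel" is kept because a letter follows it.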

----------

## TheCoop

or you could type

```
emerge -e --resume world
```

----------

## esapersona

Bah - But what do you learn from that? =P

----------

## cchapman

Here are the optimizations enabled at each -O level:

```
-O
          -fdefer-pop
          -fmerge-constants
          -fthread-jumps
          -floop-optimize
          -fcrossjumping
          -fif-conversion
          -fif-conversion2
          -fdelayed-branch
          -fguess-branch-probability
          -fcprop-registers

-O2
          -fdefer-pop
          -fmerge-constants
          -fthread-jumps
          -floop-optimize
          -fcrossjumping
          -fif-conversion
          -fif-conversion2
          -fdelayed-branch
          -fguess-branch-probability
          -fcprop-registers
          -fforce-mem
          -foptimize-sibling-calls
          -fstrength-reduce
          -fcse-follow-jumps
          -fcse-skip-blocks
          -frerun-cse-after-loop
          -frerun-loop-opt
          -fgcse
          -fgcse-lm
          -fgcse-sm
          -fdelete-null-pointer-checks
          -fexpensive-optimizations
          -fregmove
          -fschedule-insns
          -fschedule-insns2
          -fsched-interblock
          -fsched-spec
          -fcaller-saves
          -fpeephole2
          -freorder-blocks
          -freorder-functions
          -fstrict-aliasing
          -falign-functions
          -falign-jumps
          -falign-loops
          -falign-labels

-O3
          -fdefer-pop
          -fmerge-constants
          -fthread-jumps
          -floop-optimize
          -fcrossjumping
          -fif-conversion
          -fif-conversion2
          -fdelayed-branch
          -fguess-branch-probability
          -fcprop-registers
          -fforce-mem
          -foptimize-sibling-calls
          -fstrength-reduce
          -fcse-follow-jumps
          -fcse-skip-blocks
          -frerun-cse-after-loop
          -frerun-loop-opt
          -fgcse
          -fgcse-lm
          -fgcse-sm
          -fdelete-null-pointer-checks
          -fexpensive-optimizations
          -fregmove
          -fschedule-insns
          -fschedule-insns2
          -fsched-interblock
          -fsched-spec
          -fcaller-saves
          -fpeephole2
          -freorder-blocks
          -freorder-functions
          -fstrict-aliasing
          -falign-functions
          -falign-jumps
          -falign-loops
          -falign-labels
          -finline-functions
          -funit-at-a-time
          -frename-registers
```

----------

## invaderzim

```
-march=pentium3 -mmmx -msse -O2 -fomit-frame-pointer -pipe -mfpmath=sse,387 -mno-push-args -mno-align-stringops -frename-registers -ffast-math -fprefetch-loop-arrays -s
```

implies

```
options enabled:  -fdefer-pop -fomit-frame-pointer -foptimize-sibling-calls

 -fcse-follow-jumps -fcse-skip-blocks -fexpensive-optimizations

 -fthread-jumps -fstrength-reduce -fprefetch-loop-arrays -fpeephole

 -fforce-mem -ffunction-cse -fkeep-static-consts -fcaller-saves

 -fpcc-struct-return -fgcse -fgcse-lm -fgcse-sm -frerun-cse-after-loop

 -frerun-loop-opt -fdelete-null-pointer-checks -fschedule-insns2

 -fsched-interblock -fsched-spec -fbranch-count-reg -freorder-blocks

 -frename-registers -fcprop-registers -fcommon -fgnu-linker -fregmove

 -foptimize-register-move -fargument-alias -fstrict-aliasing

 -fmerge-constants -fident -fpeephole2 -fguess-branch-probability

 -funsafe-math-optimizations -m80387 -mhard-float -mno-soft-float

 -mfp-ret-in-387 -mno-align-stringops -mno-push-args -mmmx -mno-mmx -msse

 -mno-sse -mcpu=pentium3 -mfpmath=sse,387 -march=pentium3
```

These flags are all safe except -ffast-math, but I have had NO problems with it yet on my old flags (-s -march=pentium3 -mmmx -msse -Os -fomit-frame-pointer -pipe -fforce-addr -ffast-math -mpush-args -mfpmath=sse,387 -fschedule-insns2 -fmerge-all-constants)

I emailed GNU about -mmmx -mno-mmx and the sse ones... I'll tell you what they say; I'm hoping it's just an error in the output.  

@all:  I still don't REALLY know if MMX and SSE are IMPLIED by -march=, because what's the point of -mmmx and -msse then?

defconfoo:

I'd like your input on these flags... I think they are the best flags possible... they use the per-CPU defaults for all the settings not specified.  What do you think of -ffast-math?  If you think it's okay because MANY MANY MANY use it for their whole system with no problems, then what do you think about   

```
-fno-math-errno

    Do not set ERRNO after calling math functions that are executed with a single instruction, e.g., sqrt. A program that relies on IEEE exceptions for math error handling may want to use this flag for speed while maintaining IEEE arithmetic compatibility.

    This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.

    The default is -fmath-errno. 
```

errm just answered my question so you all know..

-ffast-math

    Sets -fno-math-errno, -funsafe-math-optimizations,

    -fno-trapping-math, -ffinite-math-only and

    -fno-signaling-nans.

Thanks LAta!
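For anyone wondering why -funsafe-math-optimizations (one of the things -ffast-math sets) can matter: floating-point addition is not associative, so a compiler that is allowed to regroup a sum can change the answer. The snippet below only shows the arithmetic effect with ordinary doubles in Python; it says nothing about which regroupings gcc actually performs on any given piece of code.

```python
a, b, c = 1e16, -1e16, 1.0

# The order written in the source: the two big values cancel first.
left_to_right = (a + b) + c

# A regrouped order: 1.0 is too small to survive next to -1e16,
# so it is rounded away before the big values cancel.
reassociated = a + (b + c)

print(left_to_right, reassociated)
```

Most programs never notice differences like this, which probably explains the "no problems yet" reports, but code that depends on exact IEEE results can.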

----------

## ph317

A few corrections to some misinfo above:

First off, L1 and L2 caches are separate, even on athlons.

Also, those caches, whether they're 64, 128, 256, etc... are in Kbytes, not megabytes.

Last but not least, please don't go doing "-falign-functions=64" or any other crazy value like that.  Sane values are things like "4".  It's just how many bytes to align the functions by so that jumps are efficient.  Jumps, memory access, etc. on some processors are just more efficient when aligned on certain boundaries, usually something like 4 bytes, which has nothing to do with L1/L2 cache size.  The only relation between -falign-XXXX=Y and cache sizes is that as you increase the alignment value for faster access, you leave gaps, which means the overall size of the code or data is larger and runs a greater statistical chance of causing cache misses by making things a little further apart.
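To put rough numbers on the "gaps" point, here is a small sketch that lays out a handful of functions back to back and rounds each start address up to the alignment boundary. The function sizes are made up for illustration, and real layout also depends on the linker, so treat it as arithmetic only.

```python
def padded_size(sizes, alignment):
    """Total bytes used when each function starts on an `alignment` boundary."""
    total = 0
    for size in sizes:
        # Round the next start address up to the boundary, then place the function.
        total = (total + alignment - 1) // alignment * alignment
        total += size
    return total


# Hypothetical function sizes in bytes.
function_sizes = [37, 120, 9, 64, 250, 18]

for alignment in (1, 4, 16, 64, 128):
    print(alignment, padded_size(function_sizes, alignment))
```

With these sizes, 498 bytes of code occupy 506 bytes at an alignment of 4 and 786 bytes at 128, so small functions are where a huge alignment value costs the most.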

----------

## Kesereti

But which runs of Athlon chips are T-Birds? ^_^  I'm rather confused about the naming conventions of AMD chips =P

----------

## odegard

 *ph317 wrote:*   

> A few corrections to some misinfo above:
> 
> First off, L1 and L2 caches are separate, even on athlons.

 

Actually, *only* on athlons.

However, I was thinking, what is the bottleneck on modern computers? I/O. So why don't we optimize the code for a smaller footprint rather than for faster execution? Let's be utterly simplistic and say that there are two variables: LOAD and EXECUTE. LOAD is far bigger than EXECUTE, so in order to get a total boost, get LOAD down, even though it may take longer EXECUTING.

Agree/disagree?

----------

## ph317

 *odegard wrote:*   

>  *ph317 wrote:*   A few corrections to some misinfo above:
> 
> First off, L1 and L2 caches are separate, even on athlons. 
> 
> Actually, *only* on athlons.
> ...

 

L1 and L2 are separate on all processors that have both.  They are entirely different types of memory: the L1 is much faster than the L2, and therefore much more expensive per byte and much smaller.  Being different kinds of memory and being attached at totally different places electrically, they are different.  If the L1 and L2 of a processor were the same, there would be no point in calling them L1 and L2 to begin with; you would just say you had a huge slow L1 or a small fast L2 or something.

On the I/O point, well yes, normal tasks on a desktop system these days are more I/O bound than CPU bound, but they're bound by things like disks, network cards, the net itself, and your keyboard and mouse speed of course; you wouldn't believe how much time the average PC spends twiddling its thumbs waiting on the end user.  In terms of the instruction optimizations that this thread is talking about, going from a widely-aligned, loop-unrolled, fat set of optimizations to -Os and alignments set to zero isn't really making a difference by lowering I/O load per se: if they help, they help because smaller, tighter code keeps more references local to L1 and/or L2 cache instead of taking a cache miss and going out to slow main memory.  There are definitely some tradeoffs involved of course.  On a Xeon with a couple of megs of L2 cache it's probably not worth it to go -Os, but if whatever x86 clone you're using has like 128k or less of L2, it could very well help.  Benchmarking your own CPU running the tasks you generally run is the best way to tell.

----------

## odegard

 *ph317 wrote:*   

>  *odegard wrote:*    *ph317 wrote:*   A few corrections to some misinfo above:
> 
> First off, L1 and L2 caches are separate, even on athlons. 
> 
> Actually, *only* on athlons.
> ...

 

Yes, they are separate entities physically. What I meant was that in a P4, the caches are INCLUSIVE, meaning that everything that is contained in the L1 cache is duplicated in the L2 cache (actually, the P4 has two kinds of L1 cache, but that's a different story). In an Athlon, however, they are EXCLUSIVE. Now perhaps my reply makes more sense. I was talking about separate entities FUNCTIONALLY, while I guess you meant physically...

Anyway, nothing to argue about.

----------

## Gandalf_Grey_

I have an athlon tbird @1.33 ghz. cat /proc/cpuinfo returns this

```

processor       : 0

vendor_id       : AuthenticAMD

cpu family      : 6

model           : 4

model name      : AMD Athlon(tm) Processor

stepping        : 4

cpu MHz         : 1343.062

cache size      : 256 KB

fdiv_bug        : no

hlt_bug         : no

f00f_bug        : no

coma_bug        : no

fpu             : yes

fpu_exception   : yes

cpuid level     : 1

wp              : yes

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow

bogomips        : 2680.42
```

and my current flags are

-march=athlon-tbird -O3 -pipe -fomit-frame-pointer -mno-push-args -ffast-math -fmerge-all-constants -m3dnow -mmmx -falign-functions=128

Does anyone see any blatant errors with this, or places I could improve? I have successfully compiled the gimp and there was a noticeable improvement in start time. However, would this be sufficient to compile something as picky as OpenOffice?

----------

## higman

I have a tbird @ 1.4, runs well, also doubles as a space heater!

 *Gandalf_Grey_ wrote:*   

> -march=athlon-tbird -O3 -pipe -fomit-frame-pointer -mno-push-args -ffast-math -fmerge-all-constants -m3dnow -mmmx -falign-functions=128
> 
> Does anyone see any blatant errors with this, or places I could improve? I have successfully compiled the gimp and there was a noticeable improvement in start time. However, would this be sufficient to compile something as picky as OpenOffice?

 

I'm using: -march=athlon-tbird -O3 -pipe -fomit-frame-pointer

What flags were you using before and which ones did you add to get this boost? As for your flags... after reading this entire thread and investigating a little on my own...

-m3dnow and -mmmx are redundant (-march=athlon-tbird implies them)

-falign-functions=128 is insignificant and/or dangerous; to the best of my knowledge, the compiler has good defaults for the different CPUs (presumably tuned by the developers?)

-ffast-math can cause calculations that depend on exact IEEE behaviour to give wrong results

-fmerge-all-constants reduces size by a small amount with no other gain.

-O3 -pipe -fomit-frame-pointer looks good to me. I don't know anything about -mno-push-args though, are the tbirds not stack friendly?
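A quick way to spot the redundant ones is to compare the -m options in a CFLAGS string against what the chosen -march already covers. The sketch below assumes the small table in it, which is my reading of the CPU-type descriptions in the GCC manual for a handful of -march values; it is not exhaustive, and `redundant_flags` is just an illustrative name.

```python
# Instruction-set options implied by a few -march values (assumed from the
# GCC manual's CPU-type descriptions; deliberately incomplete).
IMPLIED = {
    "pentium2": {"-mmmx"},
    "pentium3": {"-mmmx", "-msse"},
    "pentium4": {"-mmmx", "-msse", "-msse2"},
    "athlon": {"-mmmx", "-m3dnow"},
    "athlon-tbird": {"-mmmx", "-m3dnow"},
    "athlon-xp": {"-mmmx", "-m3dnow", "-msse"},
}


def redundant_flags(cflags):
    """Return the -m options in `cflags` that the -march value already implies."""
    flags = cflags.split()
    march = next((f.split("=", 1)[1] for f in flags if f.startswith("-march=")), None)
    implied = IMPLIED.get(march, set())
    return [f for f in flags if f in implied]


tbird_flags = ("-march=athlon-tbird -O3 -pipe -fomit-frame-pointer -mno-push-args "
               "-ffast-math -fmerge-all-constants -m3dnow -mmmx -falign-functions=128")
print(redundant_flags(tbird_flags))
```

Leaving the redundant options in should be harmless; they just make the line longer.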

----------

## TeeHee

'nother question here.

Trying to use openmosix on two machines using different cflags.

Anyone had success? Problems? Anything?

----------

## aardvark

 *elektrohirn wrote:*   

> hey i just even compiled openoffice 1.1beta2 with the above cflags. that really surprises me, because the ebuild tells you that openoffice is very fragile with aggressive cflags ... but openoffice is so stunningly fast now!

 

Doesn't the openoffice ebuild filter out most flags though?

----------

## higman

 *aardvark wrote:*   

> Doesn't the openoffice ebuild filter out most flags though?

 

yes, it does, here's a segment from /usr/portage/app-office/openoffice/openoffice-1.1_beta2-r1.ebuild:

```
inherit flag-o-matic eutils

# Compile problems with these ...

filter-flags "-funroll-loops"

filter-flags "-fomit-frame-pointer"

replace-flags "-O3" "-O2"
```
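To see what that does to a typical aggressive flag set, here is a toy re-creation of the two operations. These Python functions only mimic the idea of `filter-flags` and `replace-flags` with exact string matching; they are not the flag-o-matic implementation, which may well handle more cases.

```python
def filter_flags(cflags, *unwanted):
    """Drop every flag that exactly matches one of `unwanted`."""
    return [flag for flag in cflags if flag not in unwanted]


def replace_flags(cflags, old, new):
    """Swap `old` for `new` wherever it appears."""
    return [new if flag == old else flag for flag in cflags]


# Example input; any CFLAGS string from this thread would do.
cflags = "-march=athlon-tbird -O3 -pipe -fomit-frame-pointer -funroll-loops".split()

cflags = filter_flags(cflags, "-funroll-loops")
cflags = filter_flags(cflags, "-fomit-frame-pointer")
cflags = replace_flags(cflags, "-O3", "-O2")

print(" ".join(cflags))
```

Only the flags named in the ebuild are touched, so with this input what reaches the compiler is `-march=athlon-tbird -O2 -pipe`.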

----------

## T2

I've read the whole thread; it's really informative (and confusing at moments). 

I'm staying with the tried and trusted CFLAGS="-march=athlon-tbird -O3 -pipe" 

for my tbird 1.33ghz.

IMHO critical packages such as the kernel (and mplayer   :Laughing:  )  do their own cpu optimisations, which are satisfactory. However, I'm tempted to try some aggressive gcc compile flags to overcome openoffice's laziness.

regards

----------

## Gandalf_Grey_

The cflags I mentioned above compiled openoffice fine, and it feels noticeably more responsive than the binary install. Before I changed my flags I had 

-march=athlon-tbird -O3 -pipe

I did some research and it seems my current ones (mentioned above) are about as aggressive as I can get without breaking compiles left and right

----------

## FastTurtle

I've got an XP1800 and these are the flags I'm using.

-march=athlon -m3dnow -mmmx -msse -O3 -pipe. 

 :Crying or Very sad:  Because my last build went south with more aggressive flags, I'm sticking with stability over speed right now because I've got a full gig of ram. Speed isn't a problem that I've noticed except with Open Office taking forever to load.  :Confused: 

As far as this thread goes, I'm real happy to have read the entire thing. Maybe I will begin testing some of the optimizations and seeing what speeds things up, especially KDE/Office 1.03 and other large apps.

----------

## Gandalf_Grey_

 *FastTurtle wrote:*   

> I've got an XP1800 and these are the flags I'm using.
> 
> -march=athlon -m3dnow -mmmx -msse -O3 -pipe. 
> 
>  Because my last build went south with more aggressive flags, I'm sticking with stability over speed right now because I've got a full gig of ram. Speed isn't a problem that I've noticed except with Open Office taking forever to load. 
> ...

 

If you have an athlon XP I hardly think that using the athlon-xp cflag is being aggressive.

----------

## Forge

OK, here's my semi-definitive Pentium/Athlon features guide and cache lecture.. I hope lynx doesn't barf.

(These are only cflag-relevant features, but I won't go into cache line sizes, etc.)

486: Not much. FPU.... Usually.

Pentium non-MMX: Same as 486, but i586.

Pentium MMX: adds MMX. Duh.

Pentium 2: Same as Pentium MMX, now i686.

Pentium 3: Adds SSE.

Pentium 4: Adds SSE2.

Athlon: Pentium 2, plus Advanced (aka Athlon) 3Dnow. Same cflags as any K6-* as far as 3Dnow goes.

Athlon Tbird (on-die L2, socketed Athlon): Same as Athlon.

Athlon XP: Adds SSE, known as '3Dnow Professional' for marketing reasons. 3Dnow Pro actually includes new 3Dnow instructions, as well as finishing out SSE support (Athlons with MMX and 3Dnow had *some* of the SSE instructions, but not enough to use it as SSE)

Athlon XP (Barton): Goes to 512K L2 (instead of 256K on Tbird through Athlon XP)

Athlon64/Opteron: Adds SSE2, 1MB (1024KB) L2.

Celeron '1' (266MHz through 533MHz): Pentium 2, with 128K L2 instead of 512K/256K. The 266, 300 non-A, and 333 non-A versions actually have NO L2 whatsoever. These are fairly rare, though, and slot-only, FWIR.

Celeron '2' (533A MHz through 1.4GHz): Same as a Pentium3, SSE is added to the basic '1' Celeron. Early versions had 128K L2, A little past 1GHz, they moved to 256K L2.

Celeron 'P4' (1.6-2.4 or so): Same as a Pentium 4 (MMX, SSE, SSE2), only 128K L2 cache, though.

Now, as for cache sizes: Pretty much all of the Pentiums (P2 through P3 for sure) had 32K L1. This is divided into 16K of 'instructions' and 16K of 'data' cache. L1 cache and L2 cache are 'inclusive'. This means that any data that is in L1 MUST be in L2 also. Therefore a Pentium 2 with 32K of L1 and a 512K L2 has a TOTAL usable cache of only 512K. The Pentium 1's and MMXes had variable amounts of L2, sometimes 512K, sometimes 1MB, sometimes 2MB, always on the motherboard. Pentium 2's have 512K of L2 cache on the CPU card, but not on the core; it runs at half the speed of the CPU itself. The Pentium 3 had the same arrangement at first, 512K on card. Later Pentium 3's (Coppermine core) had 256K of L2 cache on the CPU core, running at full CPU speed. All Celerons have on-die, full-speed L2. The Pentium 4 is the odd duck out... It has '12k micro ops' of L1 instruction cache... This is figured to be roughly 8KB. There is also 8K of L1 data cache, IIRC. This is inclusive. The first Pentium 4s had 256K of on-die cache. Later models (Northwood core), starting at 1.6A through 3.2GHz, have 512K L2. Still inclusive. 512K total CPU cache.

Athlons, on the other hand, have *exclusive* L1/L2 caches. This means that data can be in L1 or L2, without the need to be in both. It's a minor boost in most things, since the data only has to be copied to the CPU once, and it allows more thorough utilization of the caches. This is much more important to Athlons than Pentiums, though, since Athlons (all of them, Athlon slot up through Barton and even the Opteron/Athlon64) have 128K of L1 cache. The original slot Athlon (Athlon Classic) had 128K of full-speed, on-cpu L1 cache, and 512K of L2 cache on the CPU card. This ran at 1/2, 2/5, or 1/3 of the CPU clock speed, depending on the CPU speed. (500MHz Athlons were 1/2, 750s were 2/5, 900+ were 1/3, IIRC). The Athlon 'Tbird' (Thunderbird core) changed this. It's a socketed CPU, so the L2 cache moved onto the CPU, changed to full CPU speed, and shrunk from 512K to 256K. This stayed the same for every Athlon from the Tbird through the Athlon XP, finally changing with the recent Barton core, which finally has 512K of full-cpu-speed L2. The Athlon64/Opteron have 1MB L2s. Now, since the caches don't have to hold the same info, marketing types often refer to the dual 64K L1s and the 256K L2 as '384K CPU cache'. This is technically correct. Since the Barton has 128K+512K, it technically has 640K total CPU cache. The Opteron/Athlon64 have 128K+1024K, 1152K total cache. Typically only marketing types refer to the caches this way, though. The Durons have always had 128K L1 and 64K L2. On a Pentium this wouldn't work at all, but since the Athlon series have exclusive caches, it gives the Duron 192K total cache... On an equivalent Pentium, it'd backfire, since only 64K of the L1 could be in L2 and thus used... Funny, eh?
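The totals quoted above all follow from one simple rule, sketched here. The model is the simplified one used in this post (exclusive caches add up, inclusive caches are capped by L2), and `usable_cache_kb` is just an illustrative name.

```python
def usable_cache_kb(l1_kb, l2_kb, exclusive):
    """Distinct data the two cache levels can hold together, in KB."""
    if exclusive:
        # Nothing has to be duplicated, so the sizes simply add up.
        return l1_kb + l2_kb
    # Inclusive: everything in L1 is also in L2, so L2 is the ceiling.
    return l2_kb


chips = [
    ("Pentium 2", 32, 512, False),
    ("Athlon Tbird", 128, 256, True),
    ("Athlon XP (Barton)", 128, 512, True),
    ("Athlon64/Opteron", 128, 1024, True),
    ("Duron", 128, 64, True),
]

for name, l1, l2, exclusive in chips:
    print(name, usable_cache_kb(l1, l2, exclusive))
```

Feeding the Duron's sizes in with `exclusive=False` gives 64, which is the "backfire" case described for an inclusive design.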

Hope this cleared up more than it obscured, let me know if not.

----------

## pr0t0type

Wow, great info guys. Thanks for all the good explanations  :Smile: 

Just done an emerge world with these cflags and added 3dnow, mmx and sse to my use flags

```

-march=athlon-xp -O3 -pipe -fomit-frame-pointer -mfpmath=sse,387 -falign-functions=4 -fprefetch-loop-arrays -fmerge-all-constants -mmmx -msse -m3dnow

```

Anyone see any stupid mistakes here? 

Should find out how it runs in an hour or so. Also, am I right in thinking that the kernel doesn't use these flags, but uses its own from /usr/src/linux/Makefile? If so, am I wise to leave it alone, or should I put the optimized flags in there too?

Thanks

----------

## Gnufsh

1) leave the kernel flags alone

2) -mfpmath=sse,387 is usually slower than the default, and so is -mfpmath=sse, at least on AMD machines, which I sure hope yours is, since you're using 3dnow.

----------

## T2

Just for info: I've installed the openoffice 1.1 rc2 binary package from the official site and it's much speedier and more responsive than openoffice 1.01. So there's probably no need for recompiling here.

----------

## LinuxDolt

i've got a p3 coppermine 933 MHz...  what would be the most optimal (read: as aggressive as i can get without having too many compile probs) cflags for me?

----------

## byns

Ok, I've got a P3 Mobile. After copying and pasting from all the posts in this thread, I made these CFLAGS to squeeze the most optimization out of my CPU (without breaking exact math, btw). The machine is really slow (933 MHz on AC) so I desperately need more speed.

```

CFLAGS="-march=pentium3 -O3 -pipe -fomit-frame-pointer -mmmx -msse -mfpmath=sse -fthread-jumps -fmerge-all-constants -mno-push-args -mno-align-stringops -frename-registers -fforce-addr -frerun-cse-after-loop -frerun-loop-opt -fprefetch-loop-arrays -falign-loops=4 -falign-functions=4 -falign-jumps=4"

```

I didn't emerge -e world yet. Any comments? Redundant stuff and the likes?

----------

## guard0

here's mine

they work FINE, been using them since 1.4rc1

CFLAGS="-march=athlon-xp -O3 -pipe -msse -ffast-math -fomit-frame-pointer -mmmx -m3dnow -mfpmath=sse -Wall -fexpensive-optimizations -funroll-loops -frerun-loop-opt -fforce-addr -frerun-cse-after-loop -falign-functions=16 -falign-labels=1 -foptimize-sibling-calls -fstrength-reduce -fprefetch-loop-arrays"

i don't remember where i got some of those flags

but they are stable and fast, and i haven't noticed any loss of data or accuracy as a result of using them...

----------

## odegard

Hate to be a spoilsport, but can't too many optimizations actually ruin performance?

----------

## dalcorta

So could anyone tell me which are the best cflags for a Centrino notebook?  I searched the forums (keywords centrino or pentium-m) and read that it should be treated as either a PIII or a PIV.  So which is best?

----------

## c4Ff3In3 4ddiC+

 *odegard wrote:*   

> Hate to be a spoilsport, but can't too many optimizations actually ruin performance?

 

If you read the info pages for gcc concerning optimization flags, you'll see that even the gcc team acknowledges cases where certain optimizations may result in code that is actually slower. -funroll-loops is one optimization that has a tendency to slow some code down.

Now, for my personal experience, I've found that if I use gzip as a benchmark (yeah, I know, it is not very scientific), I will get slightly slower compression times using -march=pentium4 -O3 than if I use -march=pentium3 -O3. Also, I've found that with gzip, -march=pentium4 -O3 is slower than -march=pentium4 -O2.

Note: The differences are on the order of ~0.5 seconds when using the following command:

```
dd if=/dev/zero bs=1M count=1000 | gzip -c >/dev/null
```

----------

## irf2003

 *magnet wrote:*   

> I use the -mfpmath=sse,387 thinggy.
> 
> let's recompile the whole system, I'll post what will happend.
> 
> should I benchmark it before/after ? with glxgears maybe ?

 

I have not gone through the whole of this thread, but "-mfpmath=sse,387" is very dangerous: according to the gcc docs, the register allocator cannot deal with separate floating point units. Until the gcc developers say otherwise, one should avoid "-mfpmath=sse,387"; "-mfpmath=sse" should do for now.

hth

----------

## Daagar

Is there a replacement for the freehackers.org site, which seemed to keep a nice list of CFLAGS based on arch? freehackers.org seems to have disappeared :(

----------

## seppe

Hi, I'm rather new in CFLAGS but after I read some threads and freehackers.org I'm now using these:

```

CFLAGS="-march=pentium3 -O3 -pipe -fomit-frame-pointer -mmmx -msse -mfpmath=sse -fforce-addr -falign-functions=4 -fprefetch-loop-arrays"

```

This is my /proc/cpuinfo:

```

processor       : 0

vendor_id       : GenuineIntel

cpu family      : 6

model           : 8

model name      : Pentium III (Coppermine)

stepping        : 3

cpu MHz         : 800.265

cache size      : 256 KB

fdiv_bug        : no

hlt_bug         : no

f00f_bug        : no

coma_bug        : no

fpu             : yes

fpu_exception   : yes

cpuid level     : 2

wp              : yes

flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse

bogomips        : 1568.76

```

Can anyone verify that these are the best CFLAGS for my Pentium 3 with 800Mhz please? Thanks a lot  :Smile: 

Oh, and I once did an 'emerge -e world' after I changed my CFLAGS but it broke everything (I couldn't log in anymore etc ..), so now I'm going to just recompile the most important packages (xfree, gnome, mozilla-firefox, evolution, gaim, openoffice, abiword, ..)

----------

## FireBurn

Can I just check if anyone is using the latest GCC on gentoo, GCC 3.4.2? And can they please confirm what CFLAGS they're using, especially if they're on an athlon-xp.

I've broken my system so many times today it's unbelievable!

Mike

----------

## Gnufsh

I'm using simply:

CFLAGS="-march=athlon-xp -O2 -fomit-frame-pointer -pipe"

right now on my athlon-xp with gcc-3.4.2-r2.

----------

## Nate_S

For those of you with athlon-xp's: I've heard that -mfpmath=sse can actually slow things down.  The reason is that AMD's sse implementation is kind of weak compared to intel's (it's there so that it can run precompiled sse code); however, it has one monster of a 387 coprocessor.  (I have not tried both of them at once)

-Nate

----------

## ktm

The only way to find out which cflags are best for your cpu is to do some kind of benchmarking. In the big cflags thread, CFlags central https://forums.gentoo.org/viewtopic.php?t=5717&start=775&postdays=0&postorder=asc, someone called blackcat4 made a script to benchmark different cflags: http://blackcat.ca/dist/bench_gcc

I used the script to benchmark a Pentium II 366Mhz (old IBM Thinkpad 570) with lame (the mp3 encoder). The script compiles lame using one set of cflags, then runs lame three times converting a wav file to mp3. After this, it does the same again, just using different cflags. 

I tried about 15 different cflags, and to my surprise almost all of them just made my run time slower, including -pipe and -funroll-loops. I found that the fastest cflags for my old Pentium 2 are:

```
 -O3 -march=pentium2 -fomit-frame-pointer -ffast-math 
```

Some say that -ffast-math is a risky option, but so far everything has worked just fine.

I also tested with the -mmmx that TheCoop claims to be faster, but on my system it made lame run about 3% slower.

I'm sure most of the cflags people suggest are faster, but only on newer cpus. I'm soon going to benchmark my p4 system to find out.
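If you don't have the script to hand, the timing half of that procedure is easy to approximate. The sketch below is not blackcat4's script, just a minimal stand-in written for this post: it runs one command a few times and records the wall-clock time of each run. The command shown is a placeholder; you would substitute the binary you built under each set of cflags.

```python
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run `argv` `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Placeholder workload. Replace with the program under test, for example
    # a lame encode of the same wav file for every set of cflags.
    timings = time_command([sys.executable, "-c", "pass"])
    print("best of %d runs: %.3f s" % (len(timings), min(timings)))
```

Comparing the best (or median) of several runs rather than a single run helps keep disk caching and background noise from deciding the winner.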

----------

## chickaroo

wow i've read every post here and now i'm even more confused as to what i should use lol. i've read the gcc manual about 5 times... and also other sources. my latest CFLAGS have been:

```
CFLAGS="-O3 -march=athlon-xp -m3dnow -msse -mmmx -mfpmath=sse,387 -funroll-loops -fforce-addr -ffast-math -fprefetch-loop-arrays -pipe -ftracer -fomit-frame-pointer -finline-limit=800"
```

and

```
CFLAGS="-O3 -march=athlon-xp -m3dnow -msse -mmmx -mfpmath=sse -funroll-loops -fpeel-loops -funit-at-a-time -fforce-addr -ffast-math -fprefetch-loop-arrays -pipe -ftracer -fomit-frame-pointer -finline-limit=1200"
```

i've been curious about the -mfpmath. that's actually the keyword that brought me here. I was wondering if the Athlon XP has separate execution units for sse and 387

it says that -mfpmath=sse should produce considerably faster code, and sse,387 uses BOTH. however, does that mean that some of the code will not be optimized for sse and some only for 387? i have no idea.

 i haven't done much benchmarking yet, but i tend to agree with whoever said that the athlon xp has a hell of a 387 fpu., but i don't see how sse could slow stuff down (maybe the athlon xp has a poor sse?).  i might try -mfpmath=387 (or is that default? shouldn't hurt even if it is) or ,maybe it would be better if it uses both... 

i'm also thinking of dropping the -funroll-loops (and maybe some others?) because i do think that the major bottleneck in systems is the I/O so actually larger code may be slower. my monster of a cpu (Athlon XP 2600+ mobile barton overclocked to 2.6GHz) should be able to handle slightly less optimized code.

i'm just all confused now, especially with -falign-functions, -falign-jumps and -falign-loops.

not much info in man gcc about those. maybe someone can clear some things up for me? or give me some opinions? i think i'm gonna wait before changing cflags and doing emerge -e world. that takes about 40 hours.

----------

## Warped_Dragon

I clicked this thread looking for some ways to further optimize my CFLAGS and/or LDFLAGS. Now I'm just confused :/ Is there a list anywhere, listing CPU types and suggested flags to be used with them?

For instance, mine are:

```

CFLAGS="-march=i686 -pipe -O3 -fomit-frame-pointer"

LDFLAGS="-Wl,-z,now -Wl,-O1 -Wl,--relax -Wl,--enable-new-dtags -Wl,--sort-common -s"

```

My chip is an amd Duron. Is i686 what I should be using for march? I've seen athlon used, and athlon-xp, but can't find out what a duron should be. Oh, and to get back to the 'original' topic, my cat /proc/cpuinfo output:

```

processor       : 0

vendor_id       : AuthenticAMD

cpu family      : 6

model           : 7

model name      : AMD Duron(tm) processor

stepping        : 0

cpu MHz         : 1002.275

cache size      : 64 KB

fdiv_bug        : no

hlt_bug         : no

f00f_bug        : no

coma_bug        : no

fpu             : yes

fpu_exception   : yes

cpuid level     : 1

wp              : yes

flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow

bogomips        : 1998.84

```

So, assuming I am supposed to be using i686, I should add "-mmx -sse -mmxext -3dnowext -3dnow" to my CFLAGS?
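For reference, the gcc spellings are the ones from the first post in this thread (`-mmmx -m3dnow -msse -msse2`), not the raw cpuinfo names. A small sketch of that mapping follows; it only covers those four options, leaves every other cpuinfo flag unmapped on purpose, and `gcc_options` is an illustrative name. Whether you need the options at all when a suitable -march is set is a separate question.

```python
# Only the four options named at the start of this thread. Other cpuinfo
# flags (fxsr, mmxext, 3dnowext, ...) are deliberately left unmapped.
GCC_OPTION = {"mmx": "-mmmx", "sse": "-msse", "sse2": "-msse2", "3dnow": "-m3dnow"}


def gcc_options(cpuinfo_text):
    """Return the gcc -m options matching the first 'flags' line in the text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return [GCC_OPTION[flag] for flag in value.split() if flag in GCC_OPTION]
    return []


duron = ("flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca "
         "cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow")
print(" ".join(gcc_options(duron)))
```

On a live system the input would come from reading `/proc/cpuinfo` instead of the pasted line.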

----------

## chickaroo

well, there are different types of durons... some based on the athlon and some based on the athlon-xp. that's a 1GHz so i would be guessing it's an athlon. so i'd suggest -march=athlon

----------

## Warped_Dragon

Ah. I didn't even know what it was based on :/ Thanks  :Smile: 

----------

## Nate_S

chickaroo:

AFAIK there are separate execution units for 387 code and sse code.  Also AFAIK, the athlon does have poor sse compared to a pentium of the same class (and its 387 is better than the same pentium's).  However, which one will actually be faster really, really depends on the code in question.  For a P4, it's safe to always set it to sse, whereas for an athlon, some code may do better with it set to 387 or vice versa.   

My understanding (and I could be way off) is that using both 387 and sse optimizes the code for both, such that it can execute them both in parallel, effectively doubling your floating point processing power.  However, this often causes very weird results, and some code will crash and burn with this.  

I'd advise against dropping -funroll-loops: your startup times may get faster, as the app is loaded into memory, but running the app will most likely get slower, as once in memory you want free registers and quick execution.  

Man I wish I could find the thread I was given this advice in, as the guy there states it far more eloquently than me, but search hates me.   :Sad: 

-Nate

----------

## LockeAverame

some people never learn:

first: there are no cflags which are perfect for every program, because the compiler mostly uses heuristics for optimization decisions and gcc does not have a very good register allocator.

second: your duron is based on the morgan core (you can see it on the cpu frequency and the sse support), so use -march=athlon-xp.

third: the binaries mostly get at most about 5-10% bigger going from -O2 to -O3, which for most binaries means an increase of around 100KB. believe me, it takes longer for the hdd to seek to the chunks on the drive than to actually load them (hdds today read about 30-40MB per second, but seeking and non-sequential reads/writes are quite slow).

the problem with bigger binaries mostly lies in higher cache usage, and cache is quite limited (even though 256KB to 512KB is quite a lot).

fourth: -march activates mmx, sse and 3dnow where appropriate, so don't bother with these flags unless you don't use -march.

fifth: in nbench and freebench i get a performance increase of 3-4% with -mfpmath=sse on an athlon-xp compared with i387 (which is the default). sse only handles single-precision (32-bit) floats, not doubles (which are what is mostly used), so you mostly don't see a big improvement, and gcc-3.3 and 3.4 don't use sse in vectorized mode anyway (gcc-4.0 will). using i387 and sse in parallel gains next to nothing and is very risky; i wouldn't use it. sse2 adds double-precision (64-bit) arithmetic in its 128-bit registers; that is less precision than the i387's 80-bit maximum, but results are more consistent and it has fewer problems with exceptions. gcc devels mostly prefer sse2 as i remember, but some sse2 instructions are buggy, so hope for a fix.
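On the float versus double point: the original SSE instructions work on single-precision values, and a quick way to see what that costs in precision is to round a double to single and back. This is plain arithmetic in Python using the standard `struct` module, with `to_single` as an illustrative helper; it does not run any SSE code itself.

```python
import struct


def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


n = 16777217.0  # 2**24 + 1: exact as a double, not representable as a single
print(n, to_single(n))
print(0.1, to_single(0.1))
```

The first line shows 16777217.0 collapsing to 16777216.0 once it passes through single precision, which is why code written with doubles cannot simply be moved onto single-precision instructions.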

----------

## Warped_Dragon

1) I'm well aware of that. I wasn't asking for a 'perfect' set of flags that would give every program I run a 500% speedboost. Or even a 10% speedboost. Or 0.2%.  *reads over his first post* Wait a tic, I wasn't asking for a 'perfect' set of CFLAGS at all, just if my current ones were taking full advantage of the different optimization options supported by my CPU. I'll assume your first comment wasn't directed at me. Many apologies  :Smile: 

2) Thank you. This is the only AMD system I've ever used, all my experience is with Intel. Now I've got one suggestion for -march=athlon and one for -march=athlon-xp. Can you point me to some documentation so I can confirm/deny this myself? In fact, I'll start at the AMD site.

EDIT - Found something. Quote is from http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_1260_1202%5E1018,00.html#9947

 *Quote:*   

> 
> 
> Q:	How is the AMD Duron processor different from the AMD Athlon processor? Is the AMD Duron processor merely a stripped down version of the AMD Athlon processor?
> 
> A:	The AMD Duron processor is a derivative of the state-of-the-art AMD Athlon processor. Although the two processors are related, there are key differences in the CPUs and the platforms designed to support them, reflecting the requirements of their target markets. Specifically, the AMD Athlon processor is available for users who demand the highest level of application performance and features more full-speed, on-chip cache memory. The AMD Duron processor was designed to consume less power than the AMD Athlon processor, thereby enabling lower cost systems. Additionally, AMD Duron processor-based PCs are likely to employ lower cost memory and graphics solutions, including low cost DDR memory and Unified Memory Architecture (UMA) graphics.
> ...

 

I'm still looking for the actual tech specs to back up their claim. This system is about three years old (jan 2001). Were athlon-xp's being sold then?

EDIT #2: Found specs. No mention of a Duron being based on the athlon-xp though :/ Athlon is mentioned a few times, but it doesn't say that it's based on an athlon either. Link is http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24310.pdf if anyone cares. I'm inclined to say screw it and just use -march=athlon

3) So the increase in binary size more than offsets the speed increase in most cases?

4) Ok, so I can drop those flags and still have them activated by march. That's good.

5) 3-4% is better than nothing at all. I'll add that flag if I confirm that my duron should use -march=athlon-xp. Would it result in a similar increase on a regular athlon?
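
Point 3 can be put into rough numbers. The Python sketch below is a back-of-the-envelope estimate, not a measurement: it uses the figures quoted earlier in the thread (about 100 KB of extra binary size, 30-40 MB/s sequential read) and assumes a 10 ms average seek, which is a placeholder value and not something stated in the thread. The function name is made up.

```python
def transfer_ms(extra_kb, throughput_mb_per_s):
    """Milliseconds needed to read extra_kb at a sequential rate (1 MB = 1024 KB)."""
    return extra_kb / (throughput_mb_per_s * 1024.0) * 1000.0


ASSUMED_SEEK_MS = 10.0  # assumption for illustration, not a figure from the thread

if __name__ == "__main__":
    for rate in (30, 40):
        extra = transfer_ms(100, rate)
        print(f"{rate} MB/s: +{extra:.2f} ms to read 100 KB, "
              f"vs an assumed {ASSUMED_SEEK_MS:.0f} ms per seek")
```

On these assumptions the extra read costs a few milliseconds, less than a single seek, so load time is not where a bigger binary would hurt; cache pressure is the more plausible cost.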

----------

## Nate_S

While doing your research you'll want to look at what core your proc is based on.  I believe Palomino is the first athlon xp core, that's what's in my athlon-xp 1800; if it's not that, it's prolly an athlon.  

Yes, generally -O3 is considered by most to be faster than -Os or -O2.  Also, I think -funroll-loops (though not -funroll-all-loops) is worth it.  
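
As an illustration of what -funroll-loops does conceptually, here is a hand-unrolled loop in Python. This is only a sketch of the transformation: gcc applies it to the generated machine code, the unroll factor of 4 is arbitrary, and the function names are made up. The unrolled version does fewer loop tests per element but has a larger body, which is the size-versus-speed trade-off being discussed.

```python
def sum_rolled(values):
    """Plain loop: one loop test per element."""
    total = 0
    for v in values:
        total += v
    return total


def sum_unrolled4(values):
    """Same result, with the body repeated four times per iteration."""
    total = 0
    n = len(values)
    i = 0
    # Main unrolled part: handles elements in groups of four.
    while i + 4 <= n:
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
        i += 4
    # Remainder loop for the last 0-3 elements.
    while i < n:
        total += values[i]
        i += 1
    return total
```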

I'd like to point out that redundant flags really don't hurt anything.  

As I mentioned, whether 387 or sse is better on the athlon would depend heavily on the type of app used; I don't know that a benchmark would give a good overall estimate.  

Also, sse2 is only on P4s and 64 bit AMD chips, right?  Or did some of the later athlon-xps have it?

-Nate

----------

## chickaroo

 *Nate_S wrote:*   

> 
> 
> Also, sse2 is only on P4s and 64 bit AMD chips, right?  Or did some of the later athlon-xps have it?

 

the K8 (Athlon 64) chips were the first AMD chips to include SSE2 support.

----------

## LockeAverame

your duron has sse support, so it's a morgan core, which is derived from the palomino core but with less cache (64kb instead of 256kb L2-cache). that's the only difference (the older athlons didn't have sse, and neither did the durons of those days ^^). -march=athlon-xp works perfectly (i had this duron myself, so believe it). if you don't want to believe it then read the white papers for your stepping of the cpu (cat /proc/cpuinfo will tell you the stepping).
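
The check described above can be sketched in Python. This is a hypothetical helper, not an official tool: `parse_cpu_flags` and `guess_amd_march` are names invented here, and the mapping simply encodes the rule of thumb from this thread (a chip reporting `sse2` is K8 or later, a K7-family chip reporting `sse` is treated as Palomino/Morgan-class, anything else falls back to `-march=athlon-tbird`). It only makes sense for AMD chips of that generation.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def guess_amd_march(flags):
    """Rule of thumb from this thread, for K7/K8-era AMD chips only."""
    if "sse2" in flags:
        return "-march=athlon64"      # K8 was the first AMD core with SSE2
    if "sse" in flags:
        return "-march=athlon-xp"     # Palomino/Morgan and later K7 cores
    return "-march=athlon-tbird"      # older Athlon/Duron without SSE
```

On Linux, passing the contents of /proc/cpuinfo to `parse_cpu_flags` and the result to `guess_amd_march` would apply the rule to the running machine.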

sse2 was never available on athlon-xp and below; it was first integrated with the k8 (opteron/athlon64). if people think this is wrong then they're talking bullshit.

well, no benchmark can test every possible code snippet out there. i used nbench and freebench and both give a 3-4% increase with -mfpmath=sse and nothing higher with -mfpmath=sse,387.

only apps which use float benefit (float is rarely used in standard software, more often in games and codecs).

redundant flags don't hurt but don't do anything useful either, they only pollute your bash output during compilations ^^.

----------

## headgap

ok, my two cents. from a developer. yes, i know it says n00b, but trust me, i'm a developer :)

the nice thing about being a 'retired' developer, you get to find bugs, and not be obliged to fix them...

here's the deal with speed optimizations:

there's an old saw in the industry: programs spend 90% of their time in 10% of the code. there's no point trying to optimize the other 90% of the code, it won't get you any advantage. developers typically take the stand-point: simpler is better, and don't do any optimization until release-time.
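
The 90/10 rule of thumb can be turned into arithmetic with Amdahl's law (that framing is added here; the post does not name it): if a fraction p of the run time is made s times faster, the whole program speeds up by 1 / ((1 - p) + p / s). A minimal Python sketch, with a made-up function name:

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: speedup of the whole program when `fraction` of its
    run time (between 0 and 1) is made `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


if __name__ == "__main__":
    # Doubling the speed of the hot code that accounts for 90% of run time:
    print(f"hot path 2x faster:  {overall_speedup(0.9, 2.0):.2f}x overall")
    # Doubling the speed of everything else (the remaining 10% of run time):
    print(f"cold path 2x faster: {overall_speedup(0.1, 2.0):.2f}x overall")
```

The same formula covers programs that mostly wait on the user, disk or network: if only a small fraction of wall-clock time is CPU work, no compiler flag can move the total by much.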

when it comes to optimizing code, you can: a) use a better (faster) algorithm, b) code bits in assembler, c) do both. i once wrote a message passing queue based system entirely in pentium assembler (about 20k of asm, vs 120k of C). it ran 10x faster than the portable C code did, and tripled the development time.

if you take a really close look at the compile process for the sources, you'll notice individual packages very selectively using additional compile flags, such as glibc: -freorder-blocks. this is the developer's way of descending in a portable manner partially to the assembler level. gaming code and real-time instrumentation contain way more assembler, and device drivers are almost pure assembler: because that's where speed counts.

any program requiring feedback from the user, or that needs to hit slow storage (disk systems, networks) won't benefit significantly from speed optimization (meaning, more than 10-15% increase).

*unless*

it's development code, the bugs are still being worked out, the code base hasn't been finalized yet, so it doesn't make sense to optimize, since it might get thrown away eventually.

since gentoo is providing bleeding edge packages, using higher levels of optimization will tend to a) give that 10% increase, and b) break things.

since you're not the developer of the package, and have no idea what sort of code is being used, you can't pick the one or two special flags that would make that code really fly. the best you can hope for is apply 'everything', and cross your fingers. optimizations that work for some code, won't do a thing for the next package. but in general, you'll see some sort of improvement, on average, since the underlying code isn't already optimized.

so,

-O2 for CPUs with large L1 cache, -Os for those earlier ones (Coppermine and lower) with less L1.

-march, if you don't need portable binaries, will select the hardware-specific features appropriate to your cpu

-ffast-math at your own peril, if you do anything with double-precision floating point (financials, spreadsheets, DV, imaging, mJpegTools, et al)

and, selectively, very specific individual flags based on the nature of the source, on a per-file basis, if you know what you're trying to achieve, and need your optimization to be as platform transparent as possible.

----------

## evilshenaniganz

I've been having a helluva time with a K6-2 of mine.  I would really like to get as much optimization out of it as I can.  I notice I often have builds fail when I use

```
-march=k6-2 -O3 -pipe
```

  The error message says it's a hardware or OS-related problem... Segmentation faults are common as well.  When I back it off to 

```
-march=k6-2 -O2 -pipe
```

 these problems go away.  Here's the skinny on my CPU.  (For readability I decided to just link to a page instead of messing up the tabs posting it here in the forum.)

As you can see from the link, the chip is a k6-2 "chomper".  I googled around and didn't find much of anything.  Does anybody have any suggestions as to how I can optimize the hell out of my CFLAGS for this chip?  Here are the CFLAGS I've been using that are giving me errors, and like I said, backing it off to -O2 takes care of them:

```
CHOST="i586-pc-linux-gnu"
CFLAGS="-march=k6-2 -O3 -pipe -fomit-frame-pointer -mfpmath=387 -mmmx -m3dnow -m128bit-long-double"
CXXFLAGS="${CFLAGS}"
```

Any suggestions, feedback, or other places to look would be greatly appreciated!  Thanks!  :Smile: 

----------

## evilshenaniganz

Hey, it's me again...

I figure if I'm coming with a problem, I'd better bring something else to the table.  I noticed a lot of people are asking about resources for finding out more about CFLAGS.  Well, apart from Ye Olde CFLAGS Central, I have also found a few pages that might help someone out.  Here they are:

http://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html

http://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

HTH   :Very Happy: 

----------

## schiznik

Wow, I'm surprised no one has mentioned the gentoo-wiki page http://gentoo-wiki.com/Safe_Cflags yet.

I use 

```
CFLAGS="-march=athlon-xp -mtune=athlon-xp -O3 -pipe -fomit-frame-pointer -msse -mmmx -m3dnow"
```

 on my athlon 2000+ XP.

I'm in the process of installing a copy of my main gentoo install onto a friend's pc (Duron 800 iirc - an early Duron, NOT a Morgan-core Duron). His CFLAGS will probably be:

```
CFLAGS="-march=athlon-tbird -mtune=athlon-tbird -O3 -pipe -fomit-frame-pointer -mmmx -m3dnow"
```

I might change -O3 to -Os to save a little space on his relatively small HD.

Note: I use -mtune= instead of -mcpu= as I use an ~x86 toolchain on an x86 install (i.e. these machines are using gcc 3.4.x instead of 3.3.x) - adjust back to -mcpu= if you use an x86 toolchain.
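
The version rule in that note can be sketched as a small Python helper. It is hypothetical (the function name and the version parsing are invented here) and assumes, as the post says, that gcc 3.4 and later take -mtune= on x86 while 3.3.x and earlier use -mcpu=.

```python
def tune_flag(gcc_version, cpu):
    """Return the x86 tuning flag for a gcc version string such as '3.4.2'."""
    # Only the major and minor numbers matter for this rule.
    major, minor = (int(part) for part in gcc_version.split(".")[:2])
    option = "-mtune" if (major, minor) >= (3, 4) else "-mcpu"
    return f"{option}={cpu}"
```

For example, `tune_flag("3.3.6", "athlon-tbird")` gives the -mcpu= spelling for the older toolchain.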

Edit: Old Durons don't support sse, removed -msse (and am starting to recompile.. again...)

----------

## idahoduk

 *Gnufsh wrote:*   

> I'm using simply:
> 
> CFLAGS="-march=athlon-xp -O2 -fomit-frame-pointer -pipe"
> 
> right now on my athlon-xp with gcc-3.4.2-r2.

 

I've been trying to figure out what to set my flags to for my CPU; I have an AMD X2 3800.  I'm a little confused after reading the last seven pages.  Will this be a good option for enabling SSE, 3DNOW etc.?

CHOST="i686-pc-linux-gnu"

CFLAGS="-march=athlon64 -O3 -pipe -fomit-frame-pointer"

CXXFLAGS="${CFLAGS}"

This is from the safe CFLAGS section on the Gentoo site; there seems to be some debate as to what this enables on the CPU.  I did change the -O2 to -O3 since it's a dual core.

Thanks for the help guys, I came over to Gentoo from slack and Suse, and I've been more than impressed with the documentation and the community you guys have here.  I'm looking forward to learning more and getting to know everyone.  THANKS!!!

----------

## ebfe

moderators should delete or at least close such threads, as ALL of what has been posted here about cflags is bogus information and gives no speed increase at all. Read this last sentence again.

----------

