# [solved] Additional cflags or tweaks for piledriver cpu's?

## vexatious

Been using the following CFLAGS for my Piledriver CPU, as recommended by AMD's official GCC optimization guide (plus a couple of my own: -mfpmath=sse and -msseregparm):

```
export CFLAGS="-O2 -pipe -march=native -msse -msse2 -msse3 -msse4a -mno-3dnow -msseregparm -mfpmath=sse -fomit-frame-pointer -fopenmp -mprefer-avx128 -minline-all-stringops -fno-tree-pre -ftree-vectorize -funroll-all-loops -fprefetch-loop-arrays -mtune=bdver2"

# This flag gives "cannot compile executables" error with gcc-4.7.3.

# Didn't happen with gcc-4.7.1=WTH...

#--param prefetch-latency=300
```

Does anyone know of additional tweaks?

I'm asking because I noticed claims of up to 3x greater performance merely from using the bdver CFLAGS, according to this: http://phoronix.com/forums/showthread.php?64665-GCC-4-6-LLVM-Clang-3-0-Open64-Benchmarks

Intel also shows some other kinds of performance gains, according to this: http://software.intel.com/en-us/blogs/2012/09/26/gcc-x86-performance-hints.  Could any of these CFLAGS be applied for additional gains on Piledriver?

So, are there any more tried-and-true GCC optimizations for Bulldozer/Piledriver CPUs?

Regards

----------

## Jaglover

-omg-optimized never failed on me.

----------

## Maitreya

 *Quote:*   

> HOLY COW I'M TOTALLY GOING SO FAST OH F***

 

Just stick with native for automagic.

----------

## shazeal

 *Jaglover wrote:*   

> -omg-optimized never failed on me.

 

I lost 10 kilos following this guide, the extra pulling of hair and screaming definitely helps! Highly recommended AAAA++++

 *vexatious wrote:*   

> Does anyone know of additional tweaks? 

 

export CFLAGS="-O2 -pipe -march=native"

There, fully optimized.

----------

## Roman_Gruber

 *shazeal wrote:*   

>  *Jaglover wrote:*   -omg-optimized never failed on me. 
> 
> I lost 10 kilos following this guide, the extra pulling of hair and screaming definitely helps! Highly recommended AAAA++++
> 
>  *vexatious wrote:*   Does anyone know of additional tweaks?  
> ...

 

+1

Be careful with too many compiler flags. I used to use a few and got random compile errors, meaning the emerge command would suddenly stop with an error. If you can solve such issues, go ahead. But I think most such bug reports get flagged invalid because of too many compiler flags; that happened to me a few years back.

So in short: if you can handle such random errors, go ahead; if not, stick to plain and simple -march=native.

I highly recommend using -march=native and the stable compiler in Portage if you want a fuss-free life.

Usually those flags do not really give a big boost. I tried several times on my T9600 over a few years and gained nearly nothing. You gain a lot of bugs, that's it.

I have read that those AMD CPUs need especially fast memory, so buying the fastest DRAM in a dual-channel configuration (or whatever it's called) is a much better investment in any OS.

----------

## vexatious

No knowledge of piledriver I guess...

This isn't really a joke.  I'm building Slackware packages a lot faster than vanilla Slackware (gcc-4.8.2) with the CFLAGS I mentioned (packages get compressed with xz at -4e).  Only a couple of packages didn't like unrolled loops and/or required -fPIC, and unrolling loops is a popular tweak despite giving extremely small gains in most cases (why is that?).  The machine is more stable and responsive as well.

----------

## N8Fear

 *vexatious wrote:*   

> No knowledge of piledriver I guess...

 

No knowledge of compiler optimization I guess...

"-march=native -O2" expands to the cflags that your combination of CPU and compiler supports. There is really no need to add anything else. You could try -O3, but its more aggressive transformations have historically broken some packages.

One thing you could add is the graphite USE flag on gcc, plus "-floop-interchange -floop-strip-mine -floop-block" in your cflags after rebuilding gcc.
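For reference, a sketch of what that might look like on Gentoo (the package.use/emerge mechanics are standard Portage; whether these graphite loop flags actually help is workload-dependent, so treat this as an experiment, not a recommendation):

```shell
# Enable graphite support in gcc, then rebuild it:
#   echo "sys-devel/gcc graphite" >> /etc/portage/package.use
#   emerge --oneshot sys-devel/gcc
#
# Afterwards the graphite loop optimizations can be appended to CFLAGS:
export CFLAGS="-O2 -pipe -march=native -floop-interchange -floop-strip-mine -floop-block"
export CXXFLAGS="${CFLAGS}"
```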

----------

## shazeal

 *N8Fear wrote:*   

>  *vexatious wrote:*   No knowledge of piledriver I guess... 
> 
> No knowledge of compiler optimization I guess...

 

True, but ricers gotta rice yo! Don't be hatin on the segfaults yo!   :Laughing: 

----------

## vexatious

LOL @ shazeal.

 *N8Fear wrote:*   

>  *vexatious wrote:*   No knowledge of piledriver I guess... 
> 
> No knowledge of compiler optimization I guess...
> 
> "-march=native -O2" expands to the cflags that your combination of cpu and compiler support. There is really no need to add anything else. You could try -O3 but that doesn't guarantee that the resulting binary is semantically the same as the source that was compiled.
> ...

 

Right.  I understand a compiler is simply the most generic way to build software from generic code (mostly C).  If I really wanted speed I'd have to code in assembly AFAIK (instead of letting the compiler decide); that's a huge pain for large software packages, however.

That graphite USE flag and the other cflags are something I'll have to try.  I've heard the graphite USE flag can bring some speed gains.

Really appreciate the responses!

Regards

----------

## Roman_Gruber

 *vexatious wrote:*   

> No knowledge of piledriver I guess...
> 
>  Machine is more stable and responsive as well.

 

May I ask why it is more stable and responsive?

I ask out of curiosity; I don't intend to offend or flame.

Just curious, maybe I'll learn something.

Any proof for your statement?

I used the graphite flag a year ago and had only hassle with my ~amd64 notebook. I went back and the hassle was gone.

Isn't responsiveness about the kernel itself: how it handles tasks in the scheduler and the amount of memory in your box, combined with the timeframe, or ticks, as the kernel devs call them?

Please clarify, thanks. I ask out of curiosity, that's it.

----------

## N8Fear

 *vexatious wrote:*   

> 
> 
> Right.  I understand a compiler is simply the most generic way build to software from generic code (mostly C).  If I really wanted speed I'd have to code in assembly AFAIK (instead of letting compiler decide); that's a huge pain for large software packages however.
> 
> That graphite useflag and other cflags are something I have to try.  I've heard the graphite useflag can make some speed gains.
> ...

 

A compiler is a program that translates a higher-level language to machine language (which can be represented as assembler, since assembler maps 1:1 to machine code). Coding in assembler doesn't increase performance per se; you can easily write a program in assembly that is slow as hell. Writing in assembly allows you to optimize because you can tailor your code to a specific processor (i.e. you can optimize the use of caches by avoiding cache-line thrashing, reorder independent instructions in a way that makes the most of pipelining, or optimize the use of branch prediction).

The problem is that your piece of code would be optimized for one specific processor. It would run fast on e.g. an i7 but not necessarily on a Piledriver (due to architectural differences it would most likely not perform well). Because of that, you would not only have to write the program in assembly, you would have to create a different source for each processor it should run on (which requires intimate knowledge of the internals of each processor).

You can also optimize in C (e.g. you can walk through an array of arrays by row rather than by column, which avoids cache-line thrashing). It is also important to recognize that most code doesn't really need heavy optimization, because your processors tend to idle most of the time anyway (for big number-crunching calculations, e.g. in scientific applications, that's not necessarily true).

The next step on the ladder is to realize that modern compilers can really optimize very well (that's what you try to achieve with your heap of cflags). Compilers can, for example, unroll loops (to make better use of pipelining and branch prediction) or even eliminate loops completely if the result is static anyway (take a look at the assembly of a loop that does nothing but count to MAX_INT: with optimization enabled the compiler won't calculate anything, it will simply take MAX_INT). In fact, modern compilers optimize well enough that you need quite a bit of knowledge to beat them even if you write the assembly yourself.

This takes us to cflags: you don't get the most out of your processor by taking a bunch of them and throwing them at your program, but by making a choice that actually fits your processor architecture. If you want to achieve this, you can either take a pile of technical documentation and read enough to write optimized C or assembly yourself, take some arcane ricer flags from a forum post by someone who has hopefully done that research for you, or trust the people who know how things work. Selecting -march=native expands to a set of cflags that some wise guys deem optimal for your architecture.

You can run:

```
gcc -march=native -E -v - </dev/null 2>&1 | sed -n 's/.* -v - //p' 
```

to see how it expands for your machine. Mine, for example, is:

```
-march=corei7 -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mno-avx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mno-xsave -mno-xsaveopt --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=3072 -mtune=corei7 -fno-strict-overflow -fPIE -fstack-protector-all -fstack-check=specific
```


Adding these yourself serves no benefit, other than showing other people that you actually have no idea what these flags really do....

This expansion shows just the CPU-specific stuff; which generic optimization options are used depends on your -Ox level. You'll find an overview of what gets selected here: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html.
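Those per-level defaults can also be queried from gcc itself rather than the manual (GCC 4.6 and later support --help=optimizers; the exact list varies per version, and the diff syntax below is bash-specific):

```shell
# Print which optimization passes a given -O level turns on:
gcc -O2 -Q --help=optimizers | grep -- "enabled" | head -n 10

# Diff two levels to see exactly what -O3 adds on top of -O2
# (diff exits nonzero when the lists differ, hence the || true):
diff <(gcc -O2 -Q --help=optimizers) <(gcc -O3 -Q --help=optimizers) || true
```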

I hope this helps you (and likely others) to see that while there are benefits to optimization, it is not achieved by throwing a random bunch of cflags at your compiler, and that the best way to optimize is to trust the guys who know how to do it correctly (and who therefore choose sane but efficient defaults)....

----------

## John R. Graham

Split off after this point for really egregiously obvious Forum Guidelines violations. Vexatious, please remain civil.

- John

----------

## Jaglover

These forums are getting much more liberal than they used to be. Once I got banned from these forums for much less, and I stayed away for many years. Or is "liberal" perhaps the wrong term?

----------

## John R. Graham

 *N8Fear wrote:*   

> ...I hope that this helps you (and likely others) to see that while there are benefits in using optimization, optimization is not achieved by a random bunch of cflags that you throw into your compiler and that the best way to optimize is trusting the guys who know how to do it correctly (and therefore choose sane but efficient defaults)....

 Alas, this isn't always the whole story. When a CPU is new, -march=native sometimes doesn't know about specific features that could safely be enabled. Now, I don't personally know whether contemporary GCC is Piledriver-aware. Do you?

- John

----------

## N8Fear

Yeah - I do. Take a look at http://developer.amd.com/community/blog/2012/04/23/gcc-4-7-is-available-with-support-for-amd-opteron-6200-series-and-amd-fx-series-processors/.

So GCC 4.7.x should have Piledriver support. Generally speaking: a new processor needs a new compiler version.

----------

## John R. Graham

Thanks; good to know. However, it does not follow that all new compiler versions support all released CPUs. I think you were painting a slightly incomplete picture.

- John

----------

## N8Fear

I think in most cases the CPU vendors try to get their optimizations into GCC before the actual release of the processor (at least for AMD and Intel this will almost always be the case). You are likely correct that there may be times when the compiler doesn't fully support a CPU's features; it will normally fall back to the highest architecture it does support (e.g. core2 instead of corei7).

What holds in any case is that choosing the correct cflags (at least the more arcane ones) should be left to people who know what they are doing...

----------

## _______0

Not only do extra cflags not automagically translate into faster code; the programmer also needs to change the code to take advantage of a specific new CPU feature.

One example is Unicode decoding with SSSE3, and I've read there's now a project doing the decoding on the GPU.

Ultimately it comes down to the programmer coding it in the most optimized way.

----------

## John R. Graham

 *N8Fear wrote:*   

> I think in most cases the cpu vendors try to get the optimization into gcc before the actual release of the processor (at least for amd and intel this will most likely be nearly always the case). ...

 This emphatically did not happen with Atom. When Atom was released, -march=native did what you said and chose a safe subset, but this ignored several Atom features (already supported by GCC because they were present in other architectures) that could be enabled via additional CFLAGS.

- John

----------

## vexatious

Thanks for all the help, and I'm really sorry for giving bad feedback (especially in my deleted post about something racial).  I really do appreciate the help and think this is one of the best forums, with some of the smartest people!

God bless!

----------

