# Core 2 Duo - Merom

## d99ma

Hello,

I've just ordered a new laptop that comes with a Core 2 Duo CPU (Merom). From the threads I've found, people recommend either -march=nocona or -march=pentium-m with -msse3 for the Conroe desktop version. Which one is the best choice for a laptop Merom CPU?

Thanks

/Martin

----------

## Lloeki

depends on whether merom is architecturally closer to nocona (a netburst chip, like the pIV) or to banias/dothan (pentium m).

I bet the latter, as the pentium m was closer to the pIII than to its pIV desktop counterparts.

----------

## alechiko

 *Lloeki wrote:*   

> depends on whether merom is architecturally closer to nocona (a netburst chip, like the pIV) or to banias/dothan (pentium m).
> 
> I bet the latter, as the pentium m was closer to the pIII than to its pIV desktop counterparts.

 

Um... Yeah, but isn't the merom 64bit? Surely P-M/P-3 flags are going to ignore that 64bit goodness.

----------

## Lloeki

maybe add -m64 ? really, I don't know, I'm not that proficient in 64bit.

----------

## d99ma

64 bits is another question.

Is anyone running 64-bit linux on the merom?

Is the (small?) performance gain worth the potential problems?

/Martin

----------

## rhill

binutils newer than 2.16.91.0.7 actually support -march/tune=merom.  However, the required GCC bits [i], which also add -mmni (Merom New Instructions) haven't gone in yet, and won't until 4.3 opens for development [ii].

in the meantime, the IA-32 Intel Architecture Optimization Reference Manual definitely seems to put Core and Core 2 into the Pentium-M family.  for 64 bit, i'm not sure if -march=pentium-m -msse3 -m64 will do the job or not.  i do know that the amd64 profile will spit a big freakin annoying red warning and pause for 5 seconds every time you try emerge something with -m64 in your CFLAGS [iii].  :Evil or Very Mad:  it's possible that the amd64 profile automatically adds -m64 for you, but you'll have to check for yourself since my em64t box is booted into x86 at the moment.

[i] http://gcc.gnu.org/ml/gcc-patches/2006-02/msg01866.html

[ii] http://gcc.gnu.org/ml/gcc-patches/2006-07/msg00605.html

[iii] a properly motivated person might create an executable script in /etc/portage/postsync.d containing "rm /usr/portage/profiles/default-linux/amd64/profile.bash" (or even a sed line removing a certain flag from $BAD_FLAGS if they still wanted the "filter invalid or nonexistent flags" functionality).  this of course would be highly unsupported.  :Wink: 

----------

## d99ma

dirtyepic, thanks for the insight!

I will probably stick to 32bits and wait for proper merom support from gcc before switching to 64 bits.

----------

## darkphader

 *dirtyepic wrote:*   

> in the meantime, the IA-32 Intel Architecture Optimization Reference Manual definitely seems to put Core and Core 2 into the Pentium-M family.  for 64 bit, i'm not sure if -march=pentium-m -msse3 -m64 will do the job or not.  i do know that the amd64 profile will spit a big freakin annoying red warning and pause for 5 seconds every time you try emerge something with -m64 in your CFLAGS [iii]. :evil: it's possible that the amd64 profile automatically adds -m64 for you, but you'll have to check for yourself since my em64t box is booted into x86 at the moment.

 

Other posts I've read, plus the installation handbook state that -march=nocona is proper for EM64T users.

Sorry, don't really know if/why that info is valid for the merom.

Chris

----------

## rhill

to add one more thing to the confusion, i just stumbled over this:

 *Quote:*   

> > So, this person has a pentium m, and /proc/cpu info says his processor 
> 
> > belongs to family 6... as you can see, mine also belongs to family 6... 
> 
> > so even though the wiki stages march=prescott, what do you guys think?
> ...

 

http://article.gmane.org/gmane.comp.gcc.help/15506

And SSE is definitely the win on Core chips:

 *Quote:*   

> Core Duo Processors
> 
> On Intel Core Solo and Intel Core Duo processors, the combination of
> 
> improved decoding and micro-op fusion allows instructions which were
> ...

 

http://www.intel.com/design/pentium4/manuals/index_new.htm

i think the switch mentioned would be -mfpmath=sse
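For anyone who wants to check what their own chip reports, here is a minimal sketch (not from the posts quoted above) that pulls the family, model and a few instruction-set flags out of /proc/cpuinfo. `cpu_summary` is a made-up helper name; the flag names are the ones the Linux kernel prints: SSE3 shows up as `pni`, SSSE3 as `ssse3`, and 64-bit capability as `lm`.

```shell
# cpu_summary [FILE]: print family, model and a few ISA flags of the first CPU.
# FILE defaults to /proc/cpuinfo (Linux); pass another path to inspect a saved copy.
cpu_summary() {
    awk -F': *' '
        /^cpu family/   && fam == ""   { fam = $2 }
        /^model[ \t]*:/ && model == "" { model = $2 }
        /^flags/        && flags == "" { flags = " " $2 " " }
        END {
            printf "family=%s model=%s\n", fam, model
            # pni = SSE3, ssse3 = SSSE3, lm = long mode (64-bit capable)
            n = split("sse2 pni ssse3 lm", want, " ")
            for (i = 1; i <= n; i++)
                printf "%s=%s\n", want[i], (index(flags, " " want[i] " ") ? "yes" : "no")
        }' "${1:-/proc/cpuinfo}"
}
```

Calling `cpu_summary` with no argument reads the running system. A "family 6" result on its own does not settle the -march question, since the Pentium III, Pentium M and Core chips all report it.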

----------

## irondog

Dirtyepic, should Core 2 Duo users use -mtune=merom if it were already available? Or is "nocona" more appropriate in some situations?

----------

## ECantona

So, in the end, which cflags can we safely use with a merom processor? I'm a bit confused   :Rolling Eyes: 

and what about -march=prescott?

----------

## ECantona

and also, which processor family should we choose in kernel configuration for a merom processor?

----------

## Lloeki

 *Quote:*   

> Or is "nocona" more appropriate in some situations?
> 
> and what about -march prescott?

 

damn, how many times does this have to be said?

nocona and prescott are architecturally totally different from merom.

this is like using pentium4 for pentium-m, you will lose performance. things forked at pentium3. so, when pentium-m didn't exist, pentium3 was the flag to use. the closest thing to merom is core duo (which is 2 pentium-m cores). 

you need sse3? add -msse3.

you really want 64bit? add -m64 to cflags. gcc manual:

 *Quote:*   

> 
> 
> -m32
> 
> -m64
> ...

 

don't be fooled by 'AMD', this is really the flag to use. EM64T and AMD64 are (mostly) compatible, and different than IA64 (Xeon).

I'll receive my merom in 2 to 5 days, and anyway, I couldn't care less about 64 bits. forget the videogame-console marketing: 64bit is anything but 2*32bit. 64 bit is a tad slower and takes anywhere from a tad to a whole lot more space (e.g. the L2 cache will be twice as filled in 64bit mode as in 32bit, so a 64bit 4MB cache is effectively like a 32bit 2MB cache), and I don't need the extra precision (which is, in the end, less precision, as x87 computes on 80bit, while 64bit instructions (like sse) compute on... 64bit) or the addressing ability (I have 'only' 1GB of ram). and I'll save myself some chroot/emul-linux/32v64 binary headaches (yes, I do use closed source apps, and I need them).

see ya in 2038. till then, 64 bit is just 'try, and adapt before convert'.

----------

## Mad Merlin

 *Lloeki wrote:*   

> 
> 
>  *Quote:*   
> 
> -m32
> ...

 

Actually, IA64 [1] is Itanium (2), not Xeon [2]. Xeon:Pentium::Opteron:Athlon.

[1] http://en.wikipedia.org/wiki/IA-64

[2] http://en.wikipedia.org/wiki/Xeon

----------

## Lloeki

Mad Merlin, thanks for the correction  :Smile: 

----------

## lcj

I'm using the following flags on a Core 2 Duo T7200: 

```
CFLAGS="-O2 -march=nocona -mtune=nocona -msse3 -mfpmath=sse -pipe -fomit-frame-pointer"
```

I moved from pentium-m a day ago, and I'm currently running recompiled XGL and firefox with no problems. But the 2.6.18 kernel fails randomly during boot when disk access is heavy, I need to research that... Not sure if it's related to the flags

----------

## Lloeki

of course you won't encounter any obvious problem (crash, 'illegal instruction', etc...). what you may encounter is performance problems. 

pipeline length, l1 cache handling, design philosophy, etc... see here (and links) why merom is closer to a pentium-m than to a nocona. and of course, you won't get 100% out of your merom with -march=pentium-m, but where you'd get 80% with p-m, you will only get 40% with nocona (dummy figures, but hey, expect something in that range with a "pipeline [...] less than half of Prescott's"). you'd be better off using p3 altogether.

anyway, what we're talking about is microsecond improvements, so you won't see much difference in the end, and as you will eventually rebuild everything once -march=merom is out, you might as well wait.

as a side note, mtune is redundant, as march implies enhancements of mtune, plus specifics. I don't know what precedence gcc gives to each one, but it may as well disable your glorious march optimisations, in favor of safer mtune ones.

----------

## lcj

So to sum up, I'd need to run some benchmarks to make sure that pentium-m is better than nocona for the time being...

----------

## Lloeki

no, pentium-m (predecessor of Core arch) has a 98% chance of being faster than nocona (netburst arch).

from link:

 *Quote:*   

> Intel has replaced NetBurst with the Intel Core microarchitecture, released in July 2006, which is more directly derived from 1995's Pentium Pro than it is from NetBurst.

 

----------

## lcj

Hmmm... I've compiled gimp with gcc (4.1.1, unstable Gentoo), first tuned for nocona and then with only the pentium-m flags, and frankly it looks like saving a 4096x4096x24 PNG file is noticeably faster with the nocona switch than with pentium-m on a Core 2 Duo T7200. Judging from your discussion here I expected the opposite.

----------

## Lloeki

I fail to see how writing a file to disk can be a benchmark of cpu performance.

plus in this case it certainly relies on at least gtk and glibc, and maybe some libs for png conversion, so these should have to be rebuilt too.

benchmarking such things is really not easy. at all.

----------

## lcj

Well, given the fact that the file is buffered completely (no actual disk access), it's just pure CPU power used to compress the bitmap. Sure, benchmarking is not easy, but since the difference is noticeable I need to check kernel compilation times. Anybody else doing such experiments?

----------

## rhill

 *Lloeki wrote:*   

> this is like using pentium4 for pentium-m, you will lose performance. things forked at pentium3. so, when pentium-m didn't exist, pentium3 was the flag to use. the closest thing to merom is core duo (which is 2 pentium-m cores).

 

first, please post some numbers to back up your statements.  second, you're missing the big picture.  it's NOT a pentium-m microarch.  they didn't "fork" anything.  it's similar in design philosophy, and shares a lot in common with that CPU.  but there are major differences.  see above for just a few examples.  i personally don't know one way or another.  i've asked on the gcc mailing list but haven't received a reply.  i use -march=prescott, others can use whatever they want, but i refuse to recommend anything without seeing the numbers first.

 *Quote:*   

> you really want 64bit? add -m64 to cflags.

 

no, you can't do that without running a 64bit multilib portage profile.  you need certain libraries to be 32bit and others to be 64bit.  forcing -m64 will break things.  it may be possible to run an amd64 profile with -march=pentium-m, but i really don't play with the amd64 toolchain enough to know.
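A quick way to see what a given toolchain targets by default, instead of forcing -m64 by hand, is to ask the compiler for its target triplet. This is only a sketch: `default_bits` is a hypothetical helper, and it assumes a gcc-compatible driver that understands `-dumpmachine` (gcc does). On an amd64 profile the triplet starts with `x86_64`, so 64-bit code is already the default there.

```shell
# default_bits [CC]: report the word size of the compiler's default x86 target.
# CC defaults to gcc; any driver that supports -dumpmachine should work.
default_bits() {
    cc="${1:-gcc}"
    triplet="$("$cc" -dumpmachine 2>/dev/null || true)"
    case "$triplet" in
        x86_64-*) echo 64 ;;        # e.g. x86_64-pc-linux-gnu
        i?86-*)   echo 32 ;;        # e.g. i686-pc-linux-gnu
        *)        echo unknown ;;   # other architecture, or compiler not found
    esac
}
```

`default_bits` with no argument checks the system gcc; passing another driver name checks that one instead.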

 *Quote:*   

> I'll receive my merom in 2 to 5 days, and anyway, I couldn't care less about 64 bits. forget the videogame-console marketing: 64bit is anything but 2*32bit. 64 bit is a tad slower and takes anywhere from a tad to a whole lot more space (e.g. the L2 cache will be twice as filled in 64bit mode as in 32bit, so a 64bit 4MB cache is effectively like a 32bit 2MB cache), and I don't need the extra precision (which is, in the end, less precision, as x87 computes on 80bit, while 64bit instructions (like sse) compute on... 64bit) or the addressing ability (I have 'only' 1GB of ram).

 

huh?

----------

## Lloeki

first things first, I never intended to provide the Absolute Truth About Everything. I gather elements which I find relevant and put them up for discussion, and readily accept any correction  :Smile: 

if by numbers you mean benchmarks, I can't provide benchmarks  :Wink:  because:

 *Quote:*   

> benchmarking such things is really not easy. at all.

 

and

 *Quote:*   

> I'll receive my merom in 2 to 5 days

 

as numbers, the sole numbers I have are in the links provided:

 *Quote:*   

> Core's execution unit is 4-issues wide, compared to the 3-issue cores of P6, P6-M (Banias, Dothan, and Yonah), and NetBurst microarchitectures

 

so p-m and prescott are even here, and code optimized for a 4-issue core won't be generated until merom support arrives in gcc. I'm uncertain whether gcc optimizes code for such a feature at all.

 *Quote:*   

> The pipeline is 14 stages long

 vs *Quote:*   

> The Prescott architecture, the last core of the Pentium 4, has a 31 stage pipeline

 

optimizing code for a 31-stage pipeline and feeding it to a 14-stage pipeline is certainly insane throughput-wise. pipeline techniques like branch prediction will certainly be affected. if I'm not mistaken, gcc does that kind of code optimization.

 *Quote:*   

> The Prescott was produced [...] addition of an even larger cache (from 512KB in the Northwood to 1MB, and later 2MB)

 

merom has 4MB, so there will be a net loss here. if I'm not mistaken, gcc does that kind of code optimization too.

 *Quote:*   

> no, you can't do that without running a 64bit multilib portage profile. 

 

of course you can't  :Smile:  the point was to show how to 'manually' generate EM64T instructions without -march.

 *Quote:*   

> but there are major differences

 

you're right, the biggest one being the presence of two cores, and linked-l1/shared-l2 cache handling, which by itself justifies a new march.

 *Quote:*   

> huh?

 

64bit interest is in:

- computing twice the precision at same speed

- handling and addressing long long directly

this has the advantage of:

- handling >4Gb ram efficiently

- handling >1Gb ram very efficiently

- handling big files

- number-crunching apps, where higher precision will come at no performance cost

- save us from a 2038 blackout  :Wink: 

64bit drawback is:

- it takes twice as much space as 32bit

which has implications (not necessarily by a factor of 2) on generated code size, l1/l2 cache usage, ram usage, and so on. lots of (more or less arguable) benchmarks are available. I read a very accurate one in that mass but I can't find it anymore.

so what I meant is, for now, I'll play around with 64bit, but I'll install and run a 32bit gentoo.

again, that's what I gathered from the net, mixed with personal knowledge, and concluded. I readily accept any constructive criticism, I am always happy to learn more  :Smile: 

----------

## xentric

I have the E6300 Core2 Duo (Allendale) in my system.

What's best to be used as "Processor Family" when configuring my kernel, Pentium-M or Pentium-4?

And does this processor support "CPU frequency scaling" with Intel Enhanced Speedstep or Intel Pentium-4 clock modulation?

----------

## Lloeki

concerning the family to choose, the same discussion as before should apply (that is, the ideal case is -march=conroe, which doesn't exist yet), though if you want a 64bit kernel I can't tell you exactly what to do.

concerning frequency scaling:

this shows Conroe seems to support EIST (Enhanced Intel SpeedStep Technology), see *note about availability down the page though.

p4 clockmod is a no-go, but you could try the enhanced-speedstep module. once loaded you should have the usual /sys/devices/system/cpu/cpu*/cpufreq/ available.
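Once the module is loaded, something like the following sketch can confirm that every core picked up a cpufreq driver. `show_cpufreq` is a hypothetical helper; `scaling_driver`, `scaling_governor` and `scaling_cur_freq` are the standard cpufreq sysfs entries, and the optional argument exists only so the function can be pointed at a different directory tree.

```shell
# show_cpufreq [ROOT]: print driver, governor and current frequency per CPU.
# ROOT defaults to the usual sysfs location.
show_cpufreq() {
    root="${1:-/sys/devices/system/cpu}"
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] || continue          # no cpufreq driver bound to this CPU
        cpu="${dir%/cpufreq}"              # strip the trailing /cpufreq
        cpu="${cpu##*/}"                   # keep only the cpuN part
        printf '%s: driver=%s governor=%s cur=%s kHz\n' "$cpu" \
            "$(cat "$dir/scaling_driver" 2>/dev/null)" \
            "$(cat "$dir/scaling_governor" 2>/dev/null)" \
            "$(cat "$dir/scaling_cur_freq" 2>/dev/null)"
    done
}
```

If nothing is printed, no cpufreq driver is active, which usually means the module did not load or the BIOS has SpeedStep disabled.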

----------

## lcj

Choose Intel Enhanced SpeedStep. Works fine on both cores, with either cpufreqd or the gnome applet.

----------

## rhill

ok, i did one simple c++ benchmark using tramp3d-v4.  keep in mind it's just one benchmark.

the system used was a Toshiba Satellite A100 laptop with a Core Duo T2300 @ 1.66GHz (Yonah), 2MiB shared L2 cache, and 1GiB of memory.  the GCC version used was 4.1-branch svn built yesterday.

-O2 -march=prescott -fomit-frame-pointer -pipe

```
dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2 -march=prescott -fomit-frame-pointer -pipe -Dleafify=flatten tramp3d-v4.cpp  -o tramp3d-v4-prescott

95.45user 0.84system 1:35.69elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k

0inputs+0outputs (0major+202080minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-prescott -n 25 --cartvis 1.0 0.0 --rhomin 1e-8

Using

  using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]

  solving eeq

  time increments from [0, 1.79769e+308], cfl 0.5

  starting at t = 0, i = 1

  cell physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]

  face  physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]

  periodic boundaries in X Y Z

i = 1    t = 0.00209225  dt = 0.00209225 (0.07124s/it)

i = 2    t = 0.00410537  dt = 0.00201312 (0.946142s/it)

i = 3    t = 0.00603889  dt = 0.00193352 (0.966466s/it)

i = 4    t = 0.00794139  dt = 0.00190251 (0.975241s/it)

i = 5    t = 0.00984636  dt = 0.00190497 (0.97465s/it)

i = 6    t = 0.0117508   dt = 0.00190449 (0.985882s/it)

i = 7    t = 0.013681    dt = 0.00193011 (1.0047s/it)

i = 8    t = 0.0156598   dt = 0.0019788 (1.00467s/it)

i = 9    t = 0.0176706   dt = 0.00201081 (1.00171s/it)

i = 10   t = 0.0197364   dt = 0.0020658 (1.0184s/it)

i = 11   t = 0.0218716   dt = 0.0021352 (1.01445s/it)

i = 12   t = 0.0240721   dt = 0.00220057 (1.00954s/it)

i = 13   t = 0.0263471   dt = 0.002275 (1.01139s/it)

i = 14   t = 0.0287159   dt = 0.00236875 (1.01714s/it)

i = 15   t = 0.0311533   dt = 0.00243738 (1.01269s/it)

i = 16   t = 0.0336768   dt = 0.0025235 (1.01118s/it)

i = 17   t = 0.0362863   dt = 0.00260952 (1.00748s/it)

i = 18   t = 0.0389715   dt = 0.00268521 (1.00433s/it)

i = 19   t = 0.0417381   dt = 0.00276665 (1.00053s/it)

i = 20   t = 0.0445873   dt = 0.00284919 (1.00177s/it)

i = 21   t = 0.0475216   dt = 0.0029343 (0.989871s/it)

i = 22   t = 0.0505258   dt = 0.00300413 (0.997915s/it)

i = 23   t = 0.0535938   dt = 0.00306807 (0.98717s/it)

i = 24   t = 0.0567043   dt = 0.0031105 (0.989589s/it)

i = 25   t = 0.0598233   dt = 0.00311892 (0.987146s/it)

Time spent in iteration: 23.9913

Correctness:

        sum(rh) difference = 1.45519e-11

        sum(vx) = -0.242582

        sum(vy) = -0.295116

        sum(vz) = -0.335474

        sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-prescott

Checking vendor_id string... GenuineIntel

Disassembling tramp3d-v4-prescott, please wait...

i486:    0 i586:    0 ppro:  130 mmx:    0 sse:    0 sse2:    0 sse3:    2

tramp3d-v4-prescott will run on Pentium IV (pentium4) w/ SSE3 or higher processor.
```

-O2 -march=pentium-m -fomit-frame-pointer -pipe

```
dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2 -march=pentium-m -fomit-frame-pointer -pipe -Dleafify=flatten tramp3d-v4.cpp  -o tramp3d-v4-pentiumm-plain

97.74user 0.74system 1:38.47elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k

0inputs+0outputs (11major+200253minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-pentiumm-plain -n 25 --cartvis 1.0 0.0 --rhomin 1e-8

Using

  using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]

  solving eeq

  time increments from [0, 1.79769e+308], cfl 0.5

  starting at t = 0, i = 1

  cell physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]

  face  physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]

  periodic boundaries in X Y Z

i = 1    t = 0.00209225  dt = 0.00209225 (0.0692961s/it)

i = 2    t = 0.00410537  dt = 0.00201312 (0.992859s/it)

i = 3    t = 0.00603889  dt = 0.00193352 (1.0033s/it)

i = 4    t = 0.00794139  dt = 0.00190251 (0.975363s/it)

i = 5    t = 0.00984636  dt = 0.00190497 (0.98926s/it)

i = 6    t = 0.0117508   dt = 0.00190449 (0.986304s/it)

i = 7    t = 0.013681    dt = 0.00193011 (0.997433s/it)

i = 8    t = 0.0156598   dt = 0.0019788 (0.99804s/it)

i = 9    t = 0.0176706   dt = 0.00201081 (1.00585s/it)

i = 10   t = 0.0197364   dt = 0.0020658 (1.00463s/it)

i = 11   t = 0.0218716   dt = 0.0021352 (1.01035s/it)

i = 12   t = 0.0240721   dt = 0.00220057 (1.00643s/it)

i = 13   t = 0.0263471   dt = 0.002275 (1.00908s/it)

i = 14   t = 0.0287159   dt = 0.00236875 (1.00359s/it)

i = 15   t = 0.0311533   dt = 0.00243738 (1.00683s/it)

i = 16   t = 0.0336768   dt = 0.0025235 (1.0018s/it)

i = 17   t = 0.0362863   dt = 0.00260952 (1.00395s/it)

i = 18   t = 0.0389715   dt = 0.00268521 (0.994894s/it)

i = 19   t = 0.0417381   dt = 0.00276665 (0.995252s/it)

i = 20   t = 0.0445873   dt = 0.00284919 (0.992024s/it)

i = 21   t = 0.0475216   dt = 0.0029343 (0.989914s/it)

i = 22   t = 0.0505258   dt = 0.00300413 (0.984155s/it)

i = 23   t = 0.0535938   dt = 0.00306807 (0.986609s/it)

i = 24   t = 0.0567043   dt = 0.0031105 (0.981239s/it)

i = 25   t = 0.0598233   dt = 0.00311892 (0.986686s/it)

Time spent in iteration: 23.9751

Correctness:

        sum(rh) difference = 1.45519e-11

        sum(vx) = -0.242582

        sum(vy) = -0.295116

        sum(vz) = -0.335474

        sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-pentiumm-plain

Checking vendor_id string... GenuineIntel

Disassembling tramp3d-v4-pentiumm-plain, please wait...

i486:    0 i586:    0 ppro:  135 mmx:    0 sse:    0 sse2:    4 sse3:    0

tramp3d-v4-pentiumm-plain will run on Pentium IV (pentium4) or higher processor.
```

-O2 -march=pentium-m -msse3 -fomit-frame-pointer -pipe

```
dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2 -march=pentium-m -msse3 -fomit-frame-pointer -pipe -Dleafify=flatten tramp3d-v4.cpp  -o tramp3d-v4-pentiumm

97.73user 1.01system 1:38.05elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k

0inputs+0outputs (0major+197280minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-pentiumm -n 25 --cartvis 1.0 0.0 --rhomin 1e-8

Using

  using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]

  solving eeq

  time increments from [0, 1.79769e+308], cfl 0.5

  starting at t = 0, i = 1

  cell physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]

  face  physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]

  periodic boundaries in X Y Z

i = 1    t = 0.00209225  dt = 0.00209225 (0.069342s/it)

i = 2    t = 0.00410537  dt = 0.00201312 (0.968165s/it)

i = 3    t = 0.00603889  dt = 0.00193352 (0.985737s/it)

i = 4    t = 0.00794139  dt = 0.00190251 (0.999364s/it)

i = 5    t = 0.00984636  dt = 0.00190497 (1.01105s/it)

i = 6    t = 0.0117508   dt = 0.00190449 (1.01161s/it)

i = 7    t = 0.013681    dt = 0.00193011 (1.02449s/it)

i = 8    t = 0.0156598   dt = 0.0019788 (1.02412s/it)

i = 9    t = 0.0176706   dt = 0.00201081 (1.02851s/it)

i = 10   t = 0.0197364   dt = 0.0020658 (1.02592s/it)

i = 11   t = 0.0218716   dt = 0.0021352 (1.03424s/it)

i = 12   t = 0.0240721   dt = 0.00220057 (1.0353s/it)

i = 13   t = 0.0263471   dt = 0.002275 (1.03373s/it)

i = 14   t = 0.0287159   dt = 0.00236875 (1.03266s/it)

i = 15   t = 0.0311533   dt = 0.00243738 (1.03526s/it)

i = 16   t = 0.0336768   dt = 0.0025235 (1.02011s/it)

i = 17   t = 0.0362863   dt = 0.00260952 (1.0232s/it)

i = 18   t = 0.0389715   dt = 0.00268521 (1.02476s/it)

i = 19   t = 0.0417381   dt = 0.00276665 (1.0153s/it)

i = 20   t = 0.0445873   dt = 0.00284919 (1.00431s/it)

i = 21   t = 0.0475216   dt = 0.0029343 (1.00313s/it)

i = 22   t = 0.0505258   dt = 0.00300413 (0.989761s/it)

i = 23   t = 0.0535938   dt = 0.00306807 (0.99909s/it)

i = 24   t = 0.0567043   dt = 0.0031105 (0.989536s/it)

i = 25   t = 0.0598233   dt = 0.00311892 (0.996134s/it)

Time spent in iteration: 24.3848

Correctness:

        sum(rh) difference = 1.45519e-11

        sum(vx) = -0.242582

        sum(vy) = -0.295116

        sum(vz) = -0.335474

        sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-pentiumm

Checking vendor_id string... GenuineIntel

Disassembling tramp3d-v4-pentiumm, please wait...

i486:    0 i586:    0 ppro:  135 mmx:    0 sse:    0 sse2:    0 sse3:    2

tramp3d-v4-pentiumm will run on Pentium IV (pentium4) w/ SSE3 or higher processor.
```

-O2 -march=pentium-m -msse3 -mfpmath=sse -fomit-frame-pointer -pipe

```
dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2 -march=pentium-m -msse3 -mfpmath=sse -fomit-frame-pointer -pipe -Dleafify=flatten tramp3d-v4.cpp  -o tramp3d-v4-pentiumm-sse

98.40user 0.94system 1:39.15elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k

0inputs+0outputs (3major+198438minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-pentiumm-sse -n 25 --cartvis 1.0 0.0 --rhomin 1e-8

Using

  using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]

  solving eeq

  time increments from [0, 1.79769e+308], cfl 0.5

  starting at t = 0, i = 1

  cell physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]

  face  physical/total domain [0:62:1,0:62:1,0:62:1], [-2:64:1,-2:64:1,-2:64:1]

  periodic boundaries in X Y Z

i = 1    t = 0.00209225  dt = 0.00209225 (0.0617449s/it)

i = 2    t = 0.00410537  dt = 0.00201312 (0.897831s/it)

i = 3    t = 0.00603889  dt = 0.00193352 (0.964484s/it)

i = 4    t = 0.00794139  dt = 0.00190251 (0.94189s/it)

i = 5    t = 0.00984636  dt = 0.00190497 (0.972172s/it)

i = 6    t = 0.0117508   dt = 0.00190449 (0.973818s/it)

i = 7    t = 0.013681    dt = 0.00193011 (0.984364s/it)

i = 8    t = 0.0156598   dt = 0.0019788 (0.988743s/it)

i = 9    t = 0.0176706   dt = 0.00201081 (0.996885s/it)

i = 10   t = 0.0197364   dt = 0.0020658 (0.997118s/it)

i = 11   t = 0.0218716   dt = 0.0021352 (1.00016s/it)

i = 12   t = 0.0240721   dt = 0.00220057 (0.99685s/it)

i = 13   t = 0.0263471   dt = 0.002275 (0.998231s/it)

i = 14   t = 0.0287159   dt = 0.00236875 (1.00025s/it)

i = 15   t = 0.0311533   dt = 0.00243738 (0.987068s/it)

i = 16   t = 0.0336768   dt = 0.0025235 (0.981898s/it)

i = 17   t = 0.0362863   dt = 0.00260952 (0.990963s/it)

i = 18   t = 0.0389715   dt = 0.00268521 (0.986071s/it)

i = 19   t = 0.0417381   dt = 0.00276665 (0.980461s/it)

i = 20   t = 0.0445873   dt = 0.00284919 (0.982345s/it)

i = 21   t = 0.0475216   dt = 0.0029343 (1.00055s/it)

i = 22   t = 0.0505258   dt = 0.00300413 (0.995297s/it)

i = 23   t = 0.0535938   dt = 0.00306807 (1.00189s/it)

i = 24   t = 0.0567043   dt = 0.0031105 (1.00527s/it)

i = 25   t = 0.0598233   dt = 0.00311892 (1.01299s/it)

Time spent in iteration: 23.6994

Correctness:

        sum(rh) difference = 1.28966e-08

        sum(vx) = -0.242582

        sum(vy) = -0.295116

        sum(vz) = -0.335474

        sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-pentiumm-sse

Checking vendor_id string... GenuineIntel

Disassembling tramp3d-v4-pentiumm-sse, please wait...

i486:    0 i586:    0 ppro:   84 mmx:   44 sse:    0 sse2: 3089 sse3:    0

tramp3d-v4-pentiumm-sse will run on Pentium IV (pentium4) or higher processor.
```

Keep in mind that anything that does strip-flags (ie. GCC, glibc, kernel, etc.) will remove both -msse3 and -mfpmath from your C[XX]FLAGS

Very little difference in runtimes, maybe half a second, and next to no difference in compile time.  Surprisingly, -O2 -march=pentium-m -msse3 -fomit-frame-pointer -pipe was the slowest.  I reran the test to be sure and it was slightly worse (24.5397s) than the original run.

It also appears -mfpmath=sse does not generate sse3 instructions.
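To put those iteration times in proportion, here is a small sketch (not part of the original run) that expresses each result as a percentage difference from the -march=prescott time. `rel_diff` is a hypothetical helper, and the numbers are copied from the "Time spent in iteration" lines above.

```shell
# rel_diff BASE VALUE: percentage by which VALUE differs from BASE (signed).
rel_diff() {
    awk -v base="$1" -v val="$2" 'BEGIN { printf "%+.2f%%\n", (val - base) / base * 100 }'
}

# Iteration times in seconds, as reported above; prescott is the baseline.
base=23.9913
for entry in \
    pentium-m:23.9751 \
    pentium-m+msse3:24.3848 \
    pentium-m+msse3+mfpmath=sse:23.6994
do
    # Text before the colon is the label, text after it is the time.
    printf '%-30s %s\n' "${entry%%:*}" "$(rel_diff "$base" "${entry##*:}")"
done
```

All three land within a couple of percent of the baseline, which is in the same range as the run-to-run variation seen when the -msse3 test was repeated.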

----------

## Lloeki

hmm, some data  :Smile: 

as you mentioned before, there is some interest for Core Duo (yonah) to add -mfpmath=sse, as -march=pentium-m alone will by default favor x87 for performance:

 *Quote:*   

> -march=pentium-m prefers x87 over sse scalar code, because pentium-m can
> 
> decode sse at only half the rate of x87. You should see the speed
> 
> advantage clearly on pentium-m, presumably not on Core Duo.

 

so using -mfpmath=sse takes advantage of a full-rate sse on Core Duo, which is observed.

 *Quote:*   

>  It also appears -mfpmath=sse does not generate sse3 instructions.

 

this one is really interesting, too.

I have received my merom, but unfortunately at work a project is nearing completion (or should I say its deadline), so I don't even know if I'll have time to set it up, let alone benchmark it.

----------

## wu-s

 *Lloeki wrote:*   

> as a side note, mtune is redundant, as march implies enhancements of mtune, plus specifics. I don't know what precedence gcc gives to each one, but it may as well disable your glorious march optimisations, in favor of safer mtune ones.

 

Maybe just to stress that point. I think one has to distinguish between those two options. Carefully read the third paragraph for the "generic" cpu-type:

The gcc-manpage tells us:

 *Quote:*   

>        Intel 386 and AMD x86-64 Options
> 
>        These -m options are defined for the i386 and x86-64 family of comput-
> 
>        ers:
> ...

 

So here is my interpretation of the options' semantics: "-march" selects the assembler instructions available for compiling, e.g. gcc _can_ use sse3 instructions if you select nocona. In contrast, "-mtune" is used to optimize the assembly for the given cpu-type within the instructions allowed by "-march".

It's trivial that "-march" implies "-mtune" if not otherwise stated. _However_ you can optimize for a different cpu-type with the allowed instruction set from "-march", can't you?

So

 *Quote:*   

> CFLAGS="-march=nocona -mtune=pentium-m -O2 -pipe"

 

as stated in http://gentoo-wiki.com/Safe_Cflags#Intel_Core_2_Solo.2FDuo_.28Allendale.2C_Conroe.2C_Merom.29 seems to be promising, right?
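As a toy model of that reading (an illustration of the rule in the manpage, not of GCC's internals): -march picks the instruction set and, unless -mtune is given explicitly, also the tuning target. `effective_targets` is a hypothetical helper that only parses a flag string; it does not call the compiler.

```shell
# effective_targets "FLAGS": print which CPU selects the instruction set (isa)
# and which CPU the code is scheduled for (tune), following the manpage rule.
effective_targets() {
    arch=""
    tune=""
    for flag in $1; do                  # intentional word splitting on the flag string
        case "$flag" in
            -march=*) arch="${flag#-march=}" ;;
            -mtune=*) tune="${flag#-mtune=}" ;;
        esac
    done
    # -march=X implies -mtune=X only when no explicit -mtune was given.
    [ -n "$tune" ] || tune="$arch"
    printf 'isa=%s tune=%s\n' "${arch:-default}" "${tune:-default}"
}

effective_targets "-march=nocona -mtune=pentium-m -O2 -pipe"
effective_targets "-march=nocona -O2 -pipe"
```

With both flags present the two targets differ; with -march alone they coincide, which is why repeating the same value in -mtune is redundant but harmless.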

-dirtyepic: Maybe you can run some benchmarks with that setting?

Cheers,

Sven

----------

## magoscuro

any benchmark?

----------

## wu-s

The benchmarks by Dirtyepic couldn't disprove the common thesis that the CFLAGS are not that crucial for overall system performance. I will receive my conroe box at the end of the week. Compared to my current Athlon Thunderbird 1.3GHz the performance increase will be amazing whatever gcc options are set.

A much more fundamental decision is between Gentoo/x86 and Gentoo/AMD64, which has also been touched on in this thread. Maybe http://www.linuxhardware.org/article.pl?sid=06/08/22/0415251 is a first source of information. It's a comparison between "-march=pentium-m -msse3 -O2 -pipe" on 32bit Linux and "-march=nocona -O2 -pipe" on 64bit.

To sum it up, Conroe performs pretty well under 64bit Linux. I think you should give it a try. The issues with 32bit browser plugins and video codecs seem to be manageable.

wu

----------

## amattas

I agree with the previous poster, it does run nicely under 64 bit mode. Now the 965G motherboard on the other hand is a treat in itself to get running. It took much patching of the kernel, and ~AMD64 drivers to get all the hardware working

----------

## rmh3093

 *xentric wrote:*   

> I have the E6300 Core2 Duo (Allendale) in my system.
> 
> What's best to be used as "Processor Family" when configuring my kernel, Pentium-M or Pentium-4?
> 
> And does this processor support "CPU frequency scaling" with Intel Enhanced Speedstep or Intel Pentium-4 clock modulation?

 

enhanced speedstep is the best choice, but ACPI P-states would also work. not pentium-4 clock modulation: p4 clock modulation only throttles the clock, while enhanced speedstep lowers the frequency together with the voltage, which gives better power saving

----------

## irondog

I also did some benches of my CORE 2 Duo machine (Gentoo x86).

I did this in a tmpfs:

rm -f rand.gz && time gzip -c9 rand > rand.gz

-O2 -march=i686 -fomit-frame-pointer -pipe

```

real    0m22.760s

user    0m22.311s

sys     0m0.443s

```

-O2 -march=nocona -fomit-frame-pointer -pipe

```

real    0m26.353s

user    0m25.703s

sys     0m0.611s

```

-O2 -march=pentium-m -fomit-frame-pointer -pipe

```

real    0m22.796s

user    0m22.332s

sys     0m0.459s

```

-O2 -march=athlon-xp -fomit-frame-pointer -pipe

```

real    0m22.676s

user    0m22.205s

sys     0m0.473s

```

The only relevant thing to say is, you definitely don't want to use nocona on Core 2 Duo Gentoo x86. Besides that, I think it will hurt users on x86_64 also.

I discovered "by accident" that athlon-xp is the fastest (OK, the difference is about "nothing"). I moved from athlon-xp to Core 2 duo and I'm compiling for two boxes on my Core 2 duo system. I'll keep optimizing for my older computer (=athlon-xp) and the processor I sold recently  :Smile: . So, I'm not a fool playing around with -march=athlon-xp on an Intel system.

For the ricers interested (I don't know if it's safe):

-O3 -march=athlon-xp -fomit-frame-pointer -pipe

```

real    0m20.949s

user    0m20.453s

sys     0m0.489s

```

-mfpmath=sse and -msse3 don't seem to influence the results very much for me. Maybe because of the -march=athlon-xp.
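
To put the numbers above in proportion, here is a small sketch that expresses each "real" time relative to the -march=i686 run. The times are copied from the runs above; `rel_diff` is a made-up helper name, not part of any tool.

```shell
#!/bin/sh
# Relative change of each "real" time versus the -march=i686 baseline.
# rel_diff is a hypothetical helper: $1 = baseline seconds, $2 = measured seconds.
rel_diff() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%+.1f%%\n", (t - base) / base * 100 }'
}

base=22.760   # -O2 -march=i686
for entry in "nocona:26.353" "pentium-m:22.796" "athlon-xp:22.676" "athlon-xp-O3:20.949"; do
    name=${entry%%:*}   # text before the colon
    secs=${entry#*:}    # text after the colon
    printf '%-14s %s\n' "$name" "$(rel_diff "$base" "$secs")"
done
```

By this measure nocona comes out roughly 16% slower than i686 in this one test, while the other -O2 builds sit within half a percent of each other.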

----------

## rhill

I spoke to someone at Intel who works on GCC and he confirmed that for Core Solo/Duo, -march=prescott is the correct microarchitecture.  Core 2 Solo/Duo (and if you're lucky enough to have one, the quad-core Core 2 Extreme QX6700) should use -march=nocona with GCC 4.1.

With GCC 4.2 you can use the new -march=core2 which enables the also new -mssse3 (say that three times fast) instruction set.

Lloeki:  i was wrong about -mfpmath=sse not generating sse3 instructions.  it just didn't with that particular code, which is weird but so is GCC sometimes   :Wink: 

irondog: you can't make general claims like that based on one benchmark, especially an I/O based one like gzip.   :Razz:   and SSE won't affect it because you're not doing anything that requires floating point calculations.
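
For anyone unsure which of these instruction sets their own CPU reports, here is a rough sketch that looks at the flags line in /proc/cpuinfo. `cpu_has_flag` is an illustrative name, and the optional second argument is only there so another file can be checked. In that file SSE3 shows up as `pni`, SSSE3 as `ssse3`, and 64-bit capability as `lm`.

```shell
#!/bin/sh
# cpu_has_flag FLAG [FILE]: succeed if FLAG appears in the first "flags" line.
# cpu_has_flag is a hypothetical helper; FILE defaults to /proc/cpuinfo.
cpu_has_flag() {
    awk -v want="$1" '
        /^flags/ { for (i = 3; i <= NF; i++) if ($i == want) found = 1; exit }
        END { exit !found }
    ' "${2:-/proc/cpuinfo}" 2>/dev/null
}

for f in lm pni ssse3; do
    if cpu_has_flag "$f"; then
        echo "$f: yes"
    else
        echo "$f: no"
    fi
done
```

On the compiler side, `gcc -march=nocona -dM -E - </dev/null | grep SSE` lists the SSE macros GCC predefines for a given -march, which is a quick way to see what a setting really enables.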

----------

## ECantona

dirtyepic: what about mobile processors like core 2 duo merom?

----------

## Lloeki

dirtyepic, 

indeed, that's great and interesting news  :Smile:  and so, I was wrong...

I'll put prescott for now on my gf's core duo and my own core 2 duo (remember, i'm going 32bits for now  :Wink:  ). but won't go as far as rebuilding world (no ricer mode here).

I guess we'll now have to wait for gcc 4.2 for some time, since I'm running stable...

anyway, thanks a lot for the research  :Smile: 

----------

## rhill

oops, make that GCC 4.3.   :Sad: 

ECantona: yeah, this is for all Core CPUs from Yonah to Merom to Kentsfield.

----------

## alphamaennchen

I have a T7200 (Core 2 Duo 2.0 GHz, mobile version).

It is built into an Asus A8jp notebook, together with an ATI X1700.

Here is my make.conf:

# These settings were set by the catalyst build script that automatically built this stage

# Please consult /etc/make.conf.example for a more detailed example

CHOST="x86_64-pc-linux-gnu"

CFLAGS="-O2 -pipe"

CXXFLAGS="${CFLAGS}"

#

USE="aac acpi alsa apache2 arts avi beagle bzlib cdr dbus dmix directfb dvdcss dvd

dvdread encode fam firefox ffmpeg fortran gif gpm gtk gtk2 hal jpeg kde

linguas_de mad math motif mmx mp3 mpeg mysql nls ntpl ntplonly nsplugin

ogg openal opengl oss pcre pdf pdflib php pnf png pstricks qt qt3 rtsp samba sdl slang

sockets sse sse2 sqlite threads truetype udev unicode usb userlocales utf8 vcd vhosts

win32codecs xine xml xscreensaver xv xvid zlib X

-esd -fPIC"

#

ACCEPT_KEYWORDS="amd64"

MAKEOPTS="-j3"

GENTOO_MIRRORS="ftp://sunsite.informatik.rwth-aachen.de/pub/Linux/gentoo ftp://ftp.uni-erlangen.de/pub/mirrors/gentoo "

#

LINGUAS="de en"

FEATURES="parallel-fetch"

#

ALSA_CARDS="hda-intel"

#

VIDEO_CARDS="fglrx"

INPUT_DEVICES="keyboard synaptics mouse"

#

PORTDIR_OVERLAY="/usr/local/portage"

source /usr/portage/local/layman/make.conf

Everything is fine!

And: using ati-drivers-8.29.6 works fine for me, later versions don't!

No compile errors, speed is good... only some problems with snd_hda, but that is another topic...

----------

## Dirk.R.Gently

Any luck resolving this?  The MacBook Wiki actually recommends nocona for 32 bit.

----------

## llavalle

fyi, take a look at this page :

http://gcc.gnu.org/gcc-4.3/changes.html

 *Quote:*   

> 
> 
> New Targets and Target Specific Improvements
> 
> IA-32/x86-64
> ...

 

----------

## alphamaennchen

And it works flawlessly.

Let me know if you need my make.conf or anything else...

----------

## crisandbea

I have just bought a Dell D620 notebook with a Core2-Duo T7200 (Merom).

I'd like to ask for some advice:

1) Should I use the x86 or the amd64 minimal CD?

2) Which CFLAGS should I set?

thanks

----------

## michel7

 *crisandbea wrote:*   

> I have just bought a Dell D620 notebook with a Core2-Duo T7200 (Merom).
> 
> I'd like to ask for some advice:
> 
> 1) Should I use the x86 or the amd64 minimal CD?
> ...

 

1) i would suggest using x86 because it's the safer choice

2) my CFLAGS on my T7200 (Merom) are: CFLAGS="-march=prescott -O2 -pipe -fomit-frame-pointer"

and my system is very stable, with no compilation issues or other complaints ...
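
Put together as a make.conf fragment, that would look something like the sketch below. The CFLAGS are the ones quoted above; the CHOST and MAKEOPTS lines are assumptions for a typical 32-bit Gentoo install on a dual-core machine, so adjust them to your own setup.

```shell
# /etc/make.conf fragment (sketch), 32-bit install on a Core 2 with GCC 4.1.
# Values are the ones reported in this thread, not an official recommendation.
CHOST="i686-pc-linux-gnu"
CFLAGS="-march=prescott -O2 -pipe -fomit-frame-pointer"
CXXFLAGS="${CFLAGS}"
# Two cores, so -j3 (cores + 1), as used in other posts in this thread.
MAKEOPTS="-j3"
```

On an amd64 install the CHOST would be x86_64-pc-linux-gnu instead, and the suggestion earlier in the thread for GCC 4.1 is -march=nocona.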

----------

## progman32

Here is an article on LinuxHardware.org showing some data (benchmarks). It's really a comparison between different CPUs, but it has some hard data on 64 vs 32 bit for those wondering about the performance differences, and, most importantly, the GCC flags that were used in said benchmarks. 

Notice, however, that they didn't use -msse3  or -mfpmath=sse. Would be interesting to know what difference it makes.

Also, anyone know what CPU setting to use in the kernel, as asked above? 

I'm waiting for my new core 2 duo, once I have Gentoo installed I will post some benchmarks. Anyone know a good way of benchmarking kernel performance in various tasks?
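
On the kernel CPU setting: whatever ends up being the right answer, it is easy to see which "Processor family" option a given kernel config has selected. The sketch below assumes the Gentoo convention of a /usr/src/linux symlink and only looks for a handful of x86 choices (Pentium-M, Pentium-4, Core 2, the 64-bit P4/Xeon option, K8, and generic); `kernel_cpu_family` is a made-up helper name, and the Core 2 option only exists in newer kernels.

```shell
#!/bin/sh
# kernel_cpu_family [CONFIG]: print which "Processor family" option is set to y.
# Hypothetical helper; only a few x86 choices are listed, and the default
# path assumes a /usr/src/linux symlink pointing at the configured sources.
kernel_cpu_family() {
    grep -E '^CONFIG_(MPENTIUMM|MPENTIUM4|MCORE2|MPSC|MK8|GENERIC_CPU)=y$' \
        "${1:-/usr/src/linux/.config}" 2>/dev/null
}

kernel_cpu_family "$@" || echo "no matching processor family found"
```

Pass a different path as the first argument to inspect another config file.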

----------

## lodewj

I just wanted to try gcc-4.3.0_alpha20070817 on my test server.

Intel Pentium Dual-Core E2160 (a Conroe with 1 MB of L2 cache).

```

Portage 2.1.2.12 (default-linux/amd64/2007.0/server, gcc-4.1.2, glibc-2.5-r4, 2.6.22-gentoo-r2 x86_64)

=================================================================

System uname: 2.6.22-gentoo-r2 x86_64 Genuine Intel(R) CPU 2160 @ 1.80GHz

Gentoo Base System release 1.12.9

Timestamp of tree: Wed, 22 Aug 2007 11:50:01 +0000

dev-lang/python:     2.4.4-r4

dev-python/pycrypto: 2.0.1-r6

sys-apps/sandbox:    1.2.17

sys-devel/autoconf:  2.13, 2.61

sys-devel/automake:  1.7.9-r1, 1.9.6-r2, 1.10

sys-devel/binutils:  2.17

sys-devel/gcc-config: 1.3.16

sys-devel/libtool:   1.5.24

virtual/os-headers:  2.6.21

ACCEPT_KEYWORDS="amd64"

AUTOCLEAN="yes"

CBUILD="x86_64-pc-linux-gnu"

CFLAGS="-march=nocona -msse3 -mfpmath=sse -O2 -fomit-frame-pointer -pipe"

CHOST="x86_64-pc-linux-gnu"

CONFIG_PROTECT="/etc"

CONFIG_PROTECT_MASK="/etc/env.d /etc/gconf /etc/php/apache2-php5/ext-active/ /etc/php/cgi-php5/ext-active/ /etc/php/cli-php5/ext-active/ /etc/revdep-rebuild /etc/terminfo"

CXXFLAGS="-march=nocona -msse3 -mfpmath=sse -O2 -fomit-frame-pointer -pipe"

DISTDIR="/usr/portage/distfiles"

FEATURES="ccache distlocks metadata-transfer sandbox sfperms strict"

GENTOO_MIRRORS="ftp.belnet.be/linux/gentoo ftp.snt.utwente.nl/pub/linux/gentoo"

MAKEOPTS="-j3"

PKGDIR="/usr/portage/packages"

PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages --filter=H_**/files/digest-*"

PORTAGE_TMPDIR="/var/tmp"

PORTDIR="/usr/portage"

PORTDIR_OVERLAY="/usr/local/portage"

SYNC="rsync://rsync.europe.gentoo.org/gentoo-portage"

USE="/ 3dnow 3dnowext acl acpi amd64 apache2 berkdb bitmap-fonts bzip2 cgi cli cracklib crypt dri fortran gdbm glibc-omitfp gpm hash iconv ipv6 isdnlog kerberos midi mmx mmxext mudflap ncurses nls nptl nptlonly openmp openntpd pam pcre perl php posix postgres pppd python readline reflection samba session spl sse sse2 ssl tcpd truetype truetype-fonts type1-fonts unicode xml xorg zip zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mulaw multi null plug rate route share shm softvol" ELIBC="glibc" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" USERLAND="GNU" VIDEO_CARDS="apm ark chips cirrus cyrix dummy fbdev glint i128 i810 mach64 mga neomagic nv r128 radeon rendition s3 s3virge savage siliconmotion sis sisusb tdfx tga trident tseng v4l vesa vga via vmware voodoo"

Unset:  CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, LINGUAS, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS

```

but compiling gcc fails   :Sad: 

these are the last lines:

```

/var/tmp/portage/sys-devel/gcc-4.3.0_alpha20070817/work/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/char_traits.h: In static member function 'static $

/var/tmp/portage/sys-devel/gcc-4.3.0_alpha20070817/work/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/char_traits.h:141: error: 'EOF' was not declared $

/var/tmp/portage/sys-devel/gcc-4.3.0_alpha20070817/work/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/char_traits.h: In static member function 'static $

/var/tmp/portage/sys-devel/gcc-4.3.0_alpha20070817/work/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/char_traits.h:293: error: 'EOF' was not declared $

make[4]: *** [codecvt.lo] Error 1

make[4]: Leaving directory `/var/tmp/portage/sys-devel/gcc-4.3.0_alpha20070817/work/build/x86_64-pc-linux-gnu/libstdc++-v3/src'

make[3]: *** [all-recursive] Error 1

make[3]: Leaving directory `/var/tmp/portage/sys-devel/gcc-4.3.0_alpha20070817/work/build/x86_64-pc-linux-gnu/libstdc++-v3'

make[2]: *** [all] Error 2

make[2]: Leaving directory `/var/tmp/portage/sys-devel/gcc-4.3.0_alpha20070817/work/build/x86_64-pc-linux-gnu/libstdc++-v3'

make[1]: *** [all-target-libstdc++-v3] Error 2

make[1]: Leaving directory `/var/tmp/portage/sys-devel/gcc-4.3.0_alpha20070817/work/build'

make: *** [profiledbootstrap] Error 2

!!! ERROR: sys-devel/gcc-4.3.0_alpha20070817 failed.

Call stack:

  ebuild.sh, line 1638:   Called dyn_compile

  ebuild.sh, line 985:   Called qa_call 'src_compile'

  ebuild.sh, line 44:   Called src_compile

  ebuild.sh, line 1328:   Called toolchain_src_compile

  toolchain.eclass, line 26:   Called gcc_src_compile

  toolchain.eclass, line 1546:   Called gcc_do_make

  toolchain.eclass, line 1420:   Called die

!!! emake failed with profiledbootstrap

!!! If you need support, post the topmost build error, and the call stack if relevant.

!!! A complete build log is located at '/var/tmp/portage/sys-devel/gcc-4.3.0_alpha20070817/temp/build.log'.

```

you can find the full output of build.log here.

any ideas?

----------

