NAMD benchmark results for dual-Nehalem, was Re: questions regarding new single node box for NAMD: re components, CUDA, Fermi

From: Biff Forbush (biff.forbush_at_yale.edu)
Date: Wed Dec 09 2009 - 20:53:00 CST

Hi All,

    Following up on an attempt to build a reasonably fast single-box MD
machine, I have just assembled a dual 3.33 GHz Xeon (Nehalem; details at
the end) system and carried out a set of benchmarks (no GPUs at this
point). I wanted to check a concern about CPU speed and main
memory bandwidth, expressed earlier by Axel:
> please note that i didn't refer to cost/ghz but cost/performance ratio.
> the higher the clock rate, the less you get out of it due to the
> severe imbalance between the cpu and the main memory performance.
>
I changed CPU speed by changing the clock multiplier (15, 20, or the
original 25), which as I understand it affects the CPU clock but not the
memory subsystem -- in any case, I think this accurately simulates using
a cheaper and slower CPU.

The bottom lines are that
(1) performance is proportional to CPU clock rate, to within a few
percent, between 2.0 and 3.33 GHz at all "+p" values. Apparently the
architecture improvements in Nehalem have fixed earlier memory bottlenecks.
(2) NAMD scaling efficiency drops to about 60% on going from +p4 to +p8
and then holds fairly steady to +p16 (see the more detailed steps at the
very end of the message). It is puzzling that the drop comes this early.
Here are the raw values for the default apoa1 benchmark in seconds/step:

Apoa1, 92,224 atoms. http://www.ks.uiuc.edu/Research/namd/performance.html
        2_GHz  2.6_GHz  3.3_GHz
+p1     2.725    2.048    1.643
+p2     1.400    1.050    0.842
+p4     0.709    0.530    0.428
+p8     0.544    0.410    0.319
+p12    0.373    0.281    0.224
+p16    0.295    0.223    0.179   seconds-per-step

Normalizing these values by both clock rate and number of processes,
relative to 2 GHz with +p1 (so 1.000 means perfect scaling), we see:
        2_GHz  2.6_GHz  3.3_GHz
+p1     1.000    0.998    0.995
+p2     0.973    0.973    0.970
+p4     0.961    0.965    0.954
+p8     0.626    0.623    0.641
+p12    0.609    0.606    0.608
+p16    0.577    0.573    0.570   2-way normalized values
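
For anyone who wants to redo the 2-way normalization on their own timings,
here is a minimal Python sketch. It assumes the CPU clock is simply
proportional to the multiplier (15, 20, 25), so the multipliers stand in
for the clock rates; the names APOA1 and two_way_efficiency are mine and
purely illustrative. It should reproduce the table above to within rounding.

```python
# Seconds/step for apoa1, copied from the raw table above.
# Outer keys are clock multipliers (15 ~ 2.0 GHz, 20 ~ 2.6 GHz, 25 ~ 3.33 GHz);
# inner keys are the +p values.
APOA1 = {
    15: {1: 2.725, 2: 1.400, 4: 0.709, 8: 0.544, 12: 0.373, 16: 0.295},
    20: {1: 2.048, 2: 1.050, 4: 0.530, 8: 0.410, 12: 0.281, 16: 0.223},
    25: {1: 1.643, 2: 0.842, 4: 0.428, 8: 0.319, 12: 0.224, 16: 0.179},
}


def two_way_efficiency(times, ref_mult=15, ref_procs=1):
    """Normalize s/step by clock multiplier and process count.

    Assumption: clock rate is proportional to the multiplier, so
    time * multiplier * processes is constant under perfect scaling.
    """
    ref_cost = times[ref_mult][ref_procs] * ref_mult * ref_procs
    return {
        mult: {p: ref_cost / (t * mult * p) for p, t in by_procs.items()}
        for mult, by_procs in times.items()
    }


if __name__ == "__main__":
    eff = two_way_efficiency(APOA1)
    for p in (1, 2, 4, 8, 12, 16):
        row = "".join(f"{eff[m][p]:9.3f}" for m in (15, 20, 25))
        print(f"+p{p:<4}{row}")
```

A value of 1.000 means the run used its clock cycles and processes exactly
as well as the 2 GHz +p1 reference.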

Essentially the same result (for scaling and CPU speeds) was obtained
for the larger F1ATPase benchmark (results are near the end of this
message). The stmv benchmark, however, just quits with "Program
finished" after the Startup phase.

I am about to install two GTX-295s in this system (purchased before
detailed news of Fermi), and am looking forward to the bleeding edge of
the namd/cuda world. This time I am not overly optimistic, based on
previous discussions on this forum, and on the sound advice of Axel:

> there is a lot of "hope" in your statements. i have learned
> to be more paranoid over time and don't go with theoretical
> possibilities.

Regards,
Biff

--------------------------

Hardware:
(2) Intel Xeon W5590 Nehalem-EP 3.33GHz
(6) 2GB DDR3 1333 MHz Kingston sdram
Tyan S7025 Motherboard
under Windows 7 (64bit)
NAMD 2.7b2

Core temperatures stay below 65 °C and everything is very quiet (so far,
but without GPUs!) with:
Silverstone ST1500 1500W power supply
(2) Scythe Ninja 2 Rev. B cpu coolers (installation issues here)
Cooler Master Cosmos S case

------------------------------

F1ATPase benchmark results (327,506 atoms)
http://www.ks.uiuc.edu/Research/namd/utilities/f1atpase/
        2_GHz  2.6_GHz  3.3_GHz
+p1     8.271    6.240    5.003
+p2     4.167    3.130    2.539
+p4     2.099    1.582    1.297
+p8     1.563    1.185    0.967
+p12    1.075    0.823    0.669
+p16    0.896    0.670    0.539   seconds-per-step

        2_GHz  2.6_GHz  3.3_GHz
+p1     1.000    0.994    0.992
+p2     0.993    0.991    0.977
+p4     0.985    0.980    0.957
+p8     0.662    0.655    0.641
+p12    0.641    0.628    0.618
+p16    0.577    0.578    0.575   2-way normalized values

------------------------------

Details of drop in scaling efficiency as Dual Xeons approach 100%
utilization:
ApoA1, as above (3.33 GHz)
+pN   s/step   scaling [normalizing s/step by number of processes]
  1    1.643    1.000
  2    0.842    0.975
  3    0.570    0.962
  4    0.428    0.959
  5    0.394    0.834
  6    0.402    0.681
  7    0.369    0.636
  8    0.336    0.611
  9    0.303    0.603
 10    0.273    0.601
 11    0.251    0.594
 12    0.227    0.604
 13    0.212    0.595
 14    0.201    0.583
 15    0.188    0.584
 16    0.180    0.569
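
The scaling column can be recomputed, and the first +p value where it
falls off located, with a short Python sketch like the one below. The
0.9 threshold is an arbitrary choice of mine, and the function names are
illustrative only; the last digit may differ from the table because the
s/step values are already rounded.

```python
# Seconds/step for ApoA1 at 3.33 GHz, copied from the table above.
APOA1_333 = {
    1: 1.643, 2: 0.842, 3: 0.570, 4: 0.428,
    5: 0.394, 6: 0.402, 7: 0.369, 8: 0.336,
    9: 0.303, 10: 0.273, 11: 0.251, 12: 0.227,
    13: 0.212, 14: 0.201, 15: 0.188, 16: 0.180,
}


def scaling_efficiency(times):
    """Return t(1) / (N * t(N)) for each +pN; 1.0 is perfect scaling."""
    t1 = times[1]
    return {n: t1 / (n * t) for n, t in sorted(times.items())}


def first_below(efficiency, threshold=0.9):
    """Smallest +pN whose efficiency is under the (arbitrary) threshold."""
    for n in sorted(efficiency):
        if efficiency[n] < threshold:
            return n
    return None


if __name__ == "__main__":
    eff = scaling_efficiency(APOA1_333)
    for n, e in eff.items():
        speedup = APOA1_333[1] / APOA1_333[n]
        print(f"+p{n:<3} efficiency {e:.3f}  speedup {speedup:5.2f}x")
    print("first +p below 0.9:", first_below(eff))
```

With these numbers the efficiency first falls below 0.9 at +p5, even
though the overall speedup keeps rising slowly all the way to +p16.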
