Re: NAMD benchmark results for dual-Nehalem, was Re: questions regarding new single node box for NAMD: re components, CUDA, Fermi

From: Gengbin Zheng (gzheng_at_illinois.edu)
Date: Thu Dec 10 2009 - 11:54:33 CST

There should not be such a substantial performance drop from 4 to 8 cores
for NAMD on Nehalem.
Here is the NAMD apoa1 performance data that I have on Linux:

Ubuntu Linux on Nehalem 2 x quad-core 2.26 GHz (single node)

     +pN   s/step
       1   1.7
       2   0.879
       4   0.46
       8   0.245
      16   0.204

As expected, the performance flattens out at 16 cores (that is, beyond
+p8), where hyperthreading is used. But it scales well from 4 to 8. I
think this result was obtained with CPU affinity enabled. To enable it,
just run NAMD with the command-line option +setcpuaffinity. This option
also works on Windows, so you can give it a try.
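
To put numbers on the scaling in the table above, here is a minimal
Python sketch that computes speedup and parallel efficiency relative to
the +p1 run. The timings are copied from the table; the names `timings`
and `scaling` are only illustrative.

```python
# Seconds per step from the table above (apoa1, 2 x quad-core Nehalem, 2.26 GHz).
timings = {1: 1.7, 2: 0.879, 4: 0.46, 8: 0.245, 16: 0.204}


def scaling(timings):
    """Return {p: (speedup, efficiency)} relative to the single-process run."""
    t1 = timings[1]
    return {p: (t1 / t, t1 / (t * p)) for p, t in sorted(timings.items())}


results = scaling(timings)
for p, (speedup, efficiency) in results.items():
    print(f"+p{p:<3d} speedup {speedup:5.2f}  efficiency {efficiency:4.2f}")
```

With these numbers the efficiency at +p8 comes out at about 0.87, and
drops to about 0.52 at +p16, where the extra processes land on
hyperthreads.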

  I don't have official NAMD performance data for Windows. But I did
port Charm++/NAMD for Microsoft to their HPC cluster using Microsoft MPI
(msmpi). NAMD in general scales pretty well to 256 cores (16 nodes in
total) on their 16-core Nehalem Windows HPC Server nodes with
hyperthreading disabled, and the performance is comparable to what we
had on Linux. I don't know anything about running NAMD on Windows 7,
but I have heard it is a heavy operating system that may not be as
friendly to parallel computing as Windows HPC Server. If you are
building a large Windows cluster, Windows HPC Server may be the way to go.

Gengbin

Biff Forbush wrote:
> Hi All,
>
> Following up on an attempt to build a reasonably fast single-box MD
> machine, I have just assembled a dual-3.33 GHz Xeon (Nehalem; details
> at the end) system and carried out a set of benchmarks (No GPUs at
> this point). I was interested to check a concern about CPU speed and
> main memory bandwidth, expressed earlier by Axel:
>> please note that i didn't refer to cost/ghz but cost/performance ratio.
>> the higher the clock rate, the less you get out of it due to the
>> severe imbalance between the cpu and the main memory performance.
>>
> I changed CPU speed by changing the clock multiplier (15, 20, or the
> original 25), which as I understand it affects the CPU clock but not
> the memory subsystem -- in any case, I think this accurately simulates
> using a cheaper and slower CPU.
>
> The bottom lines are that
> (1) performance is strictly proportional to CPU clock rate between 2.0
> and 3.33 GHz at all "+p" values. Apparently the architecture
> improvements in Nehalem have fixed earlier memory bottlenecks.
> (2) NAMD scaling efficiency drops to about 60% on going from +p4 to
> +p8 and then holds fairly steady to +p16 (see more detailed steps at
> the very end of the message) -- puzzling that the drop is this early.
> Here are the raw values for the default apoa1 benchmark in seconds/step:
>
> Apoa1, 92,224 atoms.
> http://www.ks.uiuc.edu/Research/namd/performance.html
>         2_GHz   2.6_GHz   3.3_GHz
> +p1     2.725   2.048     1.643
> +p2     1.400   1.050     0.842
> +p4     0.709   0.530     0.428
> +p8     0.544   0.410     0.319
> +p12    0.373   0.281     0.224
> +p16    0.295   0.223     0.179
> (values in seconds per step)
>
> Normalizing these values to the 2 GHz, +p1 result (scaling for both
> clock rate and number of processes), we see:
>         2_GHz   2.6_GHz   3.3_GHz
> +p1     1.000   0.998     0.995
> +p2     0.973   0.973     0.970
> +p4     0.961   0.965     0.954
> +p8     0.626   0.623     0.641
> +p12    0.609   0.606     0.608
> +p16    0.577   0.573     0.570
> (2-way normalized values)
>
> Essentially the same result (for scaling and CPU speeds) was obtained
> for the larger F1ATPase benchmark (results are near the end of this
> message; but the stmv benchmark just quits "Program finished" after
> the Startup phase).
>
> I am about to install two GTX-295s in this system (purchased before
> detailed news of Fermi), and am looking forward to the bleeding edge
> of the namd/cuda world. This time I am not overly optimistic, based
> on previous discussions on this forum, and on the sound advice of Axel:
>
>> there is a lot of "hope" in your statements. i have learned to be
>> more paranoid over time and don't go with theoretical
>> possibilities.
>
> Regards,
> Biff
>
> --------------------------
>
> Hardware:
> (2) Intel Xeon W5590 Nehalem-EP 3.33GHz
> (6) 2GB DDR3 1333 MHz Kingston sdram
> Tyan S7025 Motherboard
> under Windows 7 (64bit)
> NAMD 2.72b
>
> Core temperatures stay below 65 °C and everything is very quiet (so
> far, but without GPUs!) with:
> Silverstone ST1500 1500W power supply
> (2) Scythe Ninja 2 Rev. B cpu coolers (installation issues here)
> Cooler Master Cosmos S case
>
> ------------------------------
>
> F1ATPase benchmark results (327,506 atoms)
> http://www.ks.uiuc.edu/Research/namd/utilities/f1atpase/
>         2_GHz   2.6_GHz   3.3_GHz
> +p1     8.271   6.240     5.003
> +p2     4.167   3.130     2.539
> +p4     2.099   1.582     1.297
> +p8     1.563   1.185     0.967
> +p12    1.075   0.823     0.669
> +p16    0.896   0.670     0.539
> (values in seconds per step)
>
>         2_GHz   2.6_GHz   3.3_GHz
> +p1     1.000   0.994     0.992
> +p2     0.993   0.991     0.977
> +p4     0.985   0.980     0.957
> +p8     0.662   0.655     0.641
> +p12    0.641   0.628     0.618
> +p16    0.577   0.578     0.575
> (2-way normalized values)
>
> ------------------------------
>
> Details of drop in scaling efficiency as Dual Xeons approach 100%
> utilization:
> ApoA1, as above (3.33 GHz)
> (scaling = s/step normalized by number of processes)
> +pN   s/step   scaling
>  1    1.643    1.000
>  2    0.842    0.975
>  3    0.570    0.962
>  4    0.428    0.959
>  5    0.394    0.834
>  6    0.402    0.681
>  7    0.369    0.636
>  8    0.336    0.611
>  9    0.303    0.603
> 10    0.273    0.601
> 11    0.251    0.594
> 12    0.227    0.604
> 13    0.212    0.595
> 14    0.201    0.583
> 15    0.188    0.584
> 16    0.180    0.569
>
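
The two-way normalization in the quoted tables appears to divide the
2 GHz, +p1 time by each measured time scaled by its clock rate and
process count. Below is a minimal Python sketch under that assumption,
using the clock multipliers 15, 20, and 25 from the quoted message as
relative clock rates. The formula and the names `apoa1` and `normalize`
are illustrative assumptions, not taken from the original message.

```python
# Seconds per step for apoa1 from the quoted table, keyed by clock multiplier.
# Multipliers 15, 20, 25 correspond to the 2, 2.6, and 3.3 GHz columns.
apoa1 = {
    15: {1: 2.725, 2: 1.400, 4: 0.709, 8: 0.544, 12: 0.373, 16: 0.295},
    20: {1: 2.048, 2: 1.050, 4: 0.530, 8: 0.410, 12: 0.281, 16: 0.223},
    25: {1: 1.643, 2: 0.842, 4: 0.428, 8: 0.319, 12: 0.224, 16: 0.179},
}


def normalize(table, ref_mult=15):
    """Assumed normalization: (t_ref * f_ref) / (t * f * p).

    t_ref is the +p1 time at the reference multiplier; the multiplier
    stands in for the clock rate f, and p is the process count.
    """
    t_ref = table[ref_mult][1]
    return {
        mult: {p: (t_ref * ref_mult) / (t * mult * p) for p, t in column.items()}
        for mult, column in table.items()
    }


normalized = normalize(apoa1)
for p in (1, 2, 4, 8, 12, 16):
    row = "  ".join(f"{normalized[m][p]:.3f}" for m in (15, 20, 25))
    print(f"+p{p:<3d} {row}")
```

Under this assumption the computed values agree with the quoted apoa1
normalized table to within roughly 0.001, which is consistent with
rounding in the reported timings.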
