Re: NAMD benchmark results for dual-Nehalem, was Re: questions regarding new single node box for NAMD: re components, CUDA, Fermi

From: Dow Hurst (Dow.Hurst_at_mindspring.com)
Date: Thu Dec 10 2009 - 05:14:54 CST

Biff,
We recently put together a cluster designed to run NAMD primarily on
CPUs but with an eye to upgrading to GPUs if the performance panned
out. Each node has dual quad-core 2.5GHz Xeon E5420 CPUs, and the nodes
are connected with QLogic Infinipath QLE7280 DDR cards and a 96 port
QLogic 9080 Silverstorm switch. The nodes have a PCI-Express gen2 16x
slot for the Infinipath card to maximize bandwidth and lower latency.
Leaving one core per node free to manage the interconnect really helps,
and moving the PME management onto an additional core gave a further
improvement.
The sweet spot for this simulation is 350 compute cores using 7 cores
per node, not eight, and one extra core to manage the PME. I'm using
the "twoAwayX yes" option to bump up the number of patches, and the
"ldbUnloadZero yes" option to offload the PME in our NAMD config file.
We've tested the CPU and IB interconnects with NAMD 2.6 and have
achieved 15.5 ns/day on 351 cores on a 95,874 atom simulation. (I
apologize for not having apoa1 numbers!) NAMD reported a benchmark of
0.00548014 (seconds)/step or 1.96 cpu s/step for this run. What we've
found is that slower CPUs with lots of onboard cache combined with a
fast interconnect perform very well when scaling up the number of nodes
in the calculation.
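
The core count and throughput above can be checked with a little
arithmetic. This is only a sketch: the function names are made up for
illustration, the 50 nodes follow from 350 compute cores at 7 per node,
and the 1 fs timestep is an assumption, since the timestep of the run is
not stated here. With that assumption the reported seconds/step works
out to roughly 15.8 ns/day, in the same range as the 15.5 ns/day quoted.

```python
SECONDS_PER_DAY = 86_400


def core_count(nodes: int, cores_per_node: int, pme_cores: int = 1) -> int:
    """Total cores: compute cores on every node plus dedicated PME cores."""
    return nodes * cores_per_node + pme_cores


def ns_per_day(seconds_per_step: float, timestep_fs: float) -> float:
    """Simulated nanoseconds per wall-clock day for a given step cost."""
    steps_per_day = SECONDS_PER_DAY / seconds_per_step
    return steps_per_day * timestep_fs * 1e-6  # 1 fs = 1e-6 ns


if __name__ == "__main__":
    # 50 nodes x 7 cores + 1 PME core (node count derived from 350 / 7)
    print(core_count(50, 7))
    # timestep_fs=1.0 is an assumption, not a value from the run
    print(round(ns_per_day(0.00548014, 1.0), 2))
```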

As described in the NAMD wiki you can find the sweet spot for a system
by grepping the NAMD log file for the string "PATCH GRID". Multiply the
three values you find in the line to get the total number of patches
that run is going to use. Each patch needs one core for the best
performance, so the apoa1 benchmark really needs 144 cores. Then, add
one more core for PME management if your simulation requires it. Our
simulation is a box containing a G-protein coupled receptor suspended in
a lipid bilayer with explicit waters above and below the bilayer and so
required PME. Managing the number of patches is a juggling game that
requires figuring out the cpu cores available to you, the number of
patches the simulation is split up into, the number of patches when
using the twoAway keyword, or even adding the twoAwayY keyword. Varying
the simulation cell size by adding or reducing atoms is another way to
change the number of patches.
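
The "PATCH GRID" bookkeeping can be scripted. The sketch below is
hedged: the helper names are hypothetical, and the sample log line is
only my assumption of what the line looks like (a 6 x 6 x 4 grid is
used because it multiplies out to the 144 patches mentioned for apoa1).
The parser just takes the first three integers after "PATCH GRID", so
check it against your own log before trusting it.

```python
import re
from math import prod


def patch_count(log_text: str) -> int:
    """Multiply the three patch-grid dimensions found in a NAMD log."""
    for line in log_text.splitlines():
        if "PATCH GRID" in line:
            # Take every integer that appears after the marker string.
            tail = line.split("PATCH GRID", 1)[1]
            dims = [int(n) for n in re.findall(r"\d+", tail)]
            if len(dims) >= 3:
                return prod(dims[:3])
    raise ValueError("no PATCH GRID line with three dimensions found")


def suggested_cores(patches: int, uses_pme: bool = True) -> int:
    """One core per patch, plus one extra core when PME is in use."""
    return patches + (1 if uses_pme else 0)


if __name__ == "__main__":
    # Assumed example line, not copied from a real log.
    sample = "Info: PATCH GRID IS 6 (PERIODIC) BY 6 (PERIODIC) BY 4 (PERIODIC)"
    patches = patch_count(sample)
    print(patches, suggested_cores(patches))
```

To use it on a real run, read the log file into a string and pass it to
patch_count().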

Once you find the "sweet spot", accommodating the needed bandwidth with
the lowest latency is the next step. We avoided higher latencies by
going with the QLogic Infinipath cards as they were shown in a white
paper from SC2008 to slowly and smoothly increase in latency as packet
size increased. (You can see this effect on the performance benchmark
web page on the NAMD site.) We also limited the number of switch hops
between nodes in the IB network by using one large switch.

We haven't yet tested the two Teslas in the cluster but plan to do that
soon. Each Tesla is shared between two nodes since the PCI-Express bus
is probably going to be the limiting factor. I did some testing with
the 2.7b1 code and the apoa1 benchmark on the BASS cluster at UNC which
led me to think that one GPU per IB card was best for NAMD with DDR
IB. I hope to test 2.7b2 on that cluster and get numbers with the
production level code. I looked back at the numbers I generated and 28
nodes, 28 GPUs, and 28 cores, or 1 core/1 GPU/1 IB card gave the best
numbers for the apoa1 benchmark with the 2.7b1 code from Jan 2009. What
was interesting to me was that there was another peak in performance at
60 nodes, 60 GPUs, and 60 cores. Also, 28 nodes, 56 GPUs, and 28 IB
cards, or 2 cores/2 GPUs/1 IB card, was similar in performance to the
best numbers generated with 28 nodes using 1 core/1 GPU/1 IB card.
There is going to be a slightly more complex formula to figure out the
best patch/(cpu/GPU) ratio.

Here are the basic specs for our cluster. The compute node is cheap:
(1) ASUS DSEB-DG motherboard
(1) 2U chassis
(2) Xeon E5420 2.5GHz
(4) 1GB DDR2 667MHz ECC/FB RAM
(1) QLE7280 Infinipath card
Cost $2010.33

IB switch is not cheap:
(1) 96 port chassis
(6) 12 port blades
Cost $57655.79

We decided to put a 1U spacer in between each node and use 2 extra racks
when designing the cluster. This way we can add a Tesla later without a
lot of work. The cluster has 64 compute nodes in total and a couple of
head nodes. We just couldn't afford the Nehalem platform and still keep
a decent number of compute nodes.
Best wishes,
Dow Hurst

Biff Forbush wrote:
> Hi All,
>
> Following up on an attempt to build a reasonably fast single-box MD
> machine, I have just assembled a dual-3.33 GHz Xeon (Nehalem; details
> at the end) system and carried out a set of benchmarks (No GPUs at
> this point). I was interested to check a concern about CPU speed and
> main memory bandwidth, expressed earlier by Axel:
>> please note that i didn't refer to cost/ghz but cost/performance ratio.
>> the higher the clock rate, the less you get out of it due to the
>> severe imbalance between the cpu and the main memory performance.
>>
> I changed CPU speed by changing the clock multiplier (15, 20, or the
> original 25), which as I understand it affects the CPU clock but not
> the memory subsystem -- in any case, I think this accurately simulates
> using a cheaper and slower CPU.
>
> The bottom lines are that
> (1) performance is strictly proportional to CPU clock rate between 2.0
> and 3.33 GHz at all "+p" values. Apparently the architecture
> improvements in Nehalem have fixed earlier memory bottlenecks.
> (2) NAMD scaling efficiency drops to about 60% on going from +p4 to
> +p8 and then holds fairly steady to +p16 (see more detailed steps at
> the very end of the message) -- puzzling that the drop is this early.
> Here are the raw values for the default apoa1 benchmark in seconds/step:
>
> Apoa1, 92,224 atoms.
> http://www.ks.uiuc.edu/Research/namd/performance.html
>         2_GHz   2.6_GHz   3.3_GHz
> +p1     2.725   2.048     1.643
> +p2     1.400   1.050     0.842
> +p4     0.709   0.530     0.428
> +p8     0.544   0.410     0.319
> +p12    0.373   0.281     0.224
> +p16    0.295   0.223     0.179   seconds-per-step
>
> Normalizing these values to 2GHz with +p1 we see:
>         2_GHz   2.6_GHz   3.3_GHz
> +p1     1.000   0.998     0.995
> +p2     0.973   0.973     0.970
> +p4     0.961   0.965     0.954
> +p8     0.626   0.623     0.641
> +p12    0.609   0.606     0.608
> +p16    0.577   0.573     0.570   2-way normalized values
>
> Essentially the same result (for scaling and CPU speeds) was obtained
> for the larger F1ATPase benchmark (results are near the end of this
> message; but the stmv benchmark just quits "Program finished" after
> the Startup phase).
>
> I am about to install two GTX-295s in this system (purchased before
> detailed news of Fermi), and am looking forward to the bleeding edge
> of the namd/cuda world. This time I am not overly optimistic, based
> on previous discussions on this forum, and on the sound advice of Axel:
>
>> there is a lot of "hope" in your statements. i have learned to be
>> more paranoid over time and don't go with theoretical
>> possibilities.
>
> Regards,
> Biff
>
> --------------------------
>
> Hardware:
> (2) Intel Xeon W5590 Nehalem-EP 3.33GHz
> (6) 2GB DDR3 1333 MHz Kingston sdram
> Tyan S7025 Motherboard
> under Windows 7 (64bit)
> NAMD 2.72b
>
> Core temperatures stay below 65 °C and everything is very quiet (so
> far, but without GPUs!) with:
> Silverstone ST1500 1500W power supply
> (2) Scythe Ninja 2 Rev. B cpu coolers (installation issues here)
> Cooler Master Cosmos S case
>
> ------------------------------
>
> F1ATPase benchmark results (327,506 atoms)
> http://www.ks.uiuc.edu/Research/namd/utilities/f1atpase/
>         2_GHz   2.6_GHz   3.3_GHz
> +p1     8.271   6.240     5.003
> +p2     4.167   3.130     2.539
> +p4     2.099   1.582     1.297
> +p8     1.563   1.185     0.967
> +p12    1.075   0.823     0.669
> +p16    0.896   0.670     0.539   seconds-per-step
>
>         2_GHz   2.6_GHz   3.3_GHz
> +p1     1.000   0.994     0.992
> +p2     0.993   0.991     0.977
> +p4     0.985   0.980     0.957
> +p8     0.662   0.655     0.641
> +p12    0.641   0.628     0.618
> +p16    0.577   0.578     0.575   2-way normalized values
>
> ------------------------------
>
> Details of drop in scaling efficiency as Dual Xeons approach 100%
> utilization:
> ApoA1, as above (3.33 GHz)
> +pN   s/step   scaling [normalizing s/step by number of processes]
>  1    1.643    1.000
>  2    0.842    0.975
>  3    0.570    0.962
>  4    0.428    0.959
>  5    0.394    0.834
>  6    0.402    0.681
>  7    0.369    0.636
>  8    0.336    0.611
>  9    0.303    0.603
> 10    0.273    0.601
> 11    0.251    0.594
> 12    0.227    0.604
> 13    0.212    0.595
> 14    0.201    0.583
> 15    0.188    0.584
> 16    0.180    0.569
>
>
>
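
For anyone wanting to redo the "scaling" column in the quoted table, it
appears to be the single-process time divided by N times the N-process
time. The sketch below assumes that definition (the function name is
mine, not from the original message) and reproduces the quoted values
to within rounding of the three-digit timings.

```python
def scaling_efficiency(t1: float, tn: float, n: int) -> float:
    """Parallel efficiency: 1.0 means N processes run N times faster."""
    return t1 / (n * tn)


if __name__ == "__main__":
    # (N, seconds/step) pairs taken from the quoted 3.33 GHz ApoA1 table.
    timings = [(1, 1.643), (2, 0.842), (4, 0.428), (8, 0.336), (16, 0.180)]
    t1 = timings[0][1]
    for n, tn in timings:
        print(f"+p{n:<3d} {tn:.3f}  {scaling_efficiency(t1, tn, n):.3f}")
```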
