NAMD 2.9 single-node benchmarks, 0-2 Kepler GPUs.

From: Aaron Cann
Date: Tue Apr 30 2013 - 21:17:02 CDT

Hello all, I thought I'd write up some of my experiences setting up a basic
NAMD 2.9 GPU workstation. Lots of benchmarks, some conclusions, and a
few questions for the illuminati.

SETUP: System is an Intel LGA 2011 system with two 4 GB Nvidia GTX 670s in
the x16 slots. They're running at PCIe 2.0 thanks to the strange snafus
with Sandy Bridge-E CPUs at 3.0 speeds. CPU is a 4-core hyperthreaded i7,
3.6 GHz. Running Ubuntu 13.04, NAMD 2.9, either with or without CUDA, 64-bit,
latest NVIDIA drivers. Displays were hanging off the GPUs, not doing
anything during the runs. Switching to console mode didn't change anything.
Deliberately loading the GPU with a large VMD rotation slowed runs down.

Note that I cite thread counts: up to 8 threads on 4 cores. More than 4
threads means hyperthreaded (fake) extra CPUs.

Standard NAMD benchmarks except outputEnergies=600. DHFR was adapted from
the AMBER benchmark by Charles Brooks, 2 fs timestep.

STMV benchmark.

Ns/day. T = # threads (may be 2x # of cores).

T    1 GPU    2 GPUs
1    0.099    0.100
2    0.151    0.193
3    0.175    0.220
4    0.182    0.282
8    0.186    0.282


STMV is a large dataset. 2 threads gets 94% of the horsepower out of 1 GPU,
and moving from 2T/1G to 4T/2G gives pretty good scaling with this dataset
(94% of doubled output). This dataset looks largely GPU-bound, although a
six-core CPU would still have been slightly better. Adding a 3rd GPU here
(on the existing four-core CPU) would be an inefficient use of that GPU.
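
The "percent of doubled output" figure can be reproduced with a small calculation. This is a minimal sketch, assuming the two STMV columns are 1 GPU and 2 GPUs; the function name doubling_efficiency is just illustrative. It lands within a point of the 94% quoted above.

```python
def doubling_efficiency(base_rate, doubled_rate):
    """Fraction of the ideal 2x throughput achieved when threads and GPUs are doubled."""
    return doubled_rate / (2.0 * base_rate)

# STMV rates in ns/day, copied from the table above
# (assumed columns: 1 GPU, 2 GPUs).
stmv_2t_1g = 0.151  # 2 threads, 1 GPU
stmv_4t_2g = 0.282  # 4 threads, 2 GPUs

eff = doubling_efficiency(stmv_2t_1g, stmv_4t_2g)
print(f"STMV 2T/1G -> 4T/2G: {eff:.1%} of doubled output")
```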

APOA1 benchmark.

T    CPU only    1 GPU    2 GPUs
1    0.12        1.10     1.10
2    0.24        1.94     2.18
3    0.33        2.21     2.79
4    0.30        2.31     3.10
8    0.31        2.28     3.70

Moving from 1 thread on 1 GPU to 2T/2G again has excellent scaling, 99% of
doubled output, although most of the increase was from the second core, not
the GPU. Getting to 96% of peak output of 1 GPU required 3 threads, not
two. Moving from 2T/1G to 4T/2G didn't work as well: only about 80% of
doubled output, suggesting communication was becoming the issue instead of
GPU horsepower.

Seven-fold speedup with 1 GPU, 12 with 2, not bad.
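
One reading that reproduces these fold-speedups is to compare the rates at 8 threads. This is a sketch under that assumption, and it also assumes the three data columns are CPU only, 1 GPU and 2 GPUs; the post does not say which rows were actually compared.

```python
# ApoA1 rates in ns/day at 8 threads, copied from the table above
# (assumed column order: CPU only, 1 GPU, 2 GPUs).
cpu_only, one_gpu, two_gpu = 0.31, 2.28, 3.70

def fold_speedup(gpu_rate, cpu_rate):
    """How many times faster the GPU run is than the CPU-only run."""
    return gpu_rate / cpu_rate

speedup_1 = fold_speedup(one_gpu, cpu_only)
speedup_2 = fold_speedup(two_gpu, cpu_only)
print(f"1 GPU: {speedup_1:.1f}x, 2 GPUs: {speedup_2:.1f}x")
```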

DHFR benchmark.

T    CPU only    1 GPU    2 GPUs
1    1.65        6.05     6.04
2    3.2         11.4     11.2
3    4.3         15.2     15.1
4    4.2         14.5     18.1
8    3.5         14.6     21.7

This small job was even more dependent on the CPU. I don't really trust the
4th core because of system tasks (even though the system was otherwise idle,
of course). The system is CPU-bound, but I don't understand why scaling was
so poor beyond 2 cores.

The second GPU doesn't do a whole lot (about 35% more max speed). Even with
this small dataset, the GPU gave an almost 4-fold boost over this modest
$300 CPU. (No 16-core Opterons here.)


Conclusions: CPU matters in NAMD!


Ideally, I'd like to have at least three cores per 670 GPU on all three of
these datasets; it matters more with the smaller datasets. Although I didn't
test different GPUs, I'd expect faster GPUs to require higher core/GPU ratios
as well.
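
The three-cores-per-GPU figure can be checked against the single-GPU columns. This is a rough sketch: the 90% cutoff and the helper name threads_to_reach are my own choices for illustration, not anything NAMD reports, and the numbers are the 1-GPU rates as printed in the tables above.

```python
def threads_to_reach(rates_by_threads, fraction=0.90):
    """Smallest thread count whose rate is at least `fraction` of the best rate."""
    peak = max(rates_by_threads.values())
    for threads in sorted(rates_by_threads):
        if rates_by_threads[threads] >= fraction * peak:
            return threads
    return None  # not reachable in practice: the peak row always qualifies

# Single-GPU rates in ns/day, keyed by thread count.
one_gpu = {
    "STMV":  {1: 0.099, 2: 0.151, 3: 0.175, 4: 0.182, 8: 0.186},
    "ApoA1": {1: 1.10,  2: 1.94,  3: 2.21,  4: 2.31,  8: 2.28},
    "DHFR":  {1: 6.05,  2: 11.4,  3: 15.2,  4: 14.5,  8: 14.6},
}

needed = {name: threads_to_reach(rates) for name, rates in one_gpu.items()}
for name, threads in needed.items():
    print(f"{name}: {threads} threads to reach 90% of single-GPU peak")
```

With a stricter cutoff the answer shifts upward for some datasets, so the result should be read as a rule of thumb rather than a hard requirement.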

Scaling to the fourth core never seems to work well, even without a GPU at
all. Moving from 7 to 8 threads usually showed a drop in performance.
Probably I need a dedicated system core, or is this a general phenomenon?
(I doubt it's general, given the number of clusters that seem to scale well
with 10000s of cores on InfiniBand; intercore communication within a CPU is
probably faster. It's probably context switching from system daemons that
aren't registering in the Unix utility top.)

NVIDIA quotes scaling to faster GPUs (K20x) using a node with 16 available
Xeon cores per GPU. A K20x has about twice as many CUDA cores as a 670 at a
slightly lower clock rate. They quote a 4 ns/day APOA1 rate; I'm getting 2.3
on a single 670 with the CPU cores. Not bad.

My fold-speedups are higher than NVIDIA quotes because I'm using such a
lame CPU (compared to dual eight-core Xeons).


AMBER comparison question:

AMBER seems to scale very poorly across multiple GPUs but is reportedly
almost CPU-independent and has much faster published benchmarks.
Comparing systems is quite difficult, and I don't think I have a detailed
enough understanding of either package yet to compare the settings, although
the run was designed by someone with more experience than me. A single
GTX 680 was quoted at about 60 ns/day (that GPU is about 10% more powerful
than the 670 used here). One 670 here gave 15 ns/day. Not dealing with this
core/GPU balancing act sounds really nice right now. NAMD clearly scales
better over multiple nodes, but for a lab-scale cluster or individual
workstation, should I be looking more at AMBER in terms of speed and lower
equipment costs (no 16-core CPUs required for multiple GPUs)?


Thanks, Aaron Cann



