Single-node GPU benchmarking results



ApoA1 NVE benchmark ApoA1 NPT benchmark

config file: apoa1_nve_cuda.namd

ApoA1 benchmark (92,224 atoms, periodic; 2fs timestep with rigid bonds, 12A cutoff with PME every 2 steps):
directory, compressed tar archive (2.8 MB), zip archive (2.7 MB).


STMV NVE benchmark STMV NPT benchmark

config file: stmv_nve_cuda.namd

STMV benchmark (1,066,628 atoms, periodic; 2fs timestep with rigid bonds, 12A cutoff with PME every 2 steps):
directory, compressed tar archive (28 MB), zip archive (28 MB).


The benchmarking results shown above were obtained using nodes of the NVIDIA PSG cluster. The device selection is shown below for using two GPUs of each system, observing the CPU affinity of each PCI configuration obtained through nvidia-smi topo -m. NAMD depends heavily on good CPU utilization and is optimized for SMT 1 (i.e., no hyperthreading). Best results generally require using all available CPU cores.

  • Ivy Bridge system: dual processor Intel Xeon CPU E5-2690 v2 @ 3.00 GHz, 20 total cores.
    Running with two GPUs: namd2 +p20 +setcpuaffinity +devices 0,4 apoa1_nve_cuda.namd

  • Haswell system: dual processor Intel Xeon CPU E5-2698 v3 @ 2.30 GHz, 32 total cores.
    Running with two P100s: namd2 +p32 +setcpuaffinity +devices 0,1 apoa1_nve_cuda.namd
    Running with two V100s: namd2 +p32 +setcpuaffinity +devices 0,2 apoa1_nve_cuda.namd

  • Skylake system: dual processor Intel Xeon Gold 6148 CPU @ 2.4 GHz, 40 total cores.
    Running with two V100s: namd2 +p40 +setcpuaffinity +devices 0,2 apoa1_nve_cuda.namd

The following two scripts help automate for a given NAMD binary the benchmarking collection process, sweeping through a given range of CPU cores for each of the specified device lists.

  • namd_benchmark_cuda.sh
    The syntax for use is: namd_benchmark_cuda.sh NAMDBIN CONFIG MIN-MAX[:INCR] [DEV[:DEV]*]
    where NAMDBIN is the NAMD binary to benchmark and CONFIG is the configuration file, MIN is the minimum number of cores to use, MAX is the maximum number of cores to use, INCR is the core count increment, and DEV is a comma-separated device list. As an example, the following used on the Haswell V100 platform above sweeps through core counts 29-32 (i.e., +pemap 0-28, ..., +pemap 0-31) for one- and two-GPU configurations (i.e., +devices 0 and +devices 0,2):
    namd_benchmark_cuda.sh namd2 apoa1_nve_cuda.namd 29-32 0:0,2
    The timing results are stored into a Gnuplot-compatible data file designed to plot nanoseconds per day for each CUDA device choice for each number of cores benchmarked.

  • ns_per_day.py
    This Python script is called by namd_benchmark_cuda.sh to calculate nanoseconds per day averaged over the logged TIMING statements. It needs to be located in a directory of the user's PATH to permit its calling from the previous script.



Parallel scaling results



NAMD 2.13 on STMV matrix systems (2fs timestep, 12A cutoff + PME every 3 steps) on OLCF Summit


Preliminary scaling results conducted May 2018 simulating synthetic benchmark systems comprised of tiling STMV in arrays of 5x2x2 (21M atoms) and 7x6x5 (224M atoms). The use of stochastic velocity rescaling thermostat offers up to 20% performance improvement over Langevin damping thermostat by reducing work done by the CPU.



NAMD pre-2.13 on STMV matrix systems (2fs timestep, 12A cutoff + PME every 3 steps) on OLCF Summit


Preliminary scaling results for Summit conducted March 2018 simulating synthetic benchmark systems comprised of tiling STMV in arrays of 5x2x2 (21M atoms) and 7x6x5 (224M atoms). Comparing performance to other platforms.



NAMD 2.12 on STMV matrix systems (2fs timestep, 12A cutoff + PME every 3 steps) on OLCF Titan


Scaling results simulating synthetic benchmark systems comprised of tiling STMV in arrays of 5x2x2 (21M atoms) and 7x6x5 (224M atoms).



NAMD 2.12 on billion-atom STMV matrix (2fs timestep, 12A cutoff + PME every 3 steps)


First billion-atom run demonstrated on synthetic benchmark system comprised of tiling STMV in array of 10x10x10 (1.07B atoms).




Previous results



NAMD 2.11 on STMV (2fs timestep, 12A cutoff + PME every 3 steps) on TACC Stampede



NAMD 2.11 disables asynchronous offload on MIC due to memory/performance leaks in the Intel runtime.
Plot data, job, config, and output files: directory or 153k tar.gz

NAMD 2.10 on STMV (2fs timestep, 12A cutoff + PME every 3 steps) on TACC Stampede




NAMD 2.10 on 20STMV and 210STMV (2fs timestep, 12A cutoff + PME every 3 steps)

Figure from: James C. Phillips, Yanhua Sun, Nikhil Jain, Eric J. Bohm, and Laximant V. Kale.
Mapping to Irregular Torus Topologies and Other Techniques for Petascale Biomolecular Simulation.
In Proceedings of the 2014 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC14), 2014.
abstract, conference, journal
Download this benchmark: directory, 2k .tar.gz,


Previous NAMD performance webpage