Improving Parallel Scaling (NAMD 2.10b2 User's Guide)

Next: NAMD Availability and Installation Up: Running NAMD Previous: Memory Usage Contents Index

Improving Parallel Scaling

While NAMD is designed to be a scalable program, particularly for simulations of 100,000 atoms or more, at some point adding additional processors to a simulation will provide little or no extra performance. If you are lucky enough to have access to a parallel machine you should measure NAMD's parallel speedup for a variety of processor counts when running your particular simulation. The easiest and most accurate way to do this is to look at the ``Benchmark time:'' lines that are printed after 20 and 25 cycles (usually less than 500 steps). You can monitor performance during the entire simulation by adding ``outputTiming steps'' to your configuration file, but be careful to look at the ``wall time'' rather than ``CPU time'' fields on the ``TIMING:'' output lines produced. For an external measure of performance, you should run simulations of both 25 and 50 cycles (see the stepspercycle parameter) and base your estimate on the additional time needed for the longer simulation in order to exclude startup costs and allow for initial load balancing.

Multicore builds scale well within a single node. On machines with more than 32 cores it may be necessary to add a communication thread and run on one fewer core than the machine has. On a 48-core machine this would be run as ``namd2 +p47 +commthread''. Performance may also benefit from setting CPU affinity using the +setcpuaffinity +pemap <map> +commap <map> options described in CPU Affinity above. Experimentation is needed.

We provide standard (UDP), TCP, and ibverbs (InfiniBand) precompiled binaries for Linux clusters. The TCP version may be faster on some networks but the UDP version now performs well on gigabit ethernet. The ibverbs version should be used on any cluster with InfiniBand, and for any other high-speed network you should compile an MPI version.

SMP builds generally do not scale as well across nodes as single-threaded non-SMP builds because the communication thread is both a bottleneck and occupies a core that could otherwise be used for computation. As such they should only be used to reduce memory consumption or if for scaling reasons you are not using all of the cores on a node anyway, and you should run benchmarks to determine the optimal configuration.

Extremely short cycle lengths (less than 10 steps) will limit parallel scaling, since the atom migration at the end of each cycle sends many more messages than a normal force evaluation. Increasing margin from 0 to 1 while doubling stepspercycle and pairlistspercycle may help, but it is important to benchmark. The pairlist distance will adjust automatically, and one pairlist per ten steps is a good ratio.

NAMD should scale very well when the number of patches (multiply the dimensions of the patch grid) is larger or rougly the same as the number of processors. If this is not the case, it may be possible to improve scaling by adding ``twoAwayX yes'' to the config file, which roughly doubles the number of patches. (Similar options twoAwayY and twoAwayZ also exist, and may be used in combination, but this greatly increases the number of compute objects. twoAwayX has the unique advantage of also improving the scalability of PME.)

Additional performance tuning suggestions and options are described at http://www.ks.uiuc.edu/Research/namd/wiki/?NamdPerformanceTuning

Next: NAMD Availability and Installation Up: Running NAMD Previous: Memory Usage Contents Index

http://www.ks.uiuc.edu/Research/namd/