NAMD Wiki: NamdPerformanceTuning
We've tried to make NAMD run any simulation well on any number of processors, and it does a pretty good job. However, a little hand-tuning of config file options, combined with running on the ideal number of processors for the simulation, will maximize performance and parallel efficiency. Some of these options, like pencil PME, are not supported in NAMD 2.6, so you would need to build the latest development version from CVS.
We divide performance tuning into several stages, with later stages more specific to the particular simulation and machine on which it is run.
Stage 1: Eliminate options that hurt scaling and performance
NAMD uses the patch as the fundamental unit of spatial decomposition. Smaller patches mean more parallelism and more efficient pairlist generation. The width of a patch is at least the "PATCH DIMENSION" reported during NAMD startup, plus whatever padding is needed to fill the periodic cell. The minimum dimension is the pairlist distance plus the hydrogen group cutoff (2.5 A) plus the margin. Setting margin has no effect other than to increase the patch size. Except for coarse-grained or other sparse simulations where fewer, larger patches are desired, there is no reason to set margin. It will default to 0 for constant volume simulations or a small fraction of the patch size for constant pressure. If you set the margin to avoid crashes during minimization, remove it for the actual simulation.
The pairlist distance (pairlistDist) is normally set to cutoff + 1.5, with stepsPerCycle defaulting to 20 and pairlistsPerCycle to 2, resulting in pairlists generated every ten steps. Hydrogen groups (a heavy atom and all of its bonded hydrogens) are migrated between patches at the beginning of a cycle. Pairlists are recalculated pairlistsPerCycle times per cycle. If an atom in a patch moves too far, then the local pairlists involving that atom are ignored until the next recalculation and the pairlist tolerance is increased slightly. Thus, the original pairlist distance determines the patch size but is automatically tuned during the simulation, and pairlist violations reduce performance but do not result in missed interactions. Do not reduce stepsPerCycle unless you are getting margin violation errors during the simulation.
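For a typical 12 A cutoff run, the relevant config lines would look something like this (the values are illustrative, not prescriptive):

```
cutoff             12.0
switching          on
switchdist         10.0
pairlistdist       13.5   ;# cutoff + 1.5; determines minimum patch size
stepspercycle      20     ;# default; atoms migrate between patches each cycle
pairlistsPerCycle  2      ;# default; pairlists rebuilt every 10 steps
# margin unset: defaults to 0 (constant volume) or a small fraction
# of the patch size (constant pressure)
```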
The PME grid size should be about one cell per Angstrom in each dimension, with factors of 2, 3, and 5 to be sure that the FFT is efficient (i.e., order N log N complexity). Newer versions of NAMD will pick this automatically if you set "pmeGridSpacing 1.0", but the grid may change when a constant pressure simulation is restarted.
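As an example, the apoa1 benchmark's roughly 109 x 109 x 78 A cell can use a grid like the one below; each dimension gives about 1 A spacing and factors entirely into 2s, 3s, and 5s (the particular values are just an illustration):

```
PMEGridSizeX   108    ;# 2^2 * 3^3
PMEGridSizeY   108
PMEGridSizeZ   80     ;# 2^4 * 5
# or, in newer versions, let NAMD choose the grid:
# pmeGridSpacing 1.0
```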
Set restartFreq to a large multiple of dcdFreq. Writing a restart file uses 6 doubles per atom compared to only 3 floats per atom for a trajectory file. Writing restarts infrequently reduces the amount of data sent to node 0 by 75% and the amount written to disk by 80%. Performance may also improve by the combination of "shiftIOToOne yes", "ldbUnloadOne yes", and "noPatchesOnOne yes" to shift output to processor one.
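A sketch of output settings following this advice (the frequencies are examples only):

```
dcdFreq         5000
restartFreq     50000   ;# a large multiple of dcdFreq
shiftIOToOne    yes     ;# move output work from processor 0 to 1
ldbUnloadOne    yes
noPatchesOnOne  yes
```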
Stage 2: Use the right number of processors
NAMD parallel efficiency should be excellent as long as the number of processors is fewer than the number of patches. You can find the number of patches from the output line "PATCH GRID IS nx BY ny BY nz". The apoa1 benchmark, for example, has numpatches = 6 x 6 x 4 = 144 patches. The first performance "sweet spot" will be numprocs = numpatches + 1 rounded up to the nearest multiple of the number of cores per node. The extra processor (zero) will have no patch and be reserved for serial operations such as output and periodic cell updates. It is probably worth setting "ldbUnloadZero yes" to ensure that no work is scheduled on processor zero.
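For apoa1 on 8-core nodes, the arithmetic works out as follows (a worked example, not a recommendation):

```
# PATCH GRID IS 6 BY 6 BY 4  ->  144 patches
# first sweet spot: 144 + 1 = 145 processors
# rounded up to a multiple of 8 cores/node: 152 processors on 19 nodes
ldbUnloadZero  yes   ;# keep processor 0 free for serial operations
```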
On new clusters (such as NCSA's Abe) with eight cores per node, it may be better to leave one core free to run system processes without interfering with the simulation. This is usually only an issue when time per step is under 10 ms. The only way to tell is to run scaling benchmarks with both 7 and 8 processes per node (PPN). The node count should be increased for the 7PPN case so that numprocs is still ideal. Then calculate performance per node to decide between the two choices.
When doing performance tests, it is important to use a run that does not do minimization and to run for about 1000 steps so that initial load balancing is complete and three "Benchmark time:" lines have been printed. Use the wallclock performance data from these lines. The cpu time values are unreliable.
Setting processor (core) affinity has been observed to reduce noise and increase performance, particularly when running on numcores-1 PPN. Some MPI libraries do this automatically or have the option to do so. Otherwise adding the option +setcpuaffinity to the NAMD command line (not the config file) with newer charm builds will do so.
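With charmrun the affinity flag goes on the command line after the binary, along these lines (launcher syntax varies by platform and charm build; this invocation is only illustrative):

```
charmrun +p152 namd2 +setcpuaffinity apoa1.namd > apoa1.log
```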
Stage 3: Double the number of processors
At this point we need to explicitly control NAMD's patch-splitting code, which tries to divide patches in half to use more processors, by adding "twoAwayX no", "twoAwayY no", and "twoAwayZ no" to the config file.
The second sweet spot is at numprocs = 2 numpatches + 1 (rounded up as above). At this point NAMD will use a second "buddy processor" for every patch, which balances communication and load quite well. Benchmark on this number of processors with both 7 and 8 PPN and processor affinity as described above.
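Putting the settings together, a second-sweet-spot run for a 144-patch system might use (illustrative):

```
twoAwayX  no
twoAwayY  no
twoAwayZ  no
# second sweet spot: 2 * 144 + 1 = 289 processors,
# rounded up to 296 on 8-core nodes (37 nodes)
```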
For larger cutoffs (e.g., 12 A), also test with "twoAwayX yes". Since the smaller patches may fit the periodic cell better, a 4x4x6 patch grid may change to 9x4x6 rather than the expected 8x4x6, so be sure to check the "PATCH GRID" output, then test on numpatches + 1 processors as in stage 2.
Based on your benchmarks, you can now choose between twoAwayX yes/no and 7 or 8 PPN.
Stage 4: PME
By default NAMD uses a slab-based PME decomposition, which minimizes communication. The first step in tuning is to restrict the number of processors used for PME by setting PMEProcessors to half the number of processors being used. This will allow the 3D FFT transpose to be performed with each processor exclusively either sending or receiving data. Further reducing PMEProcessors down to the number of physical nodes in the system will ensure that only one processor per node is sending transpose data, and only one receiving as well, eliminating a potential source of node-level network contention.
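For example, on 512 processors spread across 64 eight-core nodes (hypothetical counts):

```
# half the processor count: each PME processor only sends
# or only receives transpose data
PMEProcessors  256
# or one per physical node, to avoid node-level network contention:
# PMEProcessors  64
```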
Better scaling, especially for large simulations, may be observed with a pencil-based decomposition. Setting, e.g., "PMEPencils 8", will specify an 8x8 grid of pencils for each dimension. The number of pencils per grid (PMEPencils squared) should never be more than half the total number of processors. You will want to benchmark several settings for PMEPencils using the best settings and processor counts determined above.
The next step is to use dedicated processors for PME. This is enabled by setting "ldbUnloadPME yes" and using an additional 3 x PMEPencils^2 processors. You will want to use a smaller value for PMEPencils, possibly as low as 4. Run benchmarks varying PMEPencils (and processor counts) to determine the best PME pencil grid size with dedicated processors.
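A dedicated-PME configuration for benchmarking might look like this (the pencil count is an example starting point):

```
PMEPencils    4      ;# 4 x 4 = 16 pencils per grid
ldbUnloadPME  yes    ;# reserve processors exclusively for PME
# requires 3 * 4^2 = 48 additional processors,
# one per pencil in each of the three grids
```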
Stage 5: Evaluation
At this point you should step back and consider what performance improvement and parallel efficiency you have achieved. If you are happy with the rate at which the simulation is progressing then you need go no further. Consider if using more processors would result in a longer queue wait time in your environment.
Stage 6: Projections
The next step is to use the Charm++ analysis tool Projections to record and inspect the execution of NAMD. This will allow you to see if any bottlenecks are holding back the simulation. General documentation for Projections is at http://charm.cs.uiuc.edu/manuals/html/projections/
We assume below that you are familiar with compiling NAMD on your machine. To build a projections version of NAMD add "CHARMOPTS = -trace projections -trace summary" to the NAMD .arch file and relink (remove any existing binary and run make). You must be using charm libraries built with -optimize in place of --with-production (in older versions eliminate -DCMK_OPTIMIZE=1 but keep -O).
Run your best short benchmark simulation from above with the new projections binary and the options "+logsize 10000000 +traceroot DIR", where DIR is an empty directory (ideally in scratch space) where the large log files will be written. Setting +logsize large enough ensures that all data is kept in memory until the end of the run (disk writes during the run would perturb the performance being measured). Adding +traceoff prevents recording data before load balancing is complete, but this may itself interfere with load balancing, so it is probably best to omit +traceoff if possible.
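A traced run might be launched as follows (the paths, processor count, and binary name are placeholders):

```
charmrun +p152 namd2-projections +logsize 10000000 \
    +traceroot /scratch/namd-trace apoa1.namd
```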
When NAMD exits it will write trace data to DIR, which you can then copy to local scratch space and view with the Java program Projections, included in the tools/projections/bin directory of the charm distribution. It is best to use the same version of charm you used to build NAMD. Make sure your machine has lots of memory, and be prepared to kill projections from the command line if your machine starts paging. Run projections as "charm/tools/projections/bin/projections DIR/namd2.sts" (assuming the binary was called "namd2").
After some loading you will see a utilization graph summary (image). Look for the last high-utilization period in the run, since this corresponds to load-balanced performance and is representative of a longer run, and write down a 100ms range to investigate further. In the Tools menu select the Timeline tool, set processors to "0:3", enter your time range where indicated (default unit is intervals, but s and ms may be specified), click Update, then click OK. After loading the Timeline window (image) will appear; you may need to enlarge the window.
You should see timelines for your selected processors with colored boxes representing "entry methods" (useful message-driven work). Click the "Show Idle" checkbox to add white when the processors are idle. Black represents "system" time spent communicating. You will notice repeated patterns that correspond to timesteps. There are a few typical performance-inhibiting behaviors to look for:
Stretches: Abnormally long entry methods or stretches of (black) system time that occur randomly during the run. We usually blame these on the MPI library or operating system interference.