NAMD Wiki: NamdMemoryReduction
For large simulations of millions of atoms, NAMD may run out of memory, especially on node 0, where all I/O and load balancing takes place. There are several methods for overcoming these limitations, which may be used in combination to enable simulations of hundreds of millions of atoms.
Use an SMP version of NAMD
Each NAMD process stores a copy of the complete molecular structure. When NAMD runs as a separate process for each CPU core, the molecular structure is duplicated many times per node. By using NAMD built on an "smp" (for multiple nodes) or "multicore" (for a single node) version of Charm++, only a single copy of the molecular structure is required per process, with this data shared by the worker threads assigned to the CPU cores. For nodes with 24-64 cores it may be desirable to use 2-4 processes per node, since each process has only one communication thread. Released binaries are available for the Linux-x86_64-ibverbs-smp and Linux-x86_64-multicore platforms.
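For example (a hedged sketch; the core counts, number of nodes, and nodelist file below are illustrative assumptions, not recommendations), the released binaries might be launched roughly as follows:

# multicore build: a single process with one worker thread per core on a 12-core machine
namd2 +p12 apoa1.namd

# ibverbs-smp build: 4 processes of 11 worker threads each, leaving one core per
# process free for its communication thread on 12-core nodes
charmrun ++nodelist nodelist +p44 ++ppn 11 namd2 +setcpuaffinity apoa1.namd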
Use the memory-optimized version of NAMD
NAMD 2.8 and later may be compiled (binaries not available) in an experimental memory-optimized mode that utilizes a compressed version of the molecular structure and also supports parallel I/O. In addition to reducing per-node memory requirements, the compressed structure greatly reduces startup times compared to reading a psf file. Due to the lack of full structure data on every node, various features not normally needed for large simulations are non-functional or restricted. These limits, combined with the small number of projects using memory-optimized NAMD, cause us to caution users that this capability should be considered experimental.
Step 1: Store structure in binary rather than text format
You will want to store your structure in js and coordinates in namdbin format rather than the text-based psf and pdb formats. Converting from psf/pdb to js/namdbin is simple using the following commands with the standalone psfgen binary shipped with NAMD:
readmol psf apoa1.psf pdb apoa1.pdb
writemol js apoa1.js
writemol namdbin apoa1.coor
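If the commands above are saved to a file (the script name here is just an illustration), they can be fed to the standalone psfgen binary on its standard input:

psfgen < convert_apoa1.pgn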
To use a js file with (non-memory-optimized) NAMD you would use the following options:
usePluginIO yes
structure apoa1.js
bincoordinates apoa1.coor
Do not specify a coordinates file with a .js structure file.
(It is also possible to compress directly from a psf file, but the namdbin coordinates are required.)
See http://www.ks.uiuc.edu/Research/vmd/minitutorials/largesystems/ for an example of solvating a very large system using VMD and psfgen.
Step 2: Compress binary structure
Compression is done with a non-memory-optimized build of an identical version of NAMD. While it is possible that a structure compressed with NAMD 2.9 will run with 2.10, the format of the compressed file is version-specific, undocumented, and may change at any time. Compression requires a single node with sufficient memory to load the entire structure (e.g., 64-256 GB of RAM), and may take an hour for systems of 100M atoms. The binary used for compression may be for a different platform than that used to run, e.g., compress on Linux-x86_64 but run on BlueGene/Q.
To generate a compressed structure, add "genCompressedPsf on" to your NAMD config file and run the non-memory-optimized NAMD. When it completes, there will be new files ending in .inter and .inter.bin in the directory that contains your structure file (e.g., apoa1.js would have added apoa1.js.inter and apoa1.js.inter.bin).
genCompressedPsf on
usePluginIO yes
structure apoa1.js
bincoordinates apoa1.coor
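Running this configuration through the non-memory-optimized binary is an ordinary NAMD invocation; a multicore build on a single large-memory node is sufficient. A hedged example, where compress_apoa1.namd is a hypothetical config file containing the lines above and the thread count is illustrative:

namd2 +p8 compress_apoa1.namd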
Step 3: Build memory-optimized NAMD
A memory-optimized build is specified by passing "--with-memopt" to the NAMD config script (see build instructions in the release notes). Normally the memory-optimized version is combined with an smp version of Charm++.
./config CRAY-XE-gnu.smpmemopt --with-memopt --charm-arch gemini_gni-crayxe-persistent-smp
cd CRAY-XE-gnu.smpmemopt
gmake -j4 release
Step 4: Use compressed structure
Modify your NAMD config file by replacing "genCompressedPsf on" with "useCompressedPsf on", commenting out "usePluginIO yes" and adding ".inter" to the end of the structure file name. The .inter.bin file is implied by the name of the structure file. You must always specify a bincoordinates file. Restart files are always written in binary format.
useCompressedPsf on
#usePluginIO yes
structure apoa1.js.inter
bincoordinates apoa1.coor
You are now ready to run the memory-optimized build of NAMD.
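As a hedged illustration for the CRAY-XE build from Step 3 (the node count and the assumption of 32-core XE6 nodes are illustrative), a launch might look roughly like:

# 64 nodes, one process per node, 31 worker threads plus one communication thread per process
aprun -n 64 -N 1 -d 32 ./namd2 +ppn 31 +pemap 1-31 +commap 0 apoa1_memopt.namd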
Parallel output options
numinputprocs <number of processors to use for loading molecular data>
numoutputprocs <number of processors to use for writing trajectory files>
If these parameters are not specified in the configuration file, NAMD will automatically set appropriate values based on the number of atoms in the system.
Although output is parallelized, each type of output (say, the trajectory) is written to a single file by default. NAMD allows the user to specify the number of output processors writing to this file simultaneously through the parameter "numoutputwrts", which defaults to 1. Note that in NAMD the output (i.e. file I/O operations) is overlapped with useful computation, so even with only one processor writing to the output file at a time, the performance of NAMD is only slightly affected.
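For example (a sketch with illustrative, untuned values):

# processors used to load molecular data
numinputprocs 32
# processors used to write trajectory output
numoutputprocs 32
# processors writing to the trajectory file at the same time
numoutputwrts 4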
Load balancer options
"ldBalancer hybrid" - uses hierarchical scheme beyond 512 cores
"ldBalancer none" - disable load balancing completely, in case it takes too long
"noPatchesOnZero yes" - remove bottleneck on core 0, automatic at 64 cores
"ldbUnloadZero yes" - remove bottleneck on core 0
"noPatchesOnOne yes" - remove bottleneck on core 1
"ldbUnloadOne yes" - remove bottleneck on core 1
"maxSelfPart 1" - limit compute count, not needed in NAMD 2.10
"maxPairPart 1" - limit compute count, not needed in NAMD 2.10
"pairlistMinProcs 128" - don't use pairlists when running on fewer than 128 cores
Options for petascale simulations
When running 10-million to 100-million-atom simulations on thousands of nodes the following options may be useful:
"langevinPistonBarrier off" - Introduces a single-timestep lag in pressure adjustments.
"pmeProcessors [numNodes]" - limits PME to one pencil per node for reduced message count
"pmeInterpOrder 8" - allows use of lower (half) resolution PME grid for reduced communication
"pmeGridSpacing 2.1" - don't complain that grid is too coarse
Using a PME grid spacing of 2 Å with an interpolation order of 8 increases the amount of local work needed to spread charges onto the grid and extract forces, but reduces the global network bandwidth needed for PME by a factor of 8. Be sure to actually increase the grid spacing to 2 Å.
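A hedged example of these PME settings, assuming a hypothetical cubic cell roughly 300 Å on a side (the cell and grid dimensions are assumptions for illustration only):

pmeInterpOrder 8
# explicit grid sizes giving an actual spacing of about 2 Å for a 300 Å cell
pmeGridSizeX 150
pmeGridSizeY 150
pmeGridSizeZ 150
# tolerance so NAMD does not complain that the grid is too coarse
pmeGridSpacing 2.1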
Running one process per node (for Cray XE6 nodes) seems to scale better than two or four, although this may be just because it reduces pmeProcessors to one per physical node. An alternative may be "pmeProcessors [numPhysicalNodes]".