next up previous contents index
Next: Xeon Phi Acceleration Up: Running NAMD Previous: CPU Affinity   Contents   Index


CUDA GPU Acceleration

NAMD does not offload the entire calculation to the GPU, and performance may therefore be limited by the CPU. In general all available CPU cores should be used, with CPU affinity set as described above.

Energy evaluation is slower than calculating forces alone, and the loss is much greater in CUDA-accelerated builds. Therefore you should set outputEnergies to 100 or higher in the simulation config file. Forces evaluated on the GPU differ slightly from a CPU-only calculation, an effect more visible in reported scalar pressure values than in energies.
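For instance, the relevant line in the simulation config file is a one-word keyword (the value 100 is the threshold suggested above; larger values reduce the cost further):

```
# NAMD configuration file fragment: on CUDA builds, evaluate and
# print energies only every 100 steps rather than every step.
outputEnergies   100
```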

NAMD now offloads the entire force calculation to the GPU for conventional MD simulation options. However, not all advanced features are compatible with CUDA-accelerated NAMD builds, in particular any simulation option that requires modification of the functional form of the non-bonded forces. Note that QM/MM simulation is not disabled for CUDA-accelerated NAMD, because the calculation is bottlenecked by the QM calculation rather than the MM force calculation, so it can benefit from CUDA acceleration of the QM part when available. Table 1 lists the parts of NAMD that are accelerated with CUDA-capable GPUs, and Table 2 lists the advanced simulation options that are disabled within a CUDA-accelerated NAMD build.

Table 1: NAMD GPU: What is accelerated?

  Accelerated               Not Accelerated
  ----------------------    --------------------
  short-range non-bonded    integration
  PME reciprocal sum        rigid bonds
  bonded terms              grid forces
  implicit solvent          collective variables
  (NVIDIA GPUs only)

Table 2: NAMD GPU: What features are disabled?

  Disabled                     Not Disabled
  -------------------------    --------------------------
  Alchemical (FEP and TI)      Memory optimized builds
  Locally enhanced sampling    Conformational free energy
  Tabulated energies           Collective variables
  Drude (nonbonded Thole)      Grid forces
  Go forces                    Steering forces
  Pairwise interaction         Almost everything else
  Pressure profile

To benefit from GPU acceleration you will need a CUDA build of NAMD and a recent NVIDIA video card. CUDA builds will not function without a CUDA-capable GPU and a driver that supports CUDA 8.0. If the installed driver is too old NAMD will exit on startup with the error ``CUDA driver version is insufficient for CUDA runtime version.''

Finally, if NAMD was not statically linked against the CUDA runtime, then the CUDA runtime library (libcudart) included with the binary (copied from the version of CUDA it was built with) must be in a directory in your LD_LIBRARY_PATH before any other libraries. For example, when running a multicore binary (recommended for a single machine):

  ./namd2 +p8 +setcpuaffinity <configfile>

Each namd2 thread can use only one GPU. Therefore you will need to run at least one thread for each GPU you want to use. Multiple threads in an SMP build of NAMD can share a single GPU, usually with an increase in performance. NAMD will automatically distribute threads equally among the GPUs on a node. Specific GPU device IDs can be requested via the +devices argument on the namd2 command line, for example:

  ./namd2 +p8 +setcpuaffinity +devices 0,2 <configfile>

Devices are shared by consecutive threads in a process, so in the above example threads 0-3 will share device 0 and threads 4-7 will share device 2. Repeating a device will cause it to be assigned to multiple master threads; this is allowed only for threads of different processes, and while generally inadvisable it may be faster in certain cases. When running on multiple nodes, the +devices specification is applied to each physical node separately, and there is no way to provide a unique list for each node.

When running a multi-node parallel job it is recommended to have one process per device to maximize the number of communication threads. If the job launch system enforces device segregation such that not all devices are visible to each process then the +ignoresharing argument must be used to disable the shared-device error message.
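As an illustration only (the node count, core counts, and host file are hypothetical, not prescriptions), a launch on two nodes with 2 GPUs and 16 cores each, one process per device, might look like:

```
# Hypothetical: 2 nodes x 2 GPUs = 4 processes; 7 worker threads
# each; +ignoresharing suppresses the shared-device error when the
# launch system hides all but one device from each process.
./charmrun ++nodelist hostfile +p28 ++ppn 7 ./namd2 +ignoresharing +setcpuaffinity <configfile>
```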

When running a multi-copy simulation with both multiple replicas and multiple devices per physical node, the +devicesperreplica <n> argument must be used to prevent each replica from binding all of the devices. For example, for 2 replicas per 6-device host use +devicesperreplica 3.
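A sketch of a corresponding multi-copy launch for the 2-replicas-per-6-device example (the process and thread layout here is an illustrative assumption, not the only valid one):

```
# Hypothetical: 12 PEs in 2 processes (one per replica, 6 threads
# each); each replica binds 3 of the host's 6 devices.
./charmrun +p12 ++ppn 6 ./namd2 +replicas 2 +devicesperreplica 3 <configfile>
```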

GPUs of compute capability less than 3.0 are no longer supported and are ignored. GPUs with two or fewer multiprocessors are ignored unless specifically requested with +devices.

While charmrun with ++local will preserve LD_LIBRARY_PATH, normal charmrun does not. You can use charmrun ++runscript to add the namd2 directory to LD_LIBRARY_PATH with the following executable runscript:
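The runscript listing itself is not reproduced here; a minimal POSIX-shell sketch of such a script (the exact quoting and use of `exec` are assumptions, not the distributed file) would prepend the directory of the target binary, which charmrun passes as the first argument, to LD_LIBRARY_PATH:

```shell
#!/bin/sh
# runscript: charmrun invokes this with the original command line
# as its arguments, namd2 binary first.
DIR="$(cd "$(dirname "$1")" && pwd)"   # directory containing namd2
export LD_LIBRARY_PATH="$DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
exec "$@"                              # run namd2 with its arguments
```

Remember to make the script executable (chmod +x runscript).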


For example:

  ./charmrun ++runscript ./runscript +p60 ./namd2 ++ppn 15 <configfile>

An InfiniBand network is highly recommended when running CUDA-accelerated NAMD across multiple nodes. You will need either an ibverbs NAMD binary (available for download) or an MPI NAMD binary (must build Charm++ and NAMD as described above) to make use of the InfiniBand network. The use of SMP binaries is also recommended when running on multiple nodes, with one process per GPU and as many threads as available cores, reserving one core per process for the communication thread.
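Concretely, the rule of one process per GPU with one core per process reserved for the communication thread can be sketched for a hypothetical cluster of nodes with 2 GPUs and 16 cores each (ibverbs SMP binary assumed; node count and host file are examples):

```
# Hypothetical: 4 nodes x 2 GPUs = 8 processes; 16 cores per node
# minus 1 communication core per process leaves 7 workers each,
# so +p = 8 processes x 7 worker threads = 56.
./charmrun ++nodelist hostfile +p56 ++ppn 7 ./namd2 +setcpuaffinity <configfile>
```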

The CUDA (NVIDIA's graphics processor programming platform) code in NAMD is completely self-contained and does not use any of the CUDA support features in Charm++. When building NAMD with CUDA support you should use the same Charm++ you would use for a non-CUDA build. Do NOT add the cuda option to the Charm++ build command line. The only changes to the build process are to add --with-cuda and possibly --cuda-prefix ... to the NAMD config command line.
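A sketch of the corresponding build commands (the Charm++ target and CUDA install path are examples, not prescriptions):

```
# Build Charm++ exactly as for a non-CUDA NAMD build -- no cuda option:
./build charm++ verbs-linux-x86_64 smp --with-production

# Then enable CUDA in the NAMD config step:
./config Linux-x86_64-g++ --with-cuda --cuda-prefix /usr/local/cuda
```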

NAMD does not yet support all of its features on GPUs, so several configuration keywords are provided to give the user finer control over the calculation. These keywords are relevant only for CUDA builds and are ignored in CPU builds.

