CUDA GPU Acceleration (NAMD 2.13b2 User's Guide)

CUDA GPU Acceleration

NAMD does not offload the entire calculation to the GPU, so performance may be limited by the CPU. In general, all available CPU cores should be used, with CPU affinity set as described above.

Energy evaluation is slower than calculating forces alone, and the loss is much greater in CUDA-accelerated builds. Therefore you should set outputEnergies to 100 or higher in the simulation config file. Some features are unavailable in CUDA builds, including alchemical free energy perturbation and the Lowe-Andersen thermostat.
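
As a quick pre-flight check, the outputEnergies recommendation above can be tested with a small Bourne shell sketch. The function names are hypothetical, and the sketch assumes the config file sets the option as a literal integer in the plain "keyword value" form (not through a Tcl variable):

```shell
# Print the outputEnergies value found in a NAMD config file
# (the last setting wins); print nothing if the keyword is absent.
# The keyword is matched case-insensitively.
get_output_energies() {
    awk 'tolower($1) == "outputenergies" { v = $2 } END { if (v != "") print v }' "$1"
}

# Return 0 if outputEnergies is a literal integer >= 100, else 1.
check_output_energies() {
    v=$(get_output_energies "$1")
    case "$v" in
        ''|*[!0-9]*)
            echo "outputEnergies not set to a literal integer in $1"
            return 1 ;;
    esac
    if [ "$v" -lt 100 ]; then
        echo "outputEnergies is $v; consider 100 or higher for CUDA builds"
        return 1
    fi
    echo "outputEnergies is $v"
}
```

Calling check_output_energies with the path to your config file before a production run gives a one-line verdict.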

As this is a new feature, you are encouraged to test all simulations before beginning production runs. Forces evaluated on the GPU differ slightly from a CPU-only calculation; the difference is more visible in reported scalar pressure values than in energies.

To benefit from GPU acceleration you will need a CUDA build of NAMD and a recent high-end NVIDIA video card. CUDA builds will not function without a CUDA-capable GPU and a driver that supports CUDA 6.0. If the installed driver is too old NAMD will exit on startup with the error "CUDA driver version is insufficient for CUDA runtime version".

Finally, if NAMD was not statically linked against the CUDA runtime then the libcudart.so file included with the binary (copied from the version of CUDA it was built with) must be in a directory in your LD_LIBRARY_PATH before any other libcudart.so libraries. For example, when running a multicore binary (recommended for a single machine):

  setenv LD_LIBRARY_PATH ".:$LD_LIBRARY_PATH"
  (or LD_LIBRARY_PATH=".:$LD_LIBRARY_PATH"; export LD_LIBRARY_PATH)
  ./namd2 +p8 +setcpuaffinity <configfile>
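
For Bourne-type shells, the same setup can be sketched as two small helpers. The function names are hypothetical; ldd is the standard Linux tool for listing the shared libraries a binary would load, and is used here only to confirm which libcudart.so is picked up:

```shell
# Put a directory at the front of LD_LIBRARY_PATH without leaving
# a stray trailing colon when the variable was empty or unset.
prepend_lib_dir() {
    LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
}

# Show which libcudart the dynamic loader would resolve for a binary.
which_cudart() {
    ldd "$1" | grep libcudart
}
```

A typical session would call prepend_lib_dir . from the directory containing namd2, then which_cudart ./namd2 to check that the reported path points at the bundled library.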

Each namd2 thread can use only one GPU. Therefore you will need to run at least one thread for each GPU you want to use. Multiple threads can share a single GPU, usually with an increase in performance. NAMD will automatically distribute threads equally among the GPUs on a node. Specific GPU device IDs can be requested via the +devices argument on the namd2 command line, for example:

  ./namd2 +p8 +setcpuaffinity +devices 0,2 <configfile>

Devices are shared by consecutive threads in a process, so in the above example threads 0-3 will share device 0 and threads 4-7 will share device 2. Repeating a device in the list assigns it to multiple master threads. This is allowed only when those threads belong to different processes; it is advised against in general but may be faster in certain cases. When running on multiple nodes, the +devices specification is applied to each physical node separately, and there is no way to provide a unique list for each node.
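
The consecutive-thread rule can be illustrated with a short shell sketch. The helper name device_map is hypothetical, and the sketch assumes the thread count divides evenly by the number of listed devices; it only mimics the assignment described above and does not query NAMD:

```shell
# device_map NTHREADS DEVICE_LIST
# Print which consecutive block of threads shares each listed device.
device_map() {
    nthreads=$1
    old_ifs=$IFS
    IFS=,
    set -- $2          # split the comma-separated device list
    IFS=$old_ifs
    per=$((nthreads / $#))   # threads per device (assumed to divide evenly)
    first=0
    for dev in "$@"; do
        last=$((first + per - 1))
        echo "threads $first-$last -> device $dev"
        first=$((last + 1))
    done
}

device_map 8 0,2
```

For +p8 with +devices 0,2 this prints one line for threads 0-3 on device 0 and one for threads 4-7 on device 2.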

When running a multi-node parallel job it is recommended to have one process per device to maximize the number of communication threads. If the job launch system enforces device segregation such that not all devices are visible to each process then the +ignoresharing argument must be used to disable the shared-device error message.

When running a multi-copy simulation with both multiple replicas and multiple devices per physical node, the +devicesperreplica n argument must be used to prevent each replica from binding all of the devices. For example, for 2 replicas per 6-device host use +devicesperreplica 3.
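
The value of n is simply the number of devices per host divided by the number of replicas per host. A minimal shell sketch, with a hypothetical helper name and the assumption that the devices divide evenly among the replicas:

```shell
# devices_per_replica DEVICES_PER_HOST REPLICAS_PER_HOST
# Print the value to pass to +devicesperreplica.
devices_per_replica() {
    if [ $(($1 % $2)) -ne 0 ]; then
        echo "devices do not divide evenly among replicas" >&2
        return 1
    fi
    echo $(($1 / $2))
}

echo "+devicesperreplica $(devices_per_replica 6 2)"
```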

GPUs of compute capability below 3.0 are no longer supported and are ignored. GPUs with two or fewer multiprocessors are ignored unless specifically requested with +devices.

While charmrun with ++local will preserve LD_LIBRARY_PATH, normal charmrun does not. You can use charmrun ++runscript to add the namd2 directory to LD_LIBRARY_PATH with the following executable runscript:

  #!/bin/csh
  setenv LD_LIBRARY_PATH "${1:h}:$LD_LIBRARY_PATH"
  $*

For example:

  ./charmrun ++runscript ./runscript +p60 ./namd2 ++ppn 15 <configfile>
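
If csh is not available, a Bourne shell equivalent of the runscript can be sketched as follows. The file name runscript.sh is an arbitrary choice, and the sketch assumes, as in the csh version, that the first argument is the path of the program to start:

```shell
# Write a Bourne shell runscript and make it executable.
cat > runscript.sh <<'EOF'
#!/bin/sh
# $1 is the program to start (for example ./namd2); prepend its
# directory so that the libcudart.so shipped with it is found first.
LD_LIBRARY_PATH="$(dirname "$1")${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export LD_LIBRARY_PATH
exec "$@"
EOF
chmod +x runscript.sh
```

It would be passed to charmrun in the same way, as ++runscript ./runscript.sh.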

An InfiniBand network is highly recommended when running CUDA-accelerated NAMD across multiple nodes. You will need either an ibverbs NAMD binary (available for download) or an MPI NAMD binary (must build Charm++ and NAMD as described above) to make use of the InfiniBand network. The use of SMP binaries is also recommended when running on multiple nodes, with one process per GPU and as many threads as available cores, reserving one core per process for the communication thread.
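
The recommended SMP layout reduces to a little arithmetic, sketched here in shell. The helper name smp_layout is hypothetical, and the sketch assumes identical nodes whose cores divide evenly among their GPUs:

```shell
# smp_layout NODES GPUS_PER_NODE CORES_PER_NODE
# One process per GPU; each process gets an equal share of the node's
# cores and reserves one of them for its communication thread.
smp_layout() {
    nodes=$1
    gpus=$2
    cores=$3
    ppn=$((cores / gpus - 1))      # worker threads per process
    nproc=$((nodes * gpus))        # total processes
    echo "+p$((nproc * ppn)) ++ppn $ppn"
}

smp_layout 2 2 32
```

For two nodes with two GPUs and 32 cores each, this yields the +p60 and ++ppn 15 values used in the charmrun example above.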

The CUDA code in NAMD is completely self-contained and does not use any of the CUDA support features in Charm++ (CUDA is NVIDIA's graphics processor programming platform). When building NAMD with CUDA support you should use the same Charm++ you would use for a non-CUDA build. Do NOT add the cuda option to the Charm++ build command line. The only changes needed to the build process are to add --with-cuda and possibly --cuda-prefix ... to the NAMD config command line.
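
As an illustration, a CUDA-enabled config command line might be assembled as below. The architecture names and the CUDA install prefix are assumptions chosen for a typical 64-bit Linux multicore build; substitute the values that match your own Charm++ build and CUDA installation:

```shell
# Assumed values; adjust for your system.
NAMD_ARCH=Linux-x86_64-g++        # NAMD build architecture
CHARM_ARCH=multicore-linux64      # same Charm++ build as for a non-CUDA NAMD
CUDA_PREFIX=/usr/local/cuda       # where the CUDA toolkit is installed

config_cmd="./config $NAMD_ARCH --charm-arch $CHARM_ARCH --with-cuda --cuda-prefix $CUDA_PREFIX"
echo "$config_cmd"
```

The sketch only prints the command so it can be reviewed first; run the printed line from the top of the NAMD source tree.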

http://www.ks.uiuc.edu/Research/namd/