The following guide is adapted from the notes_sycl.txt file included in the source code tar ball discussed below.

NAMD 2.15 Release Notes for Intel® Max Series GPU support

Intel GPUs are supported with a new code path implemented in SYCL (https://www.khronos.org/sycl/) together with Intel's oneAPI toolkit (https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html), which is used to build the SYCL code and provides some additional library support.

NAMD's SYCL code presently provides GPU-offload functionality, with kernels implemented for the non-bonded and bonded force terms and for PME. For now, the existing NAMD simulation parameters (e.g., bondedCUDA, usePMECUDA, PMEoffload) are reused to control the corresponding SYCL kernel behavior. Development work to port NAMD's GPU-resident CUDA kernels is ongoing.
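For reference, a minimal sketch of how these parameters might appear in a simulation config file; the values shown are only illustrative, and the STMV and multi-node examples later in these notes give recommended settings:

  • bondedCUDA 0 ;# keep the bonded force terms on the CPU
  • usePMECUDA off ;# disable offloading the non-scalable parts of PME
  • PMEoffload on ;# offload the scalable PME grid calculations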

Building the code requires setting up Intel's oneAPI. Download the oneAPI base toolkit to obtain the DPC++ compiler (dpcpp), oneMKL, and oneDPL (https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html).
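Before building, load the oneAPI environment into the current shell. Assuming the default installation prefix of /opt/intel/oneapi, this is typically done with:

  • source /opt/intel/oneapi/setvars.sh  # path assumes a default oneAPI installation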

NAMD has been tested with Intel oneAPI 2023.2.1 and oneAPI 2024.0, the agama-ci-devel/736.25 GPU driver, and Ubuntu 22.04 on the Intel® Data Center GPU Max 1550 and Intel® Data Center GPU Max 1100.
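To confirm that the driver stack and GPUs are visible to SYCL, the sycl-ls utility shipped with the DPC++ compiler lists the available devices; the Max Series GPUs should be reported under the Level-Zero backend:

  • sycl-ls  # lists the devices visible to each SYCL backend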

Obtaining the code

NAMD's SYCL code is available in two forms. The first is a source tar ball, NAMD_2.15alpha3_Source.tar.gz, available for download; unpacking it is described below.

An alternative way to get the source code is through NAMD's repository hosted on GitLab. Obtaining access to the repository requires registering at https://www.ks.uiuc.edu/Research/namd/gitlabrequest.html, and it might take 24 hours for access to be enabled. Once access has been enabled, the code is retrieved as follows:

  • git clone https://gitlab.com/tcbgUIUC/namd.git
  • cd namd
  • git checkout oneapi-forces
The tar ball is simply a snapshot of this "oneapi-forces" branch:
  • tar xzf NAMD_2.15alpha3_Source.tar.gz
  • cd namd

Building multicore version

Building a complete NAMD binary from source code requires:

  • a compiled version of the Charm++/Converse library;
  • a compiled version of the TCL library and its header files;
  • a compiled version of the FFTW library and its header files;
  • a C shell (csh/tcsh) to run the script used to configure the build.
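On Ubuntu 22.04 (the tested OS), the C shell can usually be installed from the distribution packages; the package name below is the standard Ubuntu one and may differ on other distributions:

  • sudo apt-get install tcsh  # package name on Ubuntu; other distributions may differ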

Precompiled TCL and FFTW libraries are available from http://www.ks.uiuc.edu/Research/namd/libraries/. From within the base level namd directory, either from the tar ball or from cloning the repo, issue the following commands:

  • wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64-threaded.tar.gz
  • tar xzf tcl8.5.9-linux-x86_64-threaded.tar.gz
  • ln -s tcl8.5.9-linux-x86_64-threaded tcl
  • wget http://www.ks.uiuc.edu/Research/namd/libraries/fftw-linux-x86_64.tar.gz
  • tar xzf fftw-linux-x86_64.tar.gz
  • ln -s linux-x86_64 fftw
With directory links named "tcl" and "fftw" in place, the build system automatically configures itself to use these libraries.

Charm++ built for NAMD/SYCL has additional requirements beyond the version 7.0.0 release, so it is recommended to build the latest version from GitHub. Do the following from the NAMD base level directory:

  • git clone https://github.com/UIUC-PPL/charm.git
  • cd charm
  • ./build charm++ multicore-linux-x86_64 icx -j --with-production
  • cd ..
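As an optional sanity check of the Charm++ build, the megatest program bundled with Charm++ can be built and run; the exact test path below follows the usual Charm++ layout and may vary between versions:

  • cd charm/multicore-linux-x86_64-icx/tests/charm++/megatest
  • make pgm  # builds the megatest binary
  • ./pgm +p4  # run the tests on 4 PEs
  • cd ../../../../..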

The recommended NAMD build arch file for SYCL/DPC++ is arch/Linux-x86_64-dpcpp-AOT.arch, which performs ahead-of-time (AOT) compilation of the SYCL kernels. In the base level NAMD directory do:

  • ./config Linux-x86_64-dpcpp-AOT --charm-arch multicore-linux-x86_64-icx
  • cd Linux-x86_64-dpcpp-AOT
  • make -j
The namd2 binary runs a simulation config file (usually .namd or .conf) and performs molecular dynamics after loading the input data files. The following sections demonstrate running NAMD.
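For orientation, below is a heavily abridged sketch of what such a config file typically contains; all file names and values here are placeholders, and the test and benchmark archives used later in these notes already include complete, ready-to-run config files:

  • structure mysystem.psf ;# molecular structure (PSF) file (placeholder name)
  • coordinates mysystem.pdb ;# initial coordinates (PDB) file (placeholder name)
  • paraTypeCharmm on ;# CHARMM-format force-field parameters
  • parameters par_all36_prot.prm ;# force-field parameter file (placeholder name)
  • temperature 300 ;# initial temperature in K
  • timestep 1.0 ;# integration timestep in fs
  • cutoff 12.0 ;# non-bonded cutoff in Angstroms
  • outputName mysystem_out ;# prefix for output files
  • run 500 ;# number of MD steps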

Basic testing

Running smoke tests on alanin (66 atoms):
Test explicit scaling on a 2-tile GPU device (run inside the build directory; the +splittiles argument is critical for correct operation here). In order to use the +splittiles option, make sure the environment variable ZE_FLAT_DEVICE_HIERARCHY is set to COMPOSITE:

  • export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
  • ./namd2 +nostreaming +p4 +devices 0,1 +splittiles +platform "Intel(R) Level-Zero" src/alanin

Test single-tile or implicit scaling on the GPU device (run inside the build directory):

  • ./namd2 +nostreaming +p4 +devices 0 +platform "Intel(R) Level-Zero" src/alanin

Test a single tile ONLY on the GPU device by restricting Level-Zero to device 0, tile 0 (run inside the build directory):

  • export ZE_AFFINITY_MASK=0.0
  • ./namd2 +nostreaming +p4 +devices 0 +platform "Intel(R) Level-Zero" src/alanin

Running debug system "tiny" (507 atoms):

  • wget https://www.ks.uiuc.edu/Research/namd/utilities/tiny.tar.gz
  • tar xvf tiny.tar.gz
  • ./namd2 +nostreaming +p4 +devices 0 +platform "Intel(R) Level-Zero" tiny/tiny.namd

Testing a small benchmark system

Running benchmark system ApoA1 (~92k atoms):

  • wget https://www.ks.uiuc.edu/Research/namd/utilities/apoa1.tar.gz
  • tar xvf apoa1.tar.gz
The apoa1_nve_cuda.namd configuration file is better suited for leveraging GPU hardware than the config file included in the archive.
  • cd apoa1
  • wget https://www.ks.uiuc.edu/Research/namd/2.13/benchmarks/apoa1_nve_cuda.namd
  • cd ..
Testing ApoA1 on a single ATS/PVC GPU tile with 8 host cores, using the Level-Zero backend:
  • export ZE_AFFINITY_MASK=0.0
  • ./namd2 +nostreaming +p8 +devices 0 +platform "Intel(R) Level-Zero" apoa1/apoa1_nve_cuda.namd
Collecting the non-bonded force kernel time:
  • /opt/intel/oneapi/pti-gpu/onetrace/onetrace -d -o namd-timing.txt ./namd2 +nostreaming +p8 +devices 0 +platform "Intel(R) Level-Zero" apoa1/apoa1_nve_cuda.namd
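The resulting namd-timing.txt report lists per-kernel device timings; a simple text filter can pull out the lines of interest (the "nonbond" pattern below is an assumption about the SYCL kernel naming and may need adjusting for your build):

  • grep -i nonbond namd-timing.txt  # "nonbond" is an assumed name pattern; adjust to match the kernel names in the report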

Running larger benchmark systems

Different benchmark systems have been tested with the SYCL/DPC++ build. F1-ATPase has ~327.5k atoms, and STMV has ~1.06M atoms.

Running the F1-ATPase benchmark:

  • wget https://www.ks.uiuc.edu/Research/namd/utilities/f1atpase.tar.gz
  • tar xzf f1atpase.tar.gz
  • ./namd2 +nostreaming +p64 +devices 0,1 +splittiles +pemap 4-67 +platform "Intel(R) Level-Zero" f1atpase/f1atpase.namd

Running the STMV benchmark:

  • wget https://www.ks.uiuc.edu/Research/namd/utilities/stmv.tar.gz
  • tar xzf stmv.tar.gz
The stmv_nve_cuda.namd configuration file is better suited for leveraging GPU hardware than the config file included in the archive.
  • cd stmv
  • wget https://www.ks.uiuc.edu/Research/namd/2.13/benchmarks/stmv_nve_cuda.namd
  • cd ..
Performance is improved by offloading only the non-bonded force kernels and keeping the bonded calculations on the CPU (--bondedCUDA 0). The following runs multicore NAMD on 2 tiles:
  • ./namd2 +nostreaming +p64 +devices 0,1 +splittiles +pemap 4-67 +platform "Intel(R) Level-Zero" --bondedCUDA 0 stmv/stmv_nve_cuda.namd

Building and running multi-node version

Multi-node NAMD requires building a version of Charm++ that supports your parallel computer or cluster network. Shown below is how Charm++ can be built to use the Intel MPI installation:

  • cd ../charm
  • CC=icx; CXX=icpx; F90=ifort; F77=ifort; MPICXX=mpiicpc; MPI_CXX=mpiicpc
  • I_MPI_CC=icx; I_MPI_CXX=icpx; I_MPI_F90=ifort; I_MPI_F77=ifort
  • export I_MPI_CC I_MPI_CXX I_MPI_F90 I_MPI_F77 CC CXX F90 F77 MPICXX MPI_CXX
  • ./build charm++ mpi-linux-x86_64 smp mpicxx --with-production
  • cd ..

Once a working multi-node version of Charm++ is built, NAMD needs to be built using that new build of Charm++:

  • ./config Linux-x86_64-dpcpp-AOT.mpi-smp --charm-arch mpi-linux-x86_64-smp-mpicxx
  • cd Linux-x86_64-dpcpp-AOT.mpi-smp
  • make -j

Modify the environment for multi-node support:

  • unset ZE_AFFINITY_MASK
  • export ZE_FLAT_DEVICE_HIERARCHY=FLAT
Setting ZE_FLAT_DEVICE_HIERARCHY to "FLAT" exposes each tile as a separate device, so the "+splittiles" option should no longer be used.

Running on 1 PVC GPU (2 tiles):

  • mpirun -perhost 2 ./namd2 +ppn 31 +pemap 1-31:32.31,57-87:32.31 +commap 0-31:32,56-87:32 +devices 0,1 +platform "Intel(R) Level-Zero" stmv/stmv_nve_cuda.namd
Running on 4 PVC GPUs (8 tiles):
  • mpirun -perhost 8 ./namd2 +ppn 7 +pemap 1-31:8.7,57-87:8.7 +commap 0-31:8,56-87:8 +devices 0,1,2,3,4,5,6,7 +platform "Intel(R) Level-Zero" stmv/stmv_nve_cuda.namd
Running on multiple nodes, each with 4 PVC GPUs (8 tiles per node); adjust the salloc -N argument to the desired node count:
  • salloc -C IB -N 1 mpirun -perhost 8 ./namd2 +ppn 7 +pemap 1-31:8.7,57-87:8.7 +commap 0-31:8,56-87:8 +devices 0,1,2,3,4,5,6,7 +platform "Intel(R) Level-Zero" stmv/stmv_nve_cuda.namd

Multi-node/multi-process NAMD runs are best scheduled with 1 device (tile) per rank (process). When running NAMD in this manner, the scalable parts of the PME algorithm (charge spreading to the grid and force interpolation from the grid) can be executed across the devices. The following options can be added to the config file to control offloading these calculations (and to prevent the non-scalable PME calculations from being executed on a device):

  • PMEoffload on ;# scalable PME calculations require one rank per device
  • usePMECUDA off ;# disable offloading the non-scalable parts of PME
These (and other) options can also be set when launching NAMD:
  • --PMEoffload on --usePMECUDA off

This code is still evolving and will be updated as needed. Stay tuned!

For more information about NAMD and support inquiries:

For general NAMD information, see the main NAMD home page http://www.ks.uiuc.edu/Research/namd/

For your convenience, NAMD has been ported to and will be installed on the machines at the NSF-sponsored national supercomputing centers. If you are planning substantial simulation work of an academic nature you should apply for these resources. Benchmarks for your proposal are available at http://www.ks.uiuc.edu/Research/namd/performance.html

The Theoretical and Computational Biophysics Group encourages NAMD users to be closely involved in the development process by reporting bugs, contributing fixes, participating in periodic surveys, and via other means. Questions or comments may be directed to namd@ks.uiuc.edu.

We are eager to hear from you, and thank you for using our software!