NAMD 2.15 Alpha, source code for x86 Intel GPU
The following guide is adapted from the notes_sycl.txt file included in the source code tar ball discussed below.
NAMD 2.15 Release Notes for Intel® Max Series GPU support
Intel GPUs are supported with a new code path implemented in SYCL (https://www.khronos.org/sycl/) together with Intel's new oneAPI toolkit (https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html) used to build the SYCL code and provide some additional library support.
NAMD's SYCL code presently provides GPU-offload functionality, with kernels implemented for the non-bonded and bonded force terms and for PME. For now, the existing NAMD simulation parameters (e.g., bondedCUDA, usePMECUDA, PMEoffload) are reused to control the SYCL kernel behavior in the same way. Development work to port NAMD's GPU-resident CUDA kernels is ongoing.
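For reference, these parameters are set in the NAMD configuration file just as for the CUDA builds; a minimal sketch (the values shown are illustrative, not tuning recommendations):

bondedCUDA 255   ;# bitmask selecting which bonded force terms are offloaded
usePMECUDA on    ;# offload the non-scalable parts of PME
PMEoffload on    ;# offload the scalable PME charge spreading and force interpolation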
Building the code requires setting up Intel's oneAPI. Download the oneAPI base toolkit to obtain the DPC++ compiler (dpcpp), oneMKL, and oneDPL (https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html).
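After installing the toolkit, the oneAPI environment is typically initialized by sourcing the setvars.sh script; the path below assumes the default system-wide installation location:

source /opt/intel/oneapi/setvars.sh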
NAMD has been tested with Intel oneAPI 2023.2.1, oneAPI 2024.0, agama-ci-devel/736.25, Ubuntu 22.04 on Intel® Data Center GPU Max 1550, Intel® Data Center GPU Max 1100.
Obtaining the code
NAMD's SYCL code is available in two forms. A source tar ball is provided:
- NAMD 2.15alpha3_Source.tar.gz (Dec 18, 2023)
Includes current snapshot of Charm++ source code repository
An alternative way to get the source code is through NAMD's repository hosted on GitLab. Access to the repository requires registering at https://www.ks.uiuc.edu/Research/namd/gitlabrequest.html, which may take up to 24 hours. Once access has been enabled, the code is retrieved as follows:
git clone https://gitlab.com/tcbgUIUC/namd.git
cd namd
git checkout oneapi-forces
If building from the source tar ball instead, extract it and enter the resulting directory:

tar xzf NAMD_2.15alpha3_Source.tar.gz
cd namd
Building multicore version
Building a complete NAMD binary from source code requires:
- a compiled version of the Charm++/Converse library;
- a compiled version of the TCL library and its header files;
- a compiled version of the FFTW library and its header files;
- a C shell (csh/tcsh) to run the script used to configure the build.
Precompiled TCL and FFTW libraries are available from http://www.ks.uiuc.edu/Research/namd/libraries/. From within the base level namd directory (whether obtained from the tar ball or by cloning the repo), issue the following commands:
wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64-threaded.tar.gz
tar xzf tcl8.5.9-linux-x86_64-threaded.tar.gz
ln -s tcl8.5.9-linux-x86_64-threaded tcl
wget http://www.ks.uiuc.edu/Research/namd/libraries/fftw-linux-x86_64.tar.gz
tar xzf fftw-linux-x86_64.tar.gz
ln -s linux-x86_64 fftw
Charm++ built for NAMD/SYCL has additional requirements beyond the version 7.0.0 release, so it is recommended to build the latest version from GitHub. Do the following from the namd base level directory:
git clone https://github.com/UIUC-PPL/charm.git
cd charm
./build charm++ multicore-linux-x86_64 icx -j --with-production
cd ..
The recommended NAMD build arch file for SYCL/DPC++ is arch/Linux-x86_64-dpcpp-AOT.arch (AOT for ahead-of-time), which performs ahead-of-time compilation of the SYCL kernels. In the base level NAMD directory do:
./config Linux-x86_64-dpcpp-AOT --charm-arch multicore-linux-x86_64-icx
cd Linux-x86_64-dpcpp-AOT
make -j
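Before running the tests below, it can be useful to confirm that the Intel GPUs are visible to the SYCL runtime, e.g., with the sycl-ls utility that ships with the oneAPI toolkit (the listed Level-Zero devices should include the Intel GPUs):

sycl-ls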
Basic testing
Running smoke tests on alanin (66 atoms):
Test explicit scaling on a 2-tile GPU device (run inside the build directory); the +splittiles argument is required for correct operation here. To use the +splittiles option, make sure the environment variable ZE_FLAT_DEVICE_HIERARCHY is set to COMPOSITE:
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
./namd2 +nostreaming +p4 +devices 0,1 +splittiles +platform "Intel(R) Level-Zero" src/alanin
Test single tile or implicit scaling on the GPU device (run inside the build directory, Linux-x86_64-dpcpp-AOT):
./namd2 +nostreaming +p4 +devices 0 +platform "Intel(R) Level-Zero" src/alanin
Test single tile ONLY on the GPU device (run inside the build directory, Linux-x86_64-dpcpp-AOT); setting ZE_AFFINITY_MASK=0.0 restricts the Level-Zero runtime to device 0, tile 0:
export ZE_AFFINITY_MASK=0.0
./namd2 +nostreaming +p4 +devices 0 +platform "Intel(R) Level-Zero" src/alanin
Running debug system "tiny" (507 atoms):
wget https://www.ks.uiuc.edu/Research/namd/utilities/tiny.tar.gz
tar xvf tiny.tar.gz
./namd2 +nostreaming +p4 +devices 0 +platform "Intel(R) Level-Zero" tiny/tiny.namd
Testing a small benchmark system
Running benchmark system ApoA1 (~92k atoms):
wget https://www.ks.uiuc.edu/Research/namd/utilities/apoa1.tar.gz
tar xvf apoa1.tar.gz
cd apoa1
wget https://www.ks.uiuc.edu/Research/namd/2.13/benchmarks/apoa1_nve_cuda.namd
cd ..
export ZE_AFFINITY_MASK=0.0
./namd2 +nostreaming +p8 +devices 0 +platform "Intel(R) Level-Zero" apoa1/apoa1_nve_cuda.namd
Optionally, GPU kernel timing can be collected with the onetrace utility from Intel's PTI-GPU tools (the path shown is installation dependent):
/opt/intel/oneapi/pti-gpu/onetrace/onetrace -d -o namd-timing.txt ./namd2 +nostreaming +p8 +devices 0 +platform "Intel(R) Level-Zero" apoa1/apoa1_nve_cuda.namd
Running larger benchmark systems
Different benchmark systems have been tested with the SYCL/DPC++ build. F1-ATPase has ~327.5k atoms, and STMV has ~1.06M atoms.
Running the F1-ATPase benchmark:
wget https://www.ks.uiuc.edu/Research/namd/utilities/f1atpase.tar.gz
tar xzf f1atpase.tar.gz
./namd2 +nostreaming +p64 +devices 0,1 +splittiles +pemap 4-67 +platform "Intel(R) Level-Zero" f1atpase/f1atpase.namd
Running the STMV benchmark (here --bondedCUDA 0 is passed on the command line to disable offloading of the bonded force terms):
wget https://www.ks.uiuc.edu/Research/namd/utilities/stmv.tar.gz
tar xzf stmv.tar.gz
cd stmv
wget https://www.ks.uiuc.edu/Research/namd/2.13/benchmarks/stmv_nve_cuda.namd
cd ..
./namd2 +nostreaming +p64 +devices 0,1 +splittiles +pemap 4-67 +platform "Intel(R) Level-Zero" --bondedCUDA 0 stmv/stmv_nve_cuda.namd
Building and running multi-node version
Multi-node NAMD requires building a version of Charm++ that supports your parallel computer or cluster network. Shown below is how Charm++ can be built to use the Intel MPI installation:
cd ../charm
CC=icx; CXX=icpx; F90=ifort; F77=ifort; MPICXX=mpiicpc; MPI_CXX=mpiicpc
I_MPI_CC=icx; I_MPI_CXX=icpx; I_MPI_F90=ifort; I_MPI_F77=ifort
export I_MPI_CC I_MPI_CXX I_MPI_F90 I_MPI_F77 CC CXX F90 F77 MPICXX MPI_CXX
./build charm++ mpi-linux-x86_64 smp mpicxx --with-production
cd ..
Once a working multi-node version of Charm++ is built, NAMD needs to be built using that new build of Charm++:
./config Linux-x86_64-dpcpp-AOT.mpi-smp --charm-arch mpi-linux-x86_64-smp-mpicxx
cd Linux-x86_64-dpcpp-AOT.mpi-smp
make -j
Modify the environment for multi-node support:
unset ZE_AFFINITY_MASK
export ZE_FLAT_DEVICE_HIERARCHY=FLAT
Running on 1 PVC GPU (2 tiles):
mpirun -perhost 2 ./namd2 +ppn 31 +pemap 1-31:32.31,57-87:32.31 +commap 0-31:32,56-87:32 +devices 0,1 +platform "Intel(R) Level-Zero" stmv/stmv_nve_cuda.namd
Running on 4 PVC GPUs (8 tiles), one rank per tile:
mpirun -perhost 8 ./namd2 +ppn 7 +pemap 1-31:8.7,57-87:8.7 +commap 0-31:8,56-87:8 +devices 0,1,2,3,4,5,6,7 +platform "Intel(R) Level-Zero" stmv/stmv_nve_cuda.namd
The same 8-tile run launched through a Slurm allocation of one node:
salloc -C IB -N 1 mpirun -perhost 8 ./namd2 +ppn 7 +pemap 1-31:8.7,57-87:8.7 +commap 0-31:8,56-87:8 +devices 0,1,2,3,4,5,6,7 +platform "Intel(R) Level-Zero" stmv/stmv_nve_cuda.namd
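As a sketch for running across more than one node (assuming the same per-node layout of 8 ranks and 8 tiles; the exact mpirun options for spanning nodes depend on your MPI and scheduler setup), the allocation could request additional nodes and the total rank count be given explicitly:

salloc -C IB -N 2 mpirun -n 16 -perhost 8 ./namd2 +ppn 7 +pemap 1-31:8.7,57-87:8.7 +commap 0-31:8,56-87:8 +devices 0,1,2,3,4,5,6,7 +platform "Intel(R) Level-Zero" stmv/stmv_nve_cuda.namd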
Multi-node, multi-process runs of NAMD are best scheduled with one device (tile) per rank (process). When running NAMD in this manner, the scalable parts of the PME algorithm (charge spreading to the grid and force interpolation from the grid) can be executed across the devices. The following options can be added to the config file to control offloading of these calculations (and to prevent the non-scalable PME calculations from being executed on a device):
PMEoffload on ;# offloading the scalable PME calculations requires one rank per device
usePMECUDA off ;# disable offloading the non-scalable parts of PME
Alternatively, the same options can be passed on the command line:
--PMEoffload on --usePMECUDA off
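As an illustration (a sketch based on the single-node STMV command above; the process mapping should be adjusted for your machine), the options could be appended to the command line as follows:

mpirun -perhost 8 ./namd2 +ppn 7 +pemap 1-31:8.7,57-87:8.7 +commap 0-31:8,56-87:8 +devices 0,1,2,3,4,5,6,7 +platform "Intel(R) Level-Zero" --PMEoffload on --usePMECUDA off stmv/stmv_nve_cuda.namd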
This code is still evolving and will be updated as needed. Stay tuned!
For more information about NAMD and support inquiries:
For general NAMD information, see the main NAMD home page http://www.ks.uiuc.edu/Research/namd/
For your convenience, NAMD has been ported to and will be installed on the machines at the NSF-sponsored national supercomputing centers. If you are planning substantial simulation work of an academic nature you should apply for these resources. Benchmarks for your proposal are available at http://www.ks.uiuc.edu/Research/namd/performance.html
The Theoretical and Computational Biophysics Group encourages NAMD users to be closely involved in the development process through reporting bugs, contributing fixes, periodic surveys and via other means. Questions or comments may be directed to namd@ks.uiuc.edu.
We are eager to hear from you, and thank you for using our software!