NAMD 2.15 Release Notes for Intel(R) Max Series GPU support
 
Intel GPUs are supported with a new code path implemented in SYCL
(https://www.khronos.org/sycl/) together with Intel's oneAPI toolkit
(https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html),
which is used to build the SYCL code and provides some additional library
support.

NAMD's SYCL code presently provides GPU-offload functionality, with
kernels implemented for the non-bonded and bonded force terms and for
PME.  For now, the existing NAMD simulation parameters, e.g., bondedCUDA,
usePMECUDA, and PMEoffload, are reused to control SYCL kernel behavior in
the same way.  Development work to port NAMD's GPU-resident CUDA kernels
is ongoing.
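
For example, these parameters might appear in a simulation config file as
follows (a minimal sketch with illustrative values, not tuning
recommendations):

  usePMECUDA on  ;# offload the non-scalable parts of PME
  PMEoffload off ;# keep scalable PME (spreading/interpolation) on the host
  bondedCUDA 0   ;# keep bonded force terms on the host (see examples below)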

Building the code requires setting up Intel's oneAPI.  Download the oneAPI 
base toolkit to obtain the DPC++ compiler (dpcpp), oneMKL, and oneDPL:
https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html
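
After installation, the oneAPI environment is typically initialized by
sourcing the setvars.sh script (the path below assumes the default
system-wide installation location):

  source /opt/intel/oneapi/setvars.sh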

NAMD has been tested with Intel oneAPI 2023.2.1, oneAPI 2024.0,
agama-ci-devel/736.25, Ubuntu 22.04 on Intel(R) Data Center GPU Max 1550, 
Intel(R) Data Center GPU Max 1100.

NAMD's SYCL code is available in two forms.  There is a source tar ball 
NAMD_2.15a3_Source-IntelGPU.tgz available from the download web page. 
Go to

  https://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD

and under the selection labeled 

  Version 2.15 ALPHA Release

click on the link

  Linux-x86_64-multicore-IntelGPU

After either registering or logging in (if already a registered user),
the link leads to an "alpha" web page supporting SYCL/DPC++ builds of
NAMD for running on Intel GPUs.

An alternative way to get source code is through NAMD's repository hosted 
on GitLab.  Obtaining access to the repository requires registering at
https://www.ks.uiuc.edu/Research/namd/gitlabrequest.html; enabling access
may take up to 24 hours.  Once access has been enabled, the code is
retrieved as follows: 

  git clone https://gitlab.com/tcbgUIUC/namd.git
  cd namd
  git checkout oneapi-forces

The tar ball is simply a snapshot of this "oneapi-forces" branch: 

  tar xzf NAMD_2.15a3_Source-IntelGPU.tgz
  cd namd

Building a complete NAMD binary from source code requires:

- a compiled version of the Charm++/Converse library;
- a compiled version of the TCL library and its header files;
- a compiled version of the FFTW library and its header files;
- a C shell (csh/tcsh) to run the script used to configure the build.
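
A quick check that the C shell needed by the config script is available
(a simple sanity check; install csh or tcsh via your distribution's
package manager if nothing is found):

  which csh tcsh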

Precompiled TCL and FFTW libraries are available from
http://www.ks.uiuc.edu/Research/namd/libraries/.  From within the base 
level namd directory, either from the tar ball or from cloning the repo, 
issue the following commands:

  wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-x86_64-threaded.tar.gz
  tar xzf tcl8.5.9-linux-x86_64-threaded.tar.gz
  ln -s tcl8.5.9-linux-x86_64-threaded tcl
  
  wget http://www.ks.uiuc.edu/Research/namd/libraries/fftw-linux-x86_64.tar.gz
  tar xzf fftw-linux-x86_64.tar.gz
  ln -s linux-x86_64 fftw

With the directory links named "tcl" and "fftw" in place, the build
system automatically configures itself to use these libraries.
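
Before configuring, the links can be checked with a quick listing
(optional; this only confirms the symlinks resolve to the unpacked
directories):

  ls -l tcl fftw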

Charm++ built for NAMD/SYCL has additional requirements beyond the 
version 7.0.0 release, so it is recommended to build the latest version
from GitHub.  Do the following from the namd base level directory:

  git clone  https://github.com/UIUC-PPL/charm.git
  cd charm/
  ./build charm++ multicore-linux-x86_64 icx -j --with-production
  cd ..

The recommended NAMD build arch file for SYCL/DPC++ is
arch/Linux-x86_64-dpcpp-AOT.arch (AOT for ahead-of-time), which compiles
the SYCL kernels ahead of time.  In the base level NAMD
directory do:

  ./config Linux-x86_64-dpcpp-AOT --charm-arch multicore-linux-x86_64-icx
  cd Linux-x86_64-dpcpp-AOT
  make -j

NAMD runs a simulation config file (usually with a .namd or .conf
extension), loading the input data files it references and then performing
the molecular dynamics run.
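
For reference, a config file consists of keyword-value lines such as the
following (a minimal sketch with hypothetical file names, not a tuned
input; the bundled and downloadable inputs used below are complete
examples):

  structure        mysystem.psf      ;# hypothetical PSF structure file
  coordinates      mysystem.pdb      ;# hypothetical PDB coordinate file
  paraTypeCharmm   on
  parameters       par_all36_prot.prm
  temperature      300
  exclude          scaled1-4
  cutoff           12.0
  switching        on
  switchdist       10.0
  pairlistdist     14.0
  timestep         1.0
  outputName       mysystem_out
  numsteps         500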
The following examples demonstrate running NAMD.

Running smoke tests on alanin (66 atoms):

Test explicit scaling on a 2-tile GPU device (run inside the build
directory; the +splittiles argument is critical for correct operation
here).  To use the +splittiles option, the environment variable
ZE_FLAT_DEVICE_HIERARCHY must be set to COMPOSITE:
  
  export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
  ./namd2 +nostreaming +p4 +devices 0,1 +splittiles +platform "Intel(R) Level-Zero" src/alanin

Test a single tile or implicit scaling on the GPU device (run inside the
configured build directory, e.g., Linux-x86_64-dpcpp-AOT):

  ./namd2 +nostreaming +p4 +devices 0 +platform "Intel(R) Level-Zero" src/alanin

Test ONLY a single tile on the GPU device (run inside the configured
build directory, e.g., Linux-x86_64-dpcpp-AOT):

  export ZE_AFFINITY_MASK=0.0
  ./namd2 +nostreaming +p4 +devices 0 +platform "Intel(R) Level-Zero" src/alanin

Running debug system "tiny" (507 atoms):

  wget  https://www.ks.uiuc.edu/Research/namd/utilities/tiny.tar.gz
  tar xvf tiny.tar.gz
  ./namd2 +nostreaming +p4 +devices 0 +platform "Intel(R) Level-Zero" tiny/tiny.namd
  
Running benchmark system ApoA1 (~92k atoms):

  wget https://www.ks.uiuc.edu/Research/namd/utilities/apoa1.tar.gz
  tar xvf apoa1.tar.gz

The apoa1_nve_cuda.namd configuration file is better suited for leveraging
GPU hardware than the config file included in the archive. 

  cd apoa1
  wget https://www.ks.uiuc.edu/Research/namd/2.13/benchmarks/apoa1_nve_cuda.namd
  cd ..

Testing ApoA1 on a single ATS/PVC tile with 8 host cores, using the
Level-Zero backend:

  export ZE_AFFINITY_MASK=0.0
  ./namd2 +nostreaming +p8 +devices 0 +platform "Intel(R) Level-Zero" apoa1/apoa1_nve_cuda.namd

Collecting the non-bonded force kernel time:

  /opt/intel/oneapi/pti-gpu/onetrace/onetrace -d -o namd-timing.txt ./namd2 +nostreaming +p8 +devices 0 +platform "Intel(R) Level-Zero" apoa1/apoa1_nve_cuda.namd

Running larger benchmark systems:

Different benchmark systems have been tested with the SYCL/DPC++ build. 
F1-ATPase has ~327.5k atoms, and STMV has ~1.06M atoms. 

Running the F1-ATPase benchmark:

  wget https://www.ks.uiuc.edu/Research/namd/utilities/f1atpase.tar.gz
  tar xzf f1atpase.tar.gz
  ./namd2 +nostreaming +p64 +devices 0,1 +splittiles +pemap 4-67 +platform "Intel(R) Level-Zero" f1atpase/f1atpase.namd

Running the STMV benchmark:

  wget https://www.ks.uiuc.edu/Research/namd/utilities/stmv.tar.gz
  tar xzf stmv.tar.gz

The stmv_nve_cuda.namd configuration file is better suited for leveraging
GPU hardware than the config file included in the archive. 

  cd stmv
  wget https://www.ks.uiuc.edu/Research/namd/2.13/benchmarks/stmv_nve_cuda.namd
  cd ..

Performance is improved by offloading only the non-bonded force kernels and 
keeping the bonded calculations on the CPU (--bondedCUDA 0). The following 
runs multicore NAMD on 2 tiles:

  ./namd2 +nostreaming +p64 +devices 0,1 +splittiles +pemap 4-67 +platform "Intel(R) Level-Zero" --bondedCUDA 0 stmv/stmv_nve_cuda.namd

Building and running multi-node version:

Multi-node NAMD requires building a version of Charm++ that supports your 
parallel computer or cluster network. Shown below is how Charm++ can be 
built to use the Intel MPI installation:

  cd ../charm
  CC=icx; CXX=icpx; F90=ifort; F77=ifort; MPICXX=mpiicpc; MPI_CXX=mpiicpc
  I_MPI_CC=icx; I_MPI_CXX=icpx; I_MPI_F90=ifort; I_MPI_F77=ifort
  export I_MPI_CC I_MPI_CXX I_MPI_F90 I_MPI_F77 CC CXX F90 F77 MPICXX MPI_CXX
  ./build charm++ mpi-linux-x86_64 smp mpicxx --with-production
  cd ..

Once a working multi-node version of Charm++ is built, NAMD needs to be 
built using that new build of Charm++:

  ./config Linux-x86_64-dpcpp-AOT.mpi-smp --charm-arch mpi-linux-x86_64-smp-mpicxx
  cd Linux-x86_64-dpcpp-AOT.mpi-smp
  make -j

Modify the environment for multi-node support:

  unset ZE_AFFINITY_MASK
  export ZE_FLAT_DEVICE_HIERARCHY=FLAT

Setting this environment variable to "FLAT" exposes each tile as a
separate device, so the "+splittiles" option should no longer be used.
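
To confirm how many devices are exposed under the selected hierarchy
mode, the sycl-ls utility shipped with oneAPI can be used; with FLAT each
tile should be listed as its own Level-Zero device:

  sycl-ls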

Running on 1 PVC GPU (2 tiles):

  mpirun -perhost 2 ./namd2 +ppn 31 +pemap 1-31:32.31,57-87:32.31 +commap 0-31:32,56-87:32 +devices 0,1 +platform "Intel(R) Level-Zero" stmv/stmv_nve_cuda.namd

Running on 4 PVC GPUs (8 tiles):

  mpirun -perhost 8 ./namd2 +ppn 7 +pemap 1-31:8.7,57-87:8.7 +commap 0-31:8,56-87:8 +devices 0,1,2,3,4,5,6,7 +platform "Intel(R) Level-Zero" stmv/stmv_nve_cuda.namd

Running on multiple nodes with 4 PVC GPUs (8 tiles) on each node:

  salloc -C IB -N 1 mpirun -perhost 8 ./namd2 +ppn 7 +pemap 1-31:8.7,57-87:8.7 +commap 0-31:8,56-87:8  +devices 0,1,2,3,4,5,6,7 +platform "Intel(R) Level-Zero" stmv/stmv_nve_cuda.namd

Multi-node/multi-process runs of NAMD are best scheduled with one device
(tile) per rank (process).  When running NAMD in this manner, the scalable
parts of the PME algorithm (charge spreading to the grid and force
interpolation from the grid) can be executed across the devices.  The
following options can be added to the config file to offload these
calculations (and to prevent the non-scalable PME calculations from being
executed on a device):

  PMEoffload on  ;# scalable PME calculations require one rank per device
  usePMECUDA off ;# disable offloading the non-scalable parts of PME

These (and other) options can also be set when launching NAMD:

  --PMEoffload on --usePMECUDA off
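
For example, combining these options with the 4-GPU (8-tile) launch shown
above gives a command line like the following (a sketch that reuses the
same CPU/device mapping; adjust +pemap/+commap for your host):

  mpirun -perhost 8 ./namd2 +ppn 7 +pemap 1-31:8.7,57-87:8.7 +commap 0-31:8,56-87:8 +devices 0,1,2,3,4,5,6,7 +platform "Intel(R) Level-Zero" --PMEoffload on --usePMECUDA off stmv/stmv_nve_cuda.namd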

