Re: performance scaling of CUDA accelerated NAMD over multiple nodes

From: Vermaas, Josh
Date: Fri Sep 17 2021 - 07:31:45 CDT

Hi Vlad,

I've run NAMD on 4000 nodes, and it'll scale just fine (although the system was much larger than 500k atoms!). There are a few gotchas involved with multinode GPU NAMD. In no particular order:

1. This is an SMP build, yeah? Straight MPI builds with CUDA support are *possible* to build, but perform terribly relative to their SMP brethren.
2. I've found that each node performs best when each GPU gets its own rank/task that commands dedicated CPUs. On the local resources here at MSU, that looks something like this:

#SBATCH --gres=gpu:4 #4 GPUs per node
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4 #Number of tasks per node should match the number of GPUs
#SBATCH --cpus-per-task=12 #48 CPUs total, means each task gets 12
#SBATCH --gpu-bind=single:1 #Bind each GPU to a single task. Prevents a task from trying to distribute work over multiple GPUs, and lowers PCIe contention

module use /mnt/home/vermaasj/modules
module load NAMD/2.14-gpu
#other modules are loaded automatically by the NAMD module.
srun namd2 +ppn 11 +ignoresharing configfile.namd > logfile.log
#With this setup, NAMD sees 8 logical nodes, 4 from each physical node.
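The +ppn 11 above is just --cpus-per-task minus one, since each SMP process keeps one core for its communication thread. A minimal sketch of deriving it inside the job script, rather than hard-coding it, might look like the following. The helper name namd_ppn is made up for illustration, and the fallback of 12 only mirrors the example above; SLURM_CPUS_PER_TASK is the variable SLURM sets when --cpus-per-task is given.

```shell
# Hypothetical helper (not part of NAMD or SLURM): worker threads per task
# = CPUs per task minus one core left for the communication thread.
namd_ppn() {
    echo $(( $1 - 1 ))
}

# SLURM_CPUS_PER_TASK is set by SLURM when --cpus-per-task is requested.
# The fallback of 12 is an assumption that matches the example above.
ppn=$(namd_ppn "${SLURM_CPUS_PER_TASK:-12}")

# Print the launch line instead of running it, so it can be checked first.
echo "srun namd2 +ppn ${ppn} +ignoresharing configfile.namd > logfile.log"
```

Once the printed line looks right, dropping the echo (and the quotes) turns it into the actual launch command.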

3. Set expectations appropriately. 10 nodes with 4 GPUs each = 40 GPUs. If the only thing you are doing is simulating 500k atoms (no replicas or anything), each GPU is responsible for only ~12.5k atoms (500k / 40). There are two layers of communication for NAMD 2.14 on GPUs: transfers across the PCIe bus between GPU and CPU every timestep, and communication between logical nodes whenever pairlists get recomputed. If there isn't enough work for each GPU to do, those extra communication steps are going to murder performance. TL;DR: at some point the scaling will break down, and for a system that small, it might happen before you think it will.
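To make the arithmetic in point 3 concrete, here is a small shell sketch that prints how much work each GPU gets as nodes are added. The helper name atoms_per_gpu and the list of node counts are made up for illustration; the 500k atoms and 4 GPUs per node come from this thread. It says nothing about where scaling actually breaks down, which still needs benchmarking on the target machine.

```shell
# Hypothetical helper: atoms handled by each GPU, using integer division.
# Arguments: total atoms, number of nodes, GPUs per node.
atoms_per_gpu() {
    echo $(( $1 / ($2 * $3) ))
}

atoms=500000        # system size from this thread
gpus_per_node=4     # node layout from this thread

# Node counts chosen for illustration only.
for nodes in 1 2 4 10; do
    echo "${nodes} node(s): $(( nodes * gpus_per_node )) GPUs, $(atoms_per_gpu "$atoms" "$nodes" "$gpus_per_node") atoms per GPU"
done
```

The per-GPU atom count shrinks in direct proportion to the node count, while the communication described above does not, which is why a small system stops scaling early.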

4. If the simulations you are planning are going to be regular equilibrium simulations, NAMD3 will likely be faster on modern hardware, as it eliminates CPU-GPU communication at most timesteps.


On 9/17/21, 6:44 AM, Vlad Cojocaru wrote:

    Dear all,

    We have been doing some tests with the CUDA (11 I believe) accelerated
    version of NAMD 2.14 on a remote supercomputer. On 1 node (96 threads, 4
    GPUs), we see a 10 fold acceleration compared to a non-CUDA NAMD 2.14.
    There is decent scaling between 1 and 2 GPUs, but almost no scaling
    from 2 to 4 GPUs. The simulated time per day (classical MD) for 500K
    atoms is similar to what is expected (comparable to what is published
    on the NAMD website).

    However, for a large-scale project, the supercomputer site requires
    scaling up to at least 10 nodes, and we are not able to get any scaling
    beyond 1 node. In fact, as soon as we run on 2 nodes (with 4 GPUs
    each), the performance is worse than on a single node.

    I know that lots of details are needed to actually pinpoint the
    issue(s), many of these are architecture dependent and we do not have
    all these details.

    However, I would still like to ask in general if any of you has
    routinely managed to scale up the performance of the CUDA accelerated
    NAMD 2.14 with the number of nodes. And if yes, are there any general
    tips and tricks that could be tried ?

    Thank you for any insights !

    Vlad Cojocaru, PD (Habil.), Ph.D.
    Project Group Leader
    Department of Cell and Developmental Biology
    Max Planck Institute for Molecular Biomedicine
    Röntgenstrasse 20, 48149 Münster, Germany
    Tel: +49-251-70365-324; Fax: +49-251-70365-399
    Email: vlad.cojocaru[at]
