Re: performance scaling of CUDA accelerated NAMD over multiple nodes

From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Sat Sep 18 2021 - 10:52:51 CDT

I believe that is highly dependent on which GPU models you are using and
how they're connected to the CPUs. With modern GPUs, raw compute capacity
is not the limiting factor; the speed of the connection to the CPUs matters
whenever the app needs to do something that it doesn't have GPU code for.
PCIe will be a lot slower than NVLink, regardless of CPU and GPU model.
I'd search the mailing lists of NAMD and of other codes (e.g. GROMACS, or
LAMMPS with the GPU package), because this issue has been discussed many times.
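
As a quick check (assuming you have shell access to a compute node and the
NVIDIA driver tools are installed), you can print the interconnect topology
with:

nvidia-smi topo -m

In its output, "NV#" entries mark NVLink links between devices, while
"PHB", "PIX" or "SYS" mark PCIe/system paths.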

One way to circumvent the above limitation is to have everything run on the
GPU itself (short- and long-range forces, thermostats, MD integration,
etc.), which is what NAMD 3 does (as do AMBER, OpenMM, and LAMMPS with
the Kokkos package). Needless to say, this scheme takes more
implementation work, because now you need GPU code for nearly every
feature you're using (so the folks at UIUC have begun by re-coding the most
common ones in CUDA). But it does pay off greatly in speed. This
particular point is also discussed fairly well in the new NAMD paper.
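
If you want to try that route, here is a minimal sketch (assuming a NAMD 3
alpha/beta binary called namd3 on your PATH; CUDASOAintegrate is, to my
understanding, the switch that enables the GPU-resident integrator):

# enable the GPU-resident code path in an otherwise ordinary NAMD config
echo "CUDASOAintegrate on" >> run.namd
# in this mode a single CPU core per GPU is typically enough
namd3 +p1 +devices 0 run.namd > run.log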

TL;DR: because GPUs are more specialized and less flexible than CPUs, the
application can't rebalance the workload as easily, and predicting the
performance from little information is a lot harder.

Giacomo

On Sat, Sep 18, 2021 at 8:00 AM Vlad Cojocaru <
vlad.cojocaru_at_mpi-muenster.mpg.de> wrote:

> Thanks Giacomo,
>
> I forwarded your email to the HPC site. They are very helpful and will try
> to see if we can get rid of the bottleneck by rebuilding NAMD.
>
> But coming back to Josh's point about the low workload on the GPU: this
> may be a valid point. Of course, we will run replicates of those simulations,
> so one way to circumvent the issue is to run at least 10 simulations in
> parallel, each of them using one node. ...
>
> But I am wondering ... Within one simulation, what would be the minimum
> number of atoms allocated to 1 GPU to have a decent workload on the GPU?
>
> Best
> Vlad
>
>
> On 9/17/21 15:41, Giacomo Fiorin wrote:
>
> Hi Vlad, it seems that they recommended you use the UCX backend for
> Charm++. This is a good idea, because the UCX middleware can
> help significantly in optimizing the communication paths in a complex
> or congested network. *However,* Charm++ is still in its early stages of
> adopting UCX, and there are several things to look at when building:
> https://charm.readthedocs.io/en/latest/charm++/manual.html#ucx
>
> One possibility in your case is that the wrong PMI (*distinct* from MPI)
> is being picked up, and the processes or threads aren't being
> allocated correctly.
>
> If you're confident that you really do need multi-node (see Josh's good
> suggestion about NAMD 3), I suggest the following (a sketch of the
> corresponding commands follows the list):
> 1. Build and test NAMD with Charm++/UCX without SMP and without CUDA.
> Between 1 and 2 nodes you ought to get linear scaling for 500k atoms.
> 2. Build and test NAMD with Charm++/UCX with SMP but still without CUDA.
> You should see a slight drop in the whole scaling curve, because the SMP
> backend runs communication and compute in separate threads and wants
> dedicated cores for each. But SMP is also what works best when throwing in
> the GPUs.
> 3. Build and test NAMD with both SMP and CUDA.
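>
> A rough sketch of the three builds (the exact --charm-arch string must
> match the directory name that Charm++ actually creates, and the Tcl/FFTW
> options from your site's recipe still apply to each ./config call):
>
> # 1. non-SMP, no CUDA
> ./build charm++ ucx-linux-x86_64 icc --with-production
> ./config Linux-x86_64-icc --charm-arch ucx-linux-x86_64-icc
> # 2. SMP, no CUDA
> ./build charm++ ucx-linux-x86_64 icc smp --with-production
> ./config Linux-x86_64-icc --charm-arch ucx-linux-x86_64-smp-icc
> # 3. SMP + CUDA
> ./config Linux-x86_64-icc --charm-arch ucx-linux-x86_64-smp-icc --with-cuda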
>
> As a minor suggestion, also try to verify that the recommended AVX
> instructions are actually paying off. Not all CPUs have a high enough clock
> rate in their AVX units to make them worth using compared to standard
> floating-point ops.
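>
> For example (a sketch; the log names below are placeholders), compare two
> otherwise identical builds, one compiled with -march=core-avx2 and one
> without:
>
> # confirm the CPU advertises AVX2 at all
> lscpu | grep -o avx2
> # then compare NAMD's reported timings from the two builds
> grep "Benchmark time" apoa1_avx2.log apoa1_noavx2.log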
>
> I don't know the level of support offered by the HPC staff of that
> facility. But in general it is a good rule to treat building a code as an
> experiment that needs verification at every step of the way (something we
> scientists are accustomed to, or ought to be).
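>
> Concretely (a sketch; adapt the launcher, task counts and benchmark inputs
> to your site), after each build I would run a short standard benchmark on
> 1 and then 2 nodes and compare NAMD's reported timings:
>
> srun -N 1 --ntasks-per-node=4 ./namd2 +ppn 11 apoa1/apoa1.namd > apoa1_1node.log
> srun -N 2 --ntasks-per-node=4 ./namd2 +ppn 11 apoa1/apoa1.namd > apoa1_2nodes.log
> grep "Benchmark time" apoa1_1node.log apoa1_2nodes.log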
>
> Giacomo
>
>
> On Fri, Sep 17, 2021 at 9:13 AM Vlad Cojocaru <
> vlad.cojocaru_at_mpi-muenster.mpg.de> wrote:
>
>> Hi Josh
>>
>> Thanks a lot for sharing this. I don't have experience with running the
>> GPU NAMD, this is the first time I actually decided to test it thoroughly.
>>
>> I think the NAMD build was not a pure SMP build. Maybe this is where the
>> problems come from in the first place .... I share below the build
>> procedure recommended by the HPC site. If you immediately spot anything
>> problematic, I could try building NAMD again.
>>
>> I will also share your email with the support team at the HPC site.
>>
>> Best wishes
>> Vlad
>>
>> #### NAMD build procedure ####
>>
>> module load Intel ParaStationMPI FFTW Tcl
>> tar -xf NAMD_2.14_Source.tar.gz
>> cd NAMD_2.14_Source
>> tar -xf charm-6.10.2.tar
>> cd charm-6.10.2
>> ./build charm++ ucx-linux-x86_64 icc smp --with-production
>> cd ..
>> ./config Linux-x86_64-icc --charm-arch ucx-linux-x86_64-smp-icc \
>>     --with-tcl --tcl-prefix $EBROOTTCL \
>>     --with-fftw3 --fftw-prefix $EBROOTFFTW \
>>     --with-cuda \
>>     --cc "mpicc" --cc-opts "-O3 -march=core-avx2 -ftz -fp-speculation=safe -fp-model source -fPIC -std=c++11" \
>>     --cxx "mpicxx" --cxx-opts "-O3 -march=core-avx2 -ftz -fp-speculation=safe -fp-model source -fPIC -std=c++11"
>> cd Linux-x86_64-icc
>> echo "TCLLIB=-L\$(EBROOTTCL)/lib -ltcl8.6 -ldl -lpthread" >> Make.config
>> echo "COPTS+=-DNAMD_DISABLE_SSE" >> Make.config
>> echo "CXXOPTS+=-DNAMD_DISABLE_SSE" >> Make.config
>> make
>>
>>
>>
>>
>> On 9/17/21 14:31, Vermaas, Josh wrote:
>>
>> Hi Vlad,
>>
>> I've run NAMD on 4000 nodes, and it'll scale just fine (although the system was much larger than 500k atoms!). There are a few gotchas involved with multinode GPU NAMD. In no particular order:
>>
>> 1. This is an SMP build, yeah? Straight MPI builds with CUDA support are *possible* to build, but perform terribly relative to their SMP brethren.
>> 2. I've found that each node performs best when each GPU gets its own rank/task that commands dedicated CPUs. On the local resources here at MSU, that looks something like this:
>>
>> #!/bin/bash
>> #SBATCH --gres=gpu:4 #4 GPUs per node
>> #SBATCH --nodes=2
>> #SBATCH --ntasks-per-node=4 #Number of tasks per node should match the number of GPUs
>> #SBATCH --cpus-per-task=12 #48 CPUs total, means each task gets 12
>> #SBATCH --gpu-bind=single:1 #Bind the GPU to a single task. Prevents a CPU from trying to distribute work over multiple GPUs, and lowers PCIe contention
>>
>> cd $SLURM_SUBMIT_DIR
>> module use /mnt/home/vermaasj/modules
>> module load NAMD/2.14-gpu
>> #other modules are loaded automatically by the NAMD module.
>> srun namd2 +ppn 11 +ignoresharing configfile.namd > logfile.log
>> #With this setup, NAMD sees 8 logical nodes, 4 from each physical node.
>>
>> 3. Set expectations appropriately. 10 nodes with 4 GPUs each = 40 GPUs. If the only thing you were doing is simulating 500k atoms (no replicas or anything), each GPU is responsible for ~10k atoms. There are two layers of communication for NAMD 2.14 on GPUs: transfers across the PCIe bus between GPU and CPU every timestep, and communication between logical nodes whenever pairlists get recomputed. If there isn't enough work for each GPU to do, those extra communication steps are going to murder performance. TL;DR: at some point the scaling will break down, and for a system that small, it might happen before you think it will.
>>
>> 4. If the simulations you are planning are going to be regular equilibrium simulations, NAMD3 will likely be faster on modern hardware, as it eliminates CPU-GPU communication at most timesteps.
>>
>> -Josh
>>
>> On 9/17/21, 6:44 AM, "owner-namd-l_at_ks.uiuc.edu on behalf of Vlad Cojocaru" <vlad.cojocaru_at_mpi-muenster.mpg.de> wrote:
>>
>> Dear all,
>>
>> We have been doing some tests with the CUDA-accelerated (CUDA 11, I
>> believe) version of NAMD 2.14 on a remote supercomputer. On 1 node (96
>> threads, 4 GPUs), we see a 10-fold acceleration compared to a non-CUDA
>> NAMD 2.14. There is decent scaling between 1 and 2 GPUs, but from 2 to 4
>> GPUs almost no scaling. The simulation (classical MD) time per day for
>> 500K atoms is similar to what is expected (comparable to what is
>> published on the NAMD website).
>>
>> However, for a large-scale project, the supercomputer site requires
>> scaling up to at least 10 nodes. And we are not able to get any scaling
>> beyond 1 node. In fact, as soon as we run on 2 nodes (with 4 GPUs
>> each), the performance gets worse than on a single node.
>>
>> I know that lots of details are needed to actually pinpoint the
>> issue(s); many of these are architecture dependent, and we do not have
>> all these details.
>>
>> However, I would still like to ask in general if any of you has
>> routinely managed to scale up the performance of the CUDA-accelerated
>> NAMD 2.14 with the number of nodes. And if yes, are there any general
>> tips and tricks that could be tried?
>>
>> Thank you for any insights !
>> Vlad
>>
>> --
>> Vlad Cojocaru, PD (Habil.), Ph.D.
>> -----------------------------------------------
>> Project Group Leader
>> Department of Cell and Developmental Biology
>> Max Planck Institute for Molecular Biomedicine
>> Röntgenstrasse 20, 48149 Münster, Germany
>> -----------------------------------------------
>> Tel: +49-251-70365-324; Fax: +49-251-70365-399
>> Email: vlad.cojocaru[at]mpi-muenster.mpg.de
>> http://www.mpi-muenster.mpg.de/43241/cojocaru
>>
>>
>>
>> --
>> Vlad Cojocaru, PD (Habil.), Ph.D.
>> -----------------------------------------------
>> Project Group Leader
>> Department of Cell and Developmental Biology
>> Max Planck Institute for Molecular Biomedicine
>> Röntgenstrasse 20, 48149 Münster, Germany
>> -----------------------------------------------
>> Tel: +49-251-70365-324; Fax: +49-251-70365-399
>> Email: vlad.cojocaru[at]mpi-muenster.mpg.de
>> http://www.mpi-muenster.mpg.de/43241/cojocaru
>>
>>
> --
> Vlad Cojocaru, PD (Habil.), Ph.D.
> -----------------------------------------------
> Project Group Leader
> Department of Cell and Developmental Biology
> Max Planck Institute for Molecular Biomedicine
> Röntgenstrasse 20, 48149 Münster, Germany
> -----------------------------------------------
> Tel: +49-251-70365-324; Fax: +49-251-70365-399
> Email: vlad.cojocaru[at]mpi-muenster.mpg.de
> http://www.mpi-muenster.mpg.de/43241/cojocaru
>
>
