Re: Getting high performance on multi-copy (replica) GPU simulations

From: Josh Vermaas (joshua.vermaas_at_gmail.com)
Date: Tue Sep 22 2020 - 13:04:26 CDT

Hi Victor,

In my experience, the simpler slurm arguments are usually better. Have
you tried the simplest option yet? From
https://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdOnSLURM, I
think you'd do something like this:

#!/bin/bash
#SBATCH -N 1
#SBATCH --gres=gpu:2
#SBATCH -n 8
export NAMDDIR=/some/path/that/I/dont/know
$NAMDDIR/charmrun ++n 2 ++ppn 4 ++mpiexec ++remote-shell srun \
    $NAMDDIR/namd2 +replicas 2 +devicesperreplica 1 run.conf \
    +stdout output/%d/log_rest.%d.log
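
As a quick sanity check (this is just my own suggestion, not something from the
wiki page), you can attach a one-off step to the running allocation and confirm
that each namd2 process really landed on its own GPU rather than both piling
onto device 0:

# <jobid> is whatever SLURM assigned to the running job; plain nvidia-smi
# lists the compute processes and which GPU each one is attached to
srun --jobid=<jobid> -N 1 -n 1 nvidia-smi

If both replicas show up on the same device, the +devicesperreplica mapping
isn't doing what you want and the slowdown is no surprise.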

My expectation is that this would run at ~50% of the speed of the 1
replica per node case, since the integration is still happening on the
CPU, and is often the bottleneck in NAMD 2.X builds. For NAMD 3.0, you
might be able to better use your hardware, but I'd first read up on
https://developer.nvidia.com/blog/delivering-up-to-9x-throughput-with-namd-v3-and-a100-gpu/
to understand what some of the differences are between NAMD 3 and 2.X.
You'd particularly care about Fig. 11, since total throughput scales linearly
with the number of GPUs only with NAMD 3, not with NAMD 2.14.
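
If you do try NAMD 3, my recollection from that blog post is that the
GPU-resident integrator has to be switched on in the config, and that frequent
energy output stalls it. A minimal sketch along those lines (the keyword names
are from the blog post, not verified against your build, so double-check them
against the NAMD 3 alpha release notes):

# append NAMD 3 specific settings to a copy of the replica config;
# keyword spellings taken from the NVIDIA blog post, verify against your build
cat >> run_namd3.conf <<'EOF'
# GPU-resident integration (NAMD 3 only)
CUDASOAintegrate   on
# write energies less often so the GPU is not stalled by host-side output
outputEnergies     5000
EOF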

-Josh

On 9/22/20 10:12 AM, Victor Zhao wrote:
> Hello,
>
> I’ve been running replica exchange simulations on GPU with NAMD (NAMD_2.14_Linux-x86_64-netlrts-smp-CUDA). I am running in a computing cluster environment with SLURM. I am currently able to get good performance—days/ns comparable to single-GPU performance—when I use 1 GPU per compute node. That would mean 8 nodes for 8 replicas, for instance. In my case, each of my nodes actually has 2 GPUs. If I run multiple replica exchange simulations, I can still efficiently use my GPUs because jobs will be assigned to nodes in a round robin sort of fashion. E.g. my first job will use the first GPU of several nodes, and the second job can use the second GPU of those nodes.
>
> But I cannot find the right way to launch my replica exchange simulations so that I can use more than 1 GPU per node and have high performance (still want to run 1 GPU per replica). This is relevant to me because there are other GPU computing resources where my current way of running would not be viable (due to computing resources being shared).
>
> Each time I try out SLURM and NAMD options to use more than 1 GPU/node, I get bad performance.
>
> Here’s what works for 2-replica test simulations (1 GPU/node):
>
> ```
> sbatch -N 2 --gres=gpu:1 --mem-per-cpu 500 --ntasks-per-node 1 --cpus-per-task 8 submit_namd.sh
> # where submit_namd.sh contains
> $NAMDDIR/charmrun ++mpiexec +p 16 ++ppn 8 $NAMDDIR/namd2 +setcpuaffinity +replicas 2 run.conf +stdout output/%d/log_rest.%d.log
> ```
>
> Here’s what gives poor performance (2 GPU/node, 2-replica test simulation)
> ```
> sbatch -N 1 --gres=gpu:2 --mem-per-cpu 500 --ntasks 2 --cpus-per-task 4 submit_namd.sh
> # where submit_namd.sh contains
> $NAMDDIR/charmrun ++mpiexec +p 8 ++ppn 4 $NAMDDIR/namd2 +setcpuaffinity +replicas 2 +devicesperreplica 1 run.conf +stdout output/%d/log_rest.%d.log
> ```
>
> The latter performs at about 5% of the speed of the former! Surprisingly, calling nvidia-smi on the running node shows that both GPUs are in use, with a different process ID assigned to each GPU (and low GPU utilization).
>
> I’ve tried many different combinations. Some of them don’t even run. The ones that do give poor performance. Does anyone have any tips on running?
>
> Best,
> Victor
>
>
