Re: Getting high performance on multi-copy (replica) GPU simulations

From: Victor Zhao (yzhao01_at_g.harvard.edu)
Date: Wed Sep 23 2020 - 00:04:27 CDT

Hi Josh,

Thanks for your help. I think ++remote-shell srun was what got things going. In prior attempts, I didn’t include that argument, and NAMD would report “Multiple PEs assigned to same core” and recommend +setcpuaffinity. After I added +setcpuaffinity, the job would run, but performance was poor. Now, with your suggested sbatch and charmrun arguments, these replica exchange simulations run as fast as either a single-GPU equilibrium simulation or the replica exchange simulations where I used 1 GPU/node. I also started testing the NAMD 3.0 alpha 6 build today, and performance there is nearly double when I use V100 GPUs (though it does not improve on the much older K20m GPUs).

In my setup, CPU-side integration doesn’t bottleneck me, since each compute node has at least 16 cores. Running 8 CPUs/GPU gives good performance (the gain from going to 16 CPUs/GPU is small). In my prior 1 GPU/node use case, a compute node would still have both GPUs working, just each on a different replica exchange job, and I didn’t observe any bottleneck then either.
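
For reference, a minimal sketch of the kind of submission script that now works for me on a single node with 2 GPUs (the NAMD path is a placeholder, and the thread counts reflect the 8 CPUs/GPU mentioned above, so they may need adjusting for other hardware):

```
#!/bin/bash
#SBATCH -N 1
#SBATCH --gres=gpu:2
#SBATCH -n 2
#SBATCH --cpus-per-task=8
# Placeholder path to the NAMD 2.14 netlrts-smp-CUDA build
export NAMDDIR=/path/to/NAMD_2.14_Linux-x86_64-netlrts-smp-CUDA
# One process per replica/GPU, 8 worker threads per process, launched via srun
$NAMDDIR/charmrun ++n 2 ++ppn 8 ++mpiexec ++remote-shell srun \
    $NAMDDIR/namd2 +replicas 2 +devicesperreplica 1 run.conf \
    +stdout output/%d/log_rest.%d.log
```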

Best,
Victor

> On Sep 22, 2020, at 2:04 PM, Josh Vermaas <joshua.vermaas_at_gmail.com> wrote:
>
> Hi Victor,
>
> In my experience, the simpler slurm arguments are usually better. Have
> you tried the simplest option yet? From
> https://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdOnSLURM, I
> think you'd do something like this:
>
> #!/bin/bash
> #SBATCH -N 1
> #SBATCH --gres=gpu:2
> #SBATCH -n 8
> export NAMDDIR=/some/path/that/I/dont/know
> $NAMDDIR/charmrun ++n 2 ++ppn 4 ++mpiexec ++remote-shell srun \
>   $NAMDDIR/namd2 +replicas 2 +devicesperreplica 1 run.conf \
>   +stdout output/%d/log_rest.%d.log
>
> My expectation is that this would run at ~50% of the speed of the 1
> replica per node case, since the integration is still happening on the
> CPU and is often the bottleneck in NAMD 2.X builds. For NAMD 3.0, you
> might be able to make better use of your hardware, but I'd first read up on
> https://developer.nvidia.com/blog/delivering-up-to-9x-throughput-with-namd-v3-and-a100-gpu/
> to understand some of the differences between NAMD 3 and 2.X.
> You'd particularly care about Fig. 11, since your total throughput
> grows linearly with the number of GPUs only with NAMD 3, not NAMD 2.14.
>
> -Josh
>
>
> On 9/22/20 10:12 AM, Victor Zhao wrote:
>> Hello,
>>
>> I’ve been running replica exchange simulations on GPUs with NAMD (NAMD_2.14_Linux-x86_64-netlrts-smp-CUDA) on a computing cluster managed by SLURM. I currently get good performance (days/ns comparable to single-GPU performance) when I use 1 GPU per compute node; that means 8 nodes for 8 replicas, for instance. In my case, each node actually has 2 GPUs. If I run multiple replica exchange simulations, I can still use my GPUs efficiently because jobs get assigned to nodes in a round-robin fashion: e.g., my first job uses the first GPU of several nodes, and a second job can use the second GPU of those nodes.
>>
>> But I haven’t found the right way to launch my replica exchange simulations so that I can use more than 1 GPU per node and still get high performance (I still want to run 1 GPU per replica). This matters to me because on other GPU computing resources, which are shared, my current way of running would not be viable.
>>
>> Each time I try SLURM and NAMD options that use more than 1 GPU/node, I get poor performance.
>>
>> Here’s what works for 2-replica test simulations (1 GPU/node):
>>
>> ```
>> sbatch -N 2 --gres=gpu:1 --mem-per-cpu 500 --ntasks-per-node 1 --cpus-per-task 8 submit_namd.sh
>> # where submit_namd.sh contains
>> $NAMDDIR/charmrun ++mpiexec +p 16 ++ppn 8 $NAMDDIR/namd2 +setcpuaffinity +replicas 2 run.conf +stdout output/%d/log_rest.%d.log
>> ```
>>
>> Here’s what gives poor performance (2 GPU/node, 2-replica test simulation)
>> ```
>> sbatch -N 1 --gres=gpu:2 --mem-per-cpu 500 --ntasks 2 --cpus-per-task 4 submit_namd.sh
>> # where submit_namd.sh contains
>> $NAMDDIR/charmrun ++mpiexec +p 8 ++ppn 4 $NAMDDIR/namd2 +setcpuaffinity +replicas 2 +devicesperreplica 1 run.conf +stdout output/%d/log_rest.%d.log
>> ```
>>
>> The latter runs at about 5% of the speed of the former! Surprisingly, calling nvidia-smi on the running node shows that both GPUs are in use, with a different process ID assigned to each GPU (but with low GPU utilization).
>>
>> I’ve tried many different combinations. Some of them don’t even run; the ones that do give poor performance. Does anyone have any tips?
>>
>> Best,
>> Victor
>>
>>
