Getting high performance on multi-copy (replica) GPU simulations

From: Victor Zhao (yzhao01_at_g.harvard.edu)
Date: Tue Sep 22 2020 - 11:12:58 CDT

Hello,

I’ve been running replica exchange simulations on GPUs with NAMD (NAMD_2.14_Linux-x86_64-netlrts-smp-CUDA) on a computing cluster managed by SLURM. I currently get good performance (days/ns comparable to single-GPU performance) when I use 1 GPU per compute node, which means, for instance, 8 nodes for 8 replicas. Each of my nodes actually has 2 GPUs. If I run multiple replica exchange simulations at once, I can still use my GPUs efficiently because jobs are assigned to nodes in a round-robin fashion: my first job uses the first GPU of several nodes, and the second job can use the second GPU of those same nodes.

However, I cannot find the right way to launch my replica exchange simulations so that they use more than 1 GPU per node and still perform well (I still want to run 1 GPU per replica). This matters to me because there are other GPU computing resources where my current way of running would not be viable, since the computing resources there are shared.

Each time I try out SLURM and NAMD options to use more than 1 GPU/node, I get bad performance.

Here’s what works for 2-replica test simulations (1 GPU/node):

```
sbatch -N 2 --gres=gpu:1 --mem-per-cpu 500 --ntasks-per-node 1 --cpus-per-task 8 submit_namd.sh
# where submit_namd.sh contains
$NAMDDIR/charmrun ++mpiexec +p 16 ++ppn 8 $NAMDDIR/namd2 +setcpuaffinity +replicas 2 run.conf +stdout output/%d/log_rest.%d.log
```
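
For reference, here is a minimal sketch of how the `+p` value in the command above relates to the other options. It assumes one NAMD process per replica, so that `+p` equals the number of replicas times the `++ppn` value (2 x 8 = 16 here). The helper name `total_pes` is hypothetical and is not part of NAMD or charmrun.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total PEs to pass to charmrun's +p, assuming
# one process per replica, i.e. +p = replicas * ppn.
total_pes() {
    local replicas=$1 ppn=$2
    echo $(( replicas * ppn ))
}

replicas=2   # value passed to +replicas
ppn=8        # value passed to ++ppn (matches --cpus-per-task)

# Print the option string that would go on the charmrun/namd2 command line.
echo "+p $(total_pes "$replicas" "$ppn") ++ppn $ppn +replicas $replicas"
```

With `replicas=2` and `ppn=4` the same arithmetic gives `+p 8`, which is the value used in the single-node attempt.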

Here’s what gives poor performance (2 GPUs/node, 2-replica test simulation):
```
sbatch -N 1 --gres=gpu:2 --mem-per-cpu 500 --ntasks 2 --cpus-per-task 4 submit_namd.sh
# where submit_namd.sh contains
$NAMDDIR/charmrun ++mpiexec +p 8 ++ppn 4 $NAMDDIR/namd2 +setcpuaffinity +replicas 2 +devicesperreplica 1 run.conf +stdout output/%d/log_rest.%d.log
```

The latter performs at about 5% of the speed of the former! Surprisingly, calling nvidia-smi on the running node shows that both GPUs are in use, with a different process ID assigned to each GPU (but with low GPU utilization).
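
One thing that could be checked in this situation, offered only as an untested assumption and not as a confirmed cause, is whether the threads of the two namd2 processes end up pinned to the same CPU cores when `+setcpuaffinity` is used with both processes on one node. The sketch below uses the standard `pgrep` and `taskset` tools. The function names are hypothetical and not part of NAMD or SLURM.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of NAMD or SLURM.

# Expand a CPU list such as "0-3,8" into one CPU number per line.
expand_cpulist() {
    echo "$1" | tr ',' '\n' | while read -r part; do
        case $part in
            *-*) seq "${part%-*}" "${part#*-}" ;;
            *)   echo "$part" ;;
        esac
    done
}

# Print the CPUs that appear in both lists; no output means no overlap.
cpulist_overlap() {
    local a b
    for a in $(expand_cpulist "$1"); do
        for b in $(expand_cpulist "$2"); do
            if [ "$a" = "$b" ]; then echo "$a"; fi
        done
    done
}

# Print the affinity list of every thread of every running namd2 process.
# Meant to be run on the compute node while the job is active.
show_namd_affinity() {
    local pid
    for pid in $(pgrep -x namd2); do
        taskset -acp "$pid"
    done
}
```

Running `show_namd_affinity` on the compute node lists the cores each thread is bound to. Passing the lists reported for the two processes to `cpulist_overlap` then prints any shared cores, one per line. If nothing is printed, the two processes are not competing for the same cores and the slowdown has another cause.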

I’ve tried many different combinations. Some of them don’t even run, and the ones that do give poor performance. Does anyone have any tips on running more than one replica per node?

Best,
Victor
