replica exchange and GPU acceleration

From: Mitchell Gleed (aliigleed16_at_gmail.com)
Date: Tue Jun 30 2015 - 18:09:27 CDT

Hi,

I'm struggling to get GPU-accelerated replica-exchange jobs to run. I
thought that maybe this wasn't possible, but I found some older posts in
which some users reported success with it.

In the NAMD_2.10/lib/replica/umbrella directory, after tuning a couple of
parameters in the configuration files, launching a replica-exchange job
with a standard MPI NAMD build (mpirun -np 4 namd2 +replicas 4 ... etc.)
works fine. However, when I jump on a node with 4 GPUs (2x Tesla K80) and
use a CUDA-enabled MPI NAMD build (mpirun -np 4 namd2 +idlepoll +replicas
4 ...), I get atom velocity errors at startup. These are preceded by
several warnings, after load balancing, similar to the following:
"0:18(1) atom 18 found 6 exclusions but expected 4"

The CUDA-enabled MPI build runs fine for non-REMD runs and provides great
acceleration to my simulations. I recognize the test case here might not be
the best-suited for CUDA acceleration due to its small size, but it should
at least run, right?

Now, I notice that the log files for each replica all say "Pe 0 physical
rank 0 binding to CUDA device 0," as if every replica is trying to use the
same GPU (device 0). This persists even with +devices 0,1,2,3. I suspect I
could solve the problem if I got each of the four replicas to bind to a
different GPU, but I don't know how to assign a GPU to a specific replica.
Should each replica have its own GPU, or can one GPU be shared by multiple
replicas? Ideally I'd like to use 1 GPU node with 4 GPUs to run 16
replicas, or, if that isn't possible, 4 GPU nodes with 16 GPUs in total,
one per replica.
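
To make the sharing scenario concrete, here is a minimal Python sketch of
the round-robin replica-to-GPU mapping I have in mind. It only illustrates
the arithmetic: the function name assign_devices is hypothetical, and
nothing here is a NAMD option or API.

```python
def assign_devices(n_replicas, n_gpus):
    """Map each replica index to a GPU index, round-robin (illustrative only)."""
    if n_gpus < 1:
        raise ValueError("need at least one GPU")
    # Replica i goes to device i mod n_gpus, so replicas are spread evenly.
    return {replica: replica % n_gpus for replica in range(n_replicas)}


if __name__ == "__main__":
    # Hypothetical case from this post: 16 replicas on one node with 4 GPUs.
    for replica, device in assign_devices(16, 4).items():
        print(f"replica {replica:2d} -> CUDA device {device}")
```

With 16 replicas and 4 GPUs, each device would serve four replicas; with
16 GPUs the mapping becomes one replica per device.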

Does anyone have any advice to resolve the problem I'm running into?

Thanks!

Mitch Gleed
