From: Mitchell Gleed (aliigleed16_at_gmail.com)
Date: Tue Jun 30 2015 - 18:09:27 CDT
I'm struggling to get GPU-accelerated replica-exchange jobs to run. I
thought that maybe this wasn't possible, but I found some older posts in
which some users reported success with it.
In the NAMD_2.10/lib/replica/umbrella directory, after tuning a couple of
parameters in the configuration files, a replica-exchange job launches fine
with a standard MPI NAMD build (mpirun -np 4 namd2 +replicas 4 ... etc.).
However, when I jump on a node with 4 GPUs (2x Tesla K80) and use a
CUDA-enabled MPI NAMD build (mpirun -np 4 namd2 +idlepoll +replicas 4 ...),
I get atom-velocity errors at startup. These are preceded by several
warnings, after load balancing, similar to the following:
"0:18(1) atom 18 found 6 exclusions but expected 4"
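For reference, the launch pattern I'm following matches the replica-exchange examples shipped with NAMD (the config and log file names below are illustrative, not my exact files):

```shell
# Standard MPI build: runs fine
mpirun -np 4 namd2 +replicas 4 job0.conf +stdout output/%d/job0.%d.log

# CUDA-enabled MPI build on the GPU node: fails at startup
mpirun -np 4 namd2 +idlepoll +replicas 4 job0.conf +stdout output/%d/job0.%d.log
```

Each replica writes its own log through the %d substitution in +stdout, which is how I can see the per-replica "binding to CUDA device" messages mentioned below.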
The CUDA-enabled MPI build works fine for non-REMD runs and provides great
acceleration for my simulations. I recognize this test case might not be
best suited to CUDA acceleration due to its small size, but it should at
least run, right?
Now, I notice the log files for each replica all say "Pe 0 physical rank 0
binding to CUDA device 0," as if every replica is trying to use the same GPU
(device 0). This persists even with +devices 0,1,2,3. I suspect I could
solve the problem if each of the four replicas bound to a different GPU, but
I don't know how to assign a GPU to a specific replica. Should each replica
have its own GPU, or can a GPU be shared by multiple replicas? Ideally I'd
like to use one GPU node with 4 GPUs to run 16 replicas, or, if the former
isn't possible, 4 GPU nodes for 16 GPUs, one GPU per replica.
Does anyone have any advice to resolve the problem I'm running into?
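One workaround I've been considering is a per-rank wrapper script that restricts each MPI process to a single GPU via CUDA_VISIBLE_DEVICES before namd2 starts. This is only a sketch: it assumes OpenMPI (which exports OMPI_COMM_WORLD_RANK to every launched process), and the script name and GPU count are placeholders for my setup:

```shell
#!/bin/bash
# Hypothetical wrapper: save as e.g. gpubind.sh and launch with
#   mpirun -np 4 ./gpubind.sh namd2 +idlepoll +replicas 4 ...
# Assumes OpenMPI, which sets OMPI_COMM_WORLD_RANK for each process.
RANK=${OMPI_COMM_WORLD_RANK:-0}   # MPI rank of this process (0 outside mpirun)
NGPUS=4                           # GPUs visible on the node (2x K80 = 4 devices)
DEV=$(( RANK % NGPUS ))           # round-robin ranks onto physical devices
export CUDA_VISIBLE_DEVICES=$DEV  # this process now sees only one GPU
echo "rank $RANK -> physical GPU $DEV"
exec "$@"                         # hand off to namd2 with its original arguments
```

If this is the right approach, I'd expect each replica to still report "binding to CUDA device 0", since CUDA renumbers the single visible device to 0 within each process, but the replicas would actually be on different physical GPUs.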
This archive was generated by hypermail 2.1.6 : Tue Dec 27 2016 - 23:21:12 CST