From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Mon Jul 06 2015 - 01:38:11 CDT
Ok, as far as I remember you need at least 2 processes per replica, so you’d need 32 cores to run 16 replicas.
So, going by your numbers, does each of your nodes have 4 cores and 4 GPUs?
What’s the output of “cat /proc/cpuinfo” from one of the compute nodes?
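
If the full cpuinfo dump is too long to post, a short summary is enough. This is only a sketch: the function names count_cores and count_gpus are made up for illustration, and it assumes a Linux node with /proc/cpuinfo and, for the GPU count, an installed nvidia-smi.

```shell
#!/bin/bash
# Count logical CPUs listed in a cpuinfo-style file (default: /proc/cpuinfo).
# count_cores is a hypothetical helper name, not part of NAMD or SLURM.
count_cores() {
    local file="${1:-/proc/cpuinfo}"
    grep -c '^processor' "$file"
}

# Count GPUs reported by "nvidia-smi -L"; print 0 if the tool is missing.
count_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L | grep -c '^GPU '
    else
        echo 0
    fi
}

# Print a one-line summary when run on a Linux node.
if [ -r /proc/cpuinfo ]; then
    echo "cores: $(count_cores)  gpus: $(count_gpus)"
fi
```

Run it inside the allocation (for example through salloc) so the numbers reflect the compute node and not the login node.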
 
Norman Geist.
 
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf Of Mitchell Gleed
Sent: Saturday, July 04, 2015 5:49 PM
To: NAMD list; Norman Geist
Subject: Re: namd-l: replica exchange and GPU acceleration
 
I'd like to use 16 cores, 1 node, 4 gpus, 16 replicas. For most of the testing, I've tried with 4 cores, 1 & 4 nodes, 4 gpus, 4 replicas with the same trouble.
 
Example slurm request: 
salloc --nodes=1 --ntasks=4 --gres=gpu:4 --mem-per-cpu=2048M -t24:00:00
 
Example namd launch scheme: 
#!/bin/bash
cd NAMD_2.10_Source/lib/replica/example
export procs=4
module load namd/2.10_cuda-6.5.14_openmpi-1.6.5_gnu-4.8.2
mpirun -np $procs $(which namd2) +replicas $procs job0.conf +stdout output/%d/job0.%d.log # relevant namd/replica scripts use $::env(procs) replicas
 
Again, I can get these things to work fine following this pattern on a non-CUDA build, but with the CUDA build I run into errors. I've used the CUDA build fine for non-REMD simulations.
 
Mitch
 
On Fri, Jul 3, 2015 at 1:16 AM, Norman Geist <norman.geist_at_uni-greifswald.de> wrote:
Can you tell how many cores, nodes, gpus and replicas you are trying to use? There are some requirements on those numbers to be multiples of each other.
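
A minimal sketch of such a sanity check, assuming the requirement is simply that the number of MPI processes is a whole multiple of the number of replicas. That reading is an assumption, and check_layout is a hypothetical helper name.

```shell
#!/bin/bash
# check_layout PROCS REPLICAS
# Succeeds only if PROCS is a positive whole multiple of REPLICAS.
# The divisibility rule is an assumption about the requirement mentioned above.
check_layout() {
    local procs=$1 replicas=$2
    if [ "$replicas" -le 0 ] || [ "$procs" -le 0 ] || [ $((procs % replicas)) -ne 0 ]; then
        echo "bad: $procs processes cannot be split evenly over $replicas replicas" >&2
        return 1
    fi
    echo "ok: $((procs / replicas)) process(es) per replica"
}

check_layout 4 4
```

Calling it before mpirun in the job script would catch a mismatched request before NAMD starts.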
 
Norman Geist.
 
From: Mitchell Gleed [mailto:aliigleed16_at_gmail.com] 
Sent: Wednesday, July 01, 2015 6:02 PM
To: NAMD list; Norman Geist
Subject: Re: namd-l: replica exchange and GPU acceleration
 
Thank you, Norman. The info about sharing GPUs is very helpful. 
 
I didn't have any success upon adding a minimization line before REMD to either the alanin_base.conf or the replica.namd scripts. In either case, I get errors like "0:18(1) atom 18 found 6 exclusions but expected 4" at nearly every minimization step. Although minimization runs successfully despite these errors, simulations just crash at step 0 with velocity errors. I've run into the same error in /lib/replica/example and /lib/replica/umbrella test cases. 
 
I found another thread where they talk about a similar warning (atom X found Y exclusions but expected Z) and they found that their error was an improper resource request for their nodes, but I haven't had any issues before with running standard GPU-accelerated jobs, just these REMD-type jobs. And I've been quite cautious to request the right resources in SLURM and make sure the command to launch NAMD matches.
 
If anyone has any other ideas, let me know. Norman, would you mind sharing a bit more information on how you get it to run (MPI type/version, architecture, commands to launch NAMD for REMD with GPUs, etc.)?
 
Thanks again,
 
Mitch
 
On Wed, Jul 1, 2015 at 12:13 AM, Norman Geist <norman.geist_at_uni-greifswald.de> wrote:
 
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf Of Mitchell Gleed
Sent: Wednesday, July 01, 2015 1:09 AM
To: NAMD list
Subject: namd-l: replica exchange and GPU acceleration
 
Hi,
 
Hey
 
I'm struggling to get GPU-accelerated replica-exchange jobs to run. I thought that maybe this wasn't possible, but I found some older posts in which some users reported success with it.
 
In the NAMD_2.10/lib/replica/umbrella directory, after tuning a couple of parameters in the configuration files, if I try to launch a replica exchange job with a standard MPI NAMD build (mpirun -np 4 namd2 +replicas 4 ... etc.), things work fine. However, when I jump on a node with 4 GPUs (2x Tesla K80) and use a CUDA-enabled MPI NAMD build (mpirun -np 4 namd2 +idlepoll +replicas 4 ...), I get atom velocity errors at startup. These are preceded by several warnings, after load balancing, similar to the following: "0:18(1) atom 18 found 6 exclusions but expected 4" 
 
I’ve run a lot of GPU REMD jobs successfully, so it’s possible and running fine.
 
The CUDA-enabled mpi build runs fine for non-REMD runs and provides great acceleration to my simulations. I recognize the test case here might not be the best-suited for CUDA acceleration due to its small size, but it should at least run, right?
 
Now, I notice the log files for each replica all say "Pe 0 physical rank 0 binding to CUDA device 0," as if each replica is trying to use the same GPU (device 0). (This persists even with +devices 0,1,2,3.) I suspect I could solve the problem if I got each of the four replicas to bind to a different GPU, but I don't know how to assign GPUs to a specific replica. I assume that each replica should have its own GPU, right? Or can a GPU be shared by multiple replicas? Ideally I'd like to use 1 GPU node with 4 GPUs to run 16 replicas, or, if that isn't possible, 4 GPU nodes with 16 GPUs in total, one per replica.
 
Sharing GPUs between replicas is ok. If you have multiple replicas per node you can’t actually prevent it. You can of course oversubscribe your nodes as long as memory is sufficient; you just need to use #replicas*2 processes. You shouldn’t need to care about GPU binding at all.
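
As a rough sketch of the #replicas*2 rule: the helper name procs_for_replicas is made up, and the file names are taken from the example directory mentioned in this thread. The command is only printed here, not executed.

```shell
#!/bin/bash
# Number of replicas to run; 16 is just the value discussed in this thread.
replicas=${replicas:-16}

# procs_for_replicas N -> 2*N, i.e. two MPI processes per replica.
# Hypothetical helper; the factor 2 is the rule of thumb stated above.
procs_for_replicas() {
    echo $(( $1 * 2 ))
}

procs=$(procs_for_replicas "$replicas")

# Only print the launch line so it can be inspected before use.
echo "mpirun -np $procs namd2 +replicas $replicas job0.conf +stdout output/%d/job0.%d.log"
```

Note that the replica count and the process count are separate values here, so any script that reads the replica count from a single environment variable would need adjusting.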
 
Does anyone have any advice to resolve the problem I'm running into?
 
The warning and error you observe might be due to starting a high-temperature run without a preceding minimization run. Try minimizing the system before starting the REMD. I actually extended the replica.namd script with a configurable minimization feature to get around such issues.
 
Regards
 
Norman Geist
 
Thanks!
 
Mitch Gleed
 
 