From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Jul 03 2015 - 03:16:17 CDT
Can you tell me how many cores, nodes, GPUs and replicas you are trying to use? Some of those numbers are required to be multiples of each other.
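A minimal sketch of the kind of sanity check meant here, written in Python. It assumes that the process count has to divide evenly across the replicas; the per-node and per-GPU checks are additional assumptions for illustration, not documented NAMD rules, and the function name `check_replica_layout` is hypothetical.

```python
def check_replica_layout(processes, replicas, nodes=1, gpus_per_node=0):
    """Return a list of problems found; an empty list means none were found."""
    if min(processes, replicas, nodes) < 1 or gpus_per_node < 0:
        return ["processes, replicas and nodes must be positive"]

    problems = []

    # Processes are split into equal partitions, one partition per replica.
    if processes % replicas != 0:
        problems.append(
            f"{processes} processes cannot be split evenly across {replicas} replicas"
        )

    # Assumption: the scheduler places the same number of processes on each node.
    if processes % nodes != 0:
        problems.append(
            f"{processes} processes cannot be spread evenly over {nodes} nodes"
        )
    elif gpus_per_node > 0:
        per_node = processes // nodes
        # Assumption: GPUs are shared evenly only if one count divides the other.
        if per_node % gpus_per_node != 0 and gpus_per_node % per_node != 0:
            problems.append(
                f"{per_node} processes per node do not map evenly onto "
                f"{gpus_per_node} GPUs per node"
            )

    return problems
```

For example, `check_replica_layout(16, 16, nodes=1, gpus_per_node=4)` returns an empty list, while `check_replica_layout(6, 4)` reports that 6 processes cannot be split across 4 replicas.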
From: Mitchell Gleed [mailto:aliigleed16_at_gmail.com]
Sent: Wednesday, July 01, 2015 6:02 PM
To: NAMD list; Norman Geist
Subject: Re: namd-l: replica exchange and GPU acceleration
Thank you, Norman. The info about sharing GPUs is very helpful.
I didn't have any success upon adding a minimization line before REMD to either the alanin_base.conf or the replica.namd scripts. In either case, I get errors like "0:18(1) atom 18 found 6 exclusions but expected 4" at nearly every minimization step. Although minimization runs successfully despite these errors, simulations just crash at step 0 with velocity errors. I've run into the same error in /lib/replica/example and /lib/replica/umbrella test cases.
I found another thread where they talk about a similar warning (atom X found Y exclusions but expected Z) and they found that their error was an improper resource request for their nodes, but I haven't had any issues before with running standard GPU-accelerated jobs, just these REMD-type jobs. And I've been quite cautious to request the right resources in SLURM and make sure the command to launch NAMD matches.
If anyone has any other ideas, let me know. Norman, would you mind sharing a bit more information on how you get it to run? (MPI type/version, architecture, commands to launch NAMD for REMD with GPU, etc.)
On Wed, Jul 1, 2015 at 12:13 AM, Norman Geist <norman.geist_at_uni-greifswald.de> wrote:
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf Of Mitchell Gleed
Sent: Wednesday, July 01, 2015 1:09 AM
To: NAMD list
Subject: namd-l: replica exchange and GPU acceleration
I'm struggling to get GPU-accelerated replica-exchange jobs to run. I thought that maybe this wasn't possible, but I found some older posts in which some users reported success with it.
In the NAMD_2.10/lib/replica/umbrella directory, after tuning a couple of parameters in the configuration files, if I launch a replica exchange job with a standard MPI NAMD build (mpirun -np 4 namd2 +replicas 4 ... etc) things work fine. However, when I jump on a node with 4 GPUs (2 x Tesla K80) and use a CUDA-enabled MPI NAMD build (mpirun -np 4 namd2 +idlepoll +replicas 4 ...), I get atom velocity errors at startup. These are preceded by several warnings, after load balancing, similar to the following: "0:18(1) atom 18 found 6 exclusions but expected 4"
I’ve run a lot of GPU REMD jobs successfully, so it is possible and works fine.
The CUDA-enabled MPI build runs fine for non-REMD runs and provides great acceleration to my simulations. I recognize the test case here might not be the best suited for CUDA acceleration due to its small size, but it should at least run, right?
Now, I notice the log files for each replica all say "Pe 0 physical rank 0 binding to CUDA device 0," as if each replica is trying to use the same GPU (device 0). (This persists even with +devices 0,1,2,3.) I suspect I could solve the problem if I got each of the four replicas to bind to a different GPU, but I don't know how to assign GPUs to a specific replica. I assume that each replica should have its own GPU, right, or can a GPU be shared by multiple replicas? Ideally I'd like to be able to use 1 GPU node with 4 GPUs to run 16 replicas, or 4 GPU nodes for 16 GPUs, one per replica, if the former isn't possible.
Sharing GPUs between replicas is OK. If you have multiple replicas per node, you can’t actually prevent it. You can of course oversubscribe your nodes as long as memory is sufficient; you just need to use #replicas*2 processes. You shouldn’t need to care about GPU binding at all.
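To picture what sharing means for the 16-replica, 4-GPU case asked about above, here is a small Python sketch. It assumes a simple round-robin assignment of replicas to devices purely for illustration; NAMD chooses device bindings internally and its actual assignment may differ. The function name `share_gpus` is hypothetical.

```python
from collections import defaultdict


def share_gpus(replicas, gpus):
    """Map each GPU index to the replica indices that would share it.

    Assumption: replicas are dealt out round-robin over the devices.
    This only illustrates the sharing ratio, not NAMD's real binding logic.
    """
    if replicas < 1 or gpus < 1:
        raise ValueError("replicas and gpus must both be positive")

    layout = defaultdict(list)
    for replica in range(replicas):
        layout[replica % gpus].append(replica)
    return dict(layout)
```

Calling `share_gpus(16, 4)` places four replicas on each device, so the memory of each GPU has to hold four replicas' worth of data at once.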
Does anyone have any advice to resolve the problem I'm running into?
The warning and error you observe might be due to starting a high-temperature run without a preceding minimization run. Try minimizing the system before starting the REMD. I actually extended the replica.namd script with a configurable minimization feature to get around such issues.
This archive was generated by hypermail 2.1.6 : Thu Dec 31 2015 - 23:21:57 CST