Re: replica exchange and GPU acceleration

From: Mitchell Gleed (aliigleed16_at_gmail.com)
Date: Wed Jul 01 2015 - 11:01:50 CDT

Thank you, Norman. The info about sharing GPUs is very helpful.

I didn't have any success adding a minimization step before REMD to either the
alanin_base.conf or the replica.namd script. In both cases I get errors like
"0:18(1) atom 18 found 6 exclusions but expected 4" at nearly every
minimization step. Minimization completes despite these errors, but the
simulations then crash at step 0 with velocity errors. I've run into the same
error in both the /lib/replica/example and /lib/replica/umbrella test cases.
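
For reference, the minimization I added was roughly the following, placed just
before the REMD run begins (the temperature variable here is only a placeholder
for whatever the script actually uses):

   minimize 1000             ;# relax the structure before heating
   reinitvels $temperature   ;# placeholder: reassign velocities at the replica's target temperature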

I found another thread discussing a similar warning (atom X found Y exclusions
but expected Z), where the cause turned out to be an improper resource request
for the nodes. However, I haven't had any issues running standard
GPU-accelerated jobs, only these REMD jobs, and I've been careful to request
the right resources in SLURM and to make sure the command launching NAMD
matches that request.
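
For what it's worth, my SLURM request and launch command look roughly like this
(one MPI rank per replica; job0.conf stands in for the actual config in the
example directory):

   #SBATCH --nodes=1
   #SBATCH --ntasks=4
   #SBATCH --gres=gpu:4

   mpirun -np 4 namd2 +idlepoll +replicas 4 job0.conf +stdout output/%d/job0.%d.log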

If anyone has any other ideas, let me know. Norman, would you mind sharing a
bit more information about how you get it to run (MPI type/version,
architecture, commands to launch NAMD for REMD with GPUs, etc.)?

Thanks again,

Mitch

On Wed, Jul 1, 2015 at 12:13 AM, Norman Geist <
norman.geist_at_uni-greifswald.de> wrote:

>
>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] *On
> Behalf Of *Mitchell Gleed
> *Sent:* Wednesday, July 01, 2015 1:09 AM
> *To:* NAMD list
> *Subject:* namd-l: replica exchange and GPU acceleration
>
>
>
> Hi,
>
>
>
> Hey
>
>
>
> I'm struggling to get GPU-accelerated replica-exchange jobs to run. I
> thought that maybe this wasn't possible, but I found some older posts in
> which some users reported success with it.
>
>
>
> In the NAMD_2.10/lib/replica/umbrella directory, after tuning a couple of
> parameters in the configuration files, launching a replica-exchange job with
> a standard MPI NAMD build (mpirun -np 4 namd2 +replicas 4 ... etc.) works
> fine. However, when I jump on a node with 4 GPUs (2x Tesla K80) and use a
> CUDA-enabled MPI NAMD build (mpirun -np 4 namd2 +idlepoll +replicas 4 ...), I
> get atom velocity errors at startup. These are preceded, after load
> balancing, by several warnings similar to the following: "0:18(1) atom 18
> found 6 exclusions but expected 4"
>
>
>
> I've run a lot of GPU REMD jobs successfully, so it is possible and works
> fine.
>
>
>
> The CUDA-enabled MPI build runs fine for non-REMD runs and provides great
> acceleration to my simulations. I recognize the test case here might not be
> best suited for CUDA acceleration due to its small size, but it should at
> least run, right?
>
>
>
> Now, I notice the log files for each replica all say "Pe 0 physical rank 0
> binding to CUDA device 0," as if each replica is trying to use the same GPU
> (device 0). This persists even with +devices 0,1,2,3 (the exact command I
> tried is shown below). I suspect I could solve the problem if each of the
> four replicas bound to a different GPU, but I don't know how to assign GPUs
> to a specific replica. Should each replica have its own GPU, or can a GPU be
> shared by multiple replicas? Ideally I'd like to use one GPU node with 4 GPUs
> to run 16 replicas, or 4 GPU nodes with 16 GPUs, one per replica, if the
> former isn't possible.
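>
> For completeness, the device-binding attempt mentioned above was roughly:
>
>    mpirun -np 4 namd2 +idlepoll +replicas 4 +devices 0,1,2,3 job0.conf +stdout output/%d/job0.%d.log
>
> yet every replica still reported binding to CUDA device 0.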
>
>
>
> Sharing GPUs between replicas is OK. If you have multiple replicas per node,
> you can't actually prevent it. You can of course oversubscribe your nodes as
> long as memory is sufficient; you just need to use #replicas*2 processes. You
> shouldn't need to care about GPU binding at all.
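>
> For example, 16 replicas on a single 4-GPU node could be launched roughly
> like this (job0.conf standing in for your actual config file):
>
>    mpirun -np 32 namd2 +idlepoll +replicas 16 job0.conf +stdout output/%d/job0.%d.log
>
> i.e. #replicas*2 = 32 processes sharing the node's four GPUs.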
>
>
>
> Does anyone have any advice to resolve the problem I'm running into?
>
>
>
> The warning and error you observe might be due to starting a high-temperature
> run without a preceding minimization. Try minimizing the system before
> starting the REMD. I actually extended the replica.namd script with a
> configurable minimization feature to get around such issues.
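>
> As a rough sketch, the addition to replica.namd looks something like the
> following (variable names here are just placeholders, not the actual ones
> from my script):
>
>    if { [info exists min_steps] && $min_steps > 0 } {
>       minimize $min_steps        ;# relax the system before the REMD loop
>       reinitvels $replica_temp   ;# placeholder: velocities at this replica's temperature
>    }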
>
>
>
> Regards
>
>
>
> Norman Geist
>
>
>
> Thanks!
>
>
>
> Mitch Gleed
>
