Re: replica exchange and GPU acceleration

From: Mitchell Gleed (aliigleed16_at_gmail.com)
Date: Sat Jul 04 2015 - 10:49:22 CDT

I'd like to use 16 cores, 1 node, 4 GPUs, and 16 replicas. For most of the
testing, though, I've tried 4 cores (on both 1 node and 4 nodes), 4 GPUs,
and 4 replicas, and run into the same trouble.

Example slurm request:
salloc --nodes=1 --ntasks=4 --gres=gpu:4 --mem-per-cpu=2048M -t24:00:00
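
For the eventual 16-replica goal I'd presumably just scale the task count
(untested, assuming the same cluster defaults):
salloc --nodes=1 --ntasks=16 --gres=gpu:4 --mem-per-cpu=2048M -t24:00:00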

Example namd launch scheme:
#!/bin/bash
cd NAMD_2.10_Source/lib/replica/example
export procs=4
module load namd/2.10_cuda-6.5.14_openmpi-1.6.5_gnu-4.8.2
# relevant namd/replica scripts use $::env(procs) replicas
mpirun -np $procs $(which namd2) +replicas $procs job0.conf +stdout output/%d/job0.%d.log
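
For 16 replicas I'd expect the same pattern to scale (untested; +devices
only makes the per-node GPU assignment explicit, as mentioned below):
export procs=16
mpirun -np $procs $(which namd2) +replicas $procs +devices 0,1,2,3 job0.conf +stdout output/%d/job0.%d.log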

Again, I can get these things to work fine following this pattern on a
non-CUDA build, but with the CUDA build I run into errors. I've used the
CUDA build fine for non-REMD simulations.

Mitch

On Fri, Jul 3, 2015 at 1:16 AM, Norman Geist <norman.geist_at_uni-greifswald.de
> wrote:

> Can you tell us how many cores, nodes, GPUs, and replicas you are trying
> to use? Some of those numbers are required to be multiples of each
> other.
>
>
>
> Norman Geist.
>
>
>
> *From:* Mitchell Gleed [mailto:aliigleed16_at_gmail.com]
> *Sent:* Wednesday, July 01, 2015 6:02 PM
> *To:* NAMD list; Norman Geist
> *Subject:* Re: namd-l: replica exchange and GPU acceleration
>
>
>
> Thank you Norman. The info about sharing GPU's is very helpful.
>
>
>
> I didn't have any success adding a minimization line before REMD to
> either the alanin_base.conf or the replica.namd scripts. In either case,
> I get errors like "0:18(1) atom 18 found 6 exclusions but expected 4" at
> nearly every minimization step. Although minimization completes despite
> these errors, the simulations still crash at step 0 with velocity
> errors. I've run into the same error in both the /lib/replica/example
> and /lib/replica/umbrella test cases.
>
>
>
> I found another thread discussing a similar warning (atom X found Y
> exclusions but expected Z); in that case the cause turned out to be an
> improper resource request for their nodes. But I haven't had any issues
> before with standard GPU-accelerated jobs, only with these REMD-type
> jobs, and I've been careful to request the right resources in SLURM and
> to make sure the command used to launch NAMD matches the request.
>
>
>
> If anyone has any other ideas, let me know. Norman, would you mind
> sharing a bit more information about how you get it to run (MPI
> type/version, architecture, commands to launch NAMD for REMD with GPUs,
> etc.)?
>
>
>
> Thanks again,
>
>
>
> Mitch
>
>
>
> On Wed, Jul 1, 2015 at 12:13 AM, Norman Geist <
> norman.geist_at_uni-greifswald.de> wrote:
>
>
>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] *On
> Behalf Of *Mitchell Gleed
> *Sent:* Wednesday, July 01, 2015 1:09 AM
> *To:* NAMD list
> *Subject:* namd-l: replica exchange and GPU acceleration
>
>
>
> Hi,
>
>
>
> Hey
>
>
>
> I'm struggling to get GPU-accelerated replica-exchange jobs to run. I
> thought that maybe this wasn't possible, but I found some older posts in
> which some users reported success with it.
>
>
>
> In the NAMD_2.10/lib/replica/umbrella directory, after tuning a couple of
> parameters in the configuration files, a replica-exchange job launched
> with a standard MPI NAMD build (mpirun -np 4 namd2 ++replicas 4 ... etc.)
> works fine. However, when I jump on a node with 4 GPUs (2x Tesla K80) and
> use a CUDA-enabled MPI NAMD build (mpirun -np 4 namd2 +idlepoll
> +replicas 4 ...), I get atom velocity errors at startup. These are
> preceded by several warnings, after load balancing, similar to the
> following: "0:18(1) atom 18 found 6 exclusions but expected 4"
>
>
>
> I’ve run a lot of GPU REMD jobs successfully, so it is possible and does
> run fine.
>
>
>
> The CUDA-enabled mpi build runs fine for non-REMD runs and provides great
> acceleration to my simulations. I recognize the test case here might not be
> the best-suited for CUDA acceleration due to its small size, but it should
> at least run, right?
>
>
>
> Now, I notice the log files for each replica all say "Pe 0 physical rank 0
> binding to CUDA device 0," as if each replica is trying to use the same
> GPU (device 0). This persists even with +devices 0,1,2,3. I suspect I
> could solve the problem if I got each of the four replicas to bind to a
> different GPU, but I don't know how to assign a GPU to a specific
> replica. Should each replica have its own GPU, or can a GPU be shared by
> multiple replicas? Ideally I'd like to use 1 GPU node with 4 GPUs to run
> 16 replicas, or, if that isn't possible, 4 GPU nodes for 16 GPUs, one per
> replica.
>
>
>
> Sharing GPUs between replicas is OK. If you have multiple replicas per
> node you can’t actually prevent it. You can of course oversubscribe your
> nodes as long as memory is sufficient; you just need to use #replicas*2
> processes. You shouldn’t need to care about GPU binding at all.
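>
> For example (roughly; adjust to your cluster), 16 replicas on one node
> with two processes per replica would be something like
>
> mpirun -np 32 namd2 +replicas 16 job0.conf +stdout output/%d/job0.%d.log
>
> and the four GPUs are simply shared among the replicas.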
>
>
>
> Does anyone have any advice to resolve the problem I'm running into?
>
>
>
> The warning and error you observe might be due to starting a
> high-temperature run without a leading minimization run. Try minimizing
> the system before starting the REMD. I actually extended the replica.namd
> script with a configurable minimization feature to get around such
> issues.
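>
> As a rough sketch (the exact hook into replica.namd is my own
> modification; treat the step count and the temperature variable as
> placeholders), the idea is simply to relax the structure before the
> first velocities are assigned:
>
> minimize 2000            ;# remove bad contacts before heating
> reinitvels $mytemp       ;# placeholder: the replica's target temperature
>
> ahead of the usual run/exchange loop.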
>
>
>
> Regards
>
>
>
> Norman Geist
>
>
>
> Thanks!
>
>
>
> Mitch Gleed
>
>
>
