Re: not getting NAMD multicopy simulation started

From: Josh Vermaas (joshua.vermaas_at_gmail.com)
Date: Wed Nov 25 2020 - 10:11:58 CST

Hi Rene,

The expedient thing to do is usually just to go with +ignoresharing. It
*should* also be possible for this to work if +ppn is set correctly. This
is a run script that I've used in a Slurm environment to correctly map GPUs
on 2-socket, 4-GPU nodes, where I was oversubscribing the GPUs (64
replicas, only 32 GPUs):

#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --nodes=8
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=6
#SBATCH --gpu-bind=closest
#SBATCH --time=4:0:0

set -x
module load gompi/2020a CUDA
cd "$SLURM_SUBMIT_DIR"

# This isn't obvious, but this is a Linux-x86_64-ucx-smp-CUDA build
# compiled from source.
srun $HOME/NAMD_2.14_Source/Linux-x86_64-g++/namd2 +ppn 6 +replicas 64 \
    run0.namd +stdout %d/run0.%d.log

It worked out that each replica had 6 dedicated cores, which is where the
+ppn 6 came from. Thus, even though each replica saw multiple GPUs
(--gpu-bind=closest meant that each replica saw the 2 GPUs closest to the
CPU its 6 cores came from, rather than all 4 on the node), I didn't need
to specify devices or +ignoresharing.
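
For anyone adapting this to a different machine, the layout arithmetic behind
the job above can be sketched as follows. The input values are copied from the
#SBATCH lines; the per-node core count is derived rather than stated anywhere
in this message, and the sketch assumes Slurm spreads the 64 tasks evenly
across the 8 nodes.

```shell
#!/bin/bash
# Layout arithmetic for the run script above (a sketch, not part of the job).
# Inputs are taken from the #SBATCH directives and the +replicas argument.
nodes=8               # --nodes
gpus_per_node=4       # --gres=gpu:4
replicas=64           # --ntasks / +replicas
cores_per_replica=6   # --cpus-per-task / +ppn

# Assumption: tasks are distributed evenly over the nodes.
replicas_per_node=$(( replicas / nodes ))
cores_per_node=$(( replicas_per_node * cores_per_replica ))
total_gpus=$(( nodes * gpus_per_node ))
replicas_per_gpu=$(( replicas / total_gpus ))

echo "replicas per node:     $replicas_per_node"
echo "cores needed per node: $cores_per_node"
echo "total GPUs:            $total_gpus"
echo "replicas per GPU:      $replicas_per_gpu"
```

If the cores needed per node exceed what a node actually has, the
--cpus-per-task value (and with it +ppn) has to shrink accordingly.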

Hope this helps!

-Josh

On Wed, Nov 25, 2020 at 6:47 AM René Hafner TUK <hamburge_at_physik.uni-kl.de>
wrote:

> Update:
>
> I am ONLY able to run both NAMD2.13 and NAMD3alpha7 netlrts-smp-CUDA
> versions with
>
> +p2 +replicas 2, i.e. 1 core per replica.
>
> *But as soon as I use more than 1 core per replica it fails.*
>
>
> Anyone ever experienced that?
>
> Any hints are appreciated!
>
>
> Kind regards
>
> René
>
>
> On 11/23/2020 2:22 PM, René Hafner TUK wrote:
>
> Dear all,
>
>
> I am trying to get an (e)ABF simulation running with the multi-copy
> algorithm on a multi-GPU node.
>
> I tried as described in
> http://www.ks.uiuc.edu/Research/namd/2.13/notes.html :
>
> charmrun ++local namd2 myconf_file.conf +p16 +replicas 2 +stdout
> logfile%d.log
>
>
> I am using the precompiled binaries from the Download page: NAMD 2.13
> Linux-x86_64-netlrts-smp-CUDA (Multi-copy algorithms, single process per
> copy)
>
> And for both NAMD2.13 and NAMD2.14 I get the error:
>
> FATAL ERROR: Number of devices (2) is not a multiple of number of
> processes (8). Sharing devices between processes is inefficient. Specify
> +ignoresharing (each process uses all visible devices) if not all devices
> are visible to each process, otherwise adjust number of processes to evenly
> divide number of devices, specify subset of devices with +devices argument
> (e.g., +devices 0,2), or multiply list shared devices (e.g., +devices
> 0,1,2,0).
>
> But even when using +devices 0,1 I obtain the same error. Why should the
> number of devices be a multiple of the number of processes at all?
>
> Shouldn't it be the other way around? 8 cores + 1 GPU PER replica for my
> example.
>
> Can anyone give me some support here?
>
>
> Kind regards
>
> René Hafner
>
> --
> Dipl.-Phys. René Hafner
> TU Kaiserslautern
> Germany
>
>
