Re: not getting NAMD multicopy simulation started

From: René Hafner TUK (hamburge_at_physik.uni-kl.de)
Date: Wed Nov 25 2020 - 11:35:24 CST

Dear Joshua,

I tried the same as you showed below on two different clusters at hand.

In both cases I get a segfault with the precompiled
NAMD_2.14_Linux-x86_64-verbs-smp-CUDA version.

I will give it a try with a self-compiled version.

See my Slurm submission script below:

"""

#!/bin/sh
#SBATCH --job-name=SimID0145_w3_molDIC_s1_run2_copy4_testntasks
#SBATCH --mem=40g
#SBATCH --partition=gpu
# timing: -t [min] OR -t [days-hh:mm:ss]
#SBATCH -t 0-01:00:00
# mail alert at start, end and abort of execution
#SBATCH --mail-type=ALL
# output file
#SBATCH -o slurmoutput/JOBID_%j.out
# error file
#SBATCH -e slurmoutput/JOBID_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:t4:3
#SBATCH --exclusive
#SBATCH --nodelist=node212

script_path=$(pwd)
conf_file=namd_SimID0145_abf_molDIC_siteID1_run2_copy4.window3.conf
log_path=$script_path
log_file=log_SimID0145_abf_molDIC_siteID1_run2_copy4.window3.replica%d.log

NAMD214pathCudaMultiCopy="/p/opt/BioSorb/hiwi/software/namd/namd_binaries_benchmark/NAMD_2.14_Linux-x86_64-verbs-smp-CUDA"

srun $NAMD214pathCudaMultiCopy/namd2 +ppn 8 +replicas 3 \
    $script_path/$conf_file +ignoresharing +stdout $log_path/$log_file

"""

On 11/25/2020 5:11 PM, Josh Vermaas wrote:
> Hi Rene,
>
> The expedient thing to do is usually just to go with +ignoresharing.
> It *should* also be possible for this to work if +ppn is set
> correctly. This is a run script that I've used in a Slurm environment
> to correctly map GPUs on a 2-socket, 4-GPU-per-node system, where I was
> oversubscribing the GPUs (64 replicas, only 32 GPUs):
>
> #!/bin/bash
> #SBATCH --gres=gpu:4
> #SBATCH --nodes=8
> #SBATCH --ntasks=64
> #SBATCH --cpus-per-task=6
> #SBATCH --gpu-bind=closest
> #SBATCH --time=4:0:0
>
> set -x
> module load gompi/2020a CUDA
>
> cd $SLURM_SUBMIT_DIR
>
> # This isn't obvious, but this is a Linux-x86_64-ucx-smp-CUDA build
> # compiled from source.
> srun $HOME/NAMD_2.14_Source/Linux-x86_64-g++/namd2 \
>     +ppn 6 +replicas 64 run0.namd +stdout %d/run0.%d.log
>
> It worked out that each replica had 6 dedicated cores, which is where
> the +ppn 6 came from. Thus, even though each replica saw multiple GPUs
> (--gpu-bind=closest meant that each replica saw the 2 GPUs closest to
> the CPU socket its 6 cores came from, rather than all 4 on the node),
> I didn't need to specify devices or +ignoresharing.
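>
> If it helps, a quick sanity check of the binding (a sketch, not from
> my original run script) is to print what each task sees:
>
>     srun bash -c 'echo "task $SLURM_PROCID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
>
> With --gpu-bind=closest, each line should list only the GPUs on that
> task's socket.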
>
>
> Hope this helps!
>
>
> -Josh
>
>
>
> On Wed, Nov 25, 2020 at 6:47 AM René Hafner TUK
> <hamburge_at_physik.uni-kl.de> wrote:
>
> Update:
>
> I am ONLY able to run both the NAMD2.13 and NAMD3alpha7
> netlrts-smp-CUDA versions with
>
>     +p2 +replicas 2, i.e. 1 core per replica.
>
> *But as soon as I use more than 1 core per replica, it fails.*
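>
> For concreteness, the working invocation looks like this (a sketch
> following the notes.html syntax; myconf_file.conf stands in for my
> actual config file):
>
>     charmrun ++local namd2 myconf_file.conf +p2 +replicas 2 +stdout logfile%d.log
>
> while e.g. +p4 +replicas 2 (2 cores per replica) already fails.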
>
>
> Has anyone ever experienced this?
>
> Any hints are appreciated!
>
>
> Kind regards
>
> René
>
>
> On 11/23/2020 2:22 PM, René Hafner TUK wrote:
>> Dear all,
>>
>>
>>  I am trying to get an (e)ABF simulation running with the multi-copy
>> algorithm on a multi-GPU node.
>>
>> I tried it as described in
>> http://www.ks.uiuc.edu/Research/namd/2.13/notes.html :
>>
>>         charmrun ++local namd2 myconf_file.conf +p16 +replicas 2 +stdout logfile%d.log
>>
>>
>> I am using the precompiled binaries from the Download page: NAMD
>> 2.13 Linux-x86_64-netlrts-smp-CUDA (Multi-copy algorithms, single
>> process per copy)
>>
>> And for both NAMD2.13 and NAMD2.14 I get the error:
>>
>> FATAL ERROR: Number of devices (2) is not a multiple of number of
>> processes (8).  Sharing devices between processes is inefficient
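>>
>> A variant I could try next (a sketch; I assume the +ignoresharing
>> flag makes NAMD accept devices being shared between processes) would be:
>>
>>         charmrun ++local namd2 myconf_file.conf +p16 +replicas 2 +ignoresharing +stdout logfile%d.log
>>
>> but I am not sure whether sharing devices this way is advisable.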
>

--
Dipl.-Phys. René Hafner
TU Kaiserslautern
Germany
