Re: not getting NAMD multicopy simulation started

From: Josh Vermaas (joshua.vermaas_at_gmail.com)
Date: Wed Nov 25 2020 - 13:22:55 CST

Segfaults might be because the verbs build depends on a library that
isn't installed on your system. Are there any error messages that come
first? Otherwise, the netlrts version with charmrun should behave
similarly, even if it doesn't integrate nicely with srun.
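
For your 3-replica, 8-cores-per-replica case, the netlrts launch would
look roughly like the sketch below (same command form as the 2.13 notes
quoted further down, where +p is replicas times cores per replica; the
binary path here is only a placeholder for wherever the
netlrts-smp-CUDA build lives on your cluster):

charmrun ++local $NAMD_NETLRTS_DIR/namd2 yourconf.conf +p24 +replicas 3 \
    +stdout logfile%d.log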

-Josh

On 11/25/20 10:35 AM, René Hafner TUK wrote:
>
> Dear Joshua,
>
> I tried the same as you showed below on two different clusters available to me.
>
> In both cases I get a segfault with the precompiled version of
> NAMD_2.14_Linux-x86_64-verbs-smp-CUDA
>
> I will give it a try with a self-compiled version.
>
>
> See my slurm submission script below:
>
> """
>
> #!/bin/sh
> #SBATCH --job-name=SimID0145_w3_molDIC_s1_run2_copy4_testntasks
> #SBATCH --mem=40g
> #SBATCH --partition=gpu
> #timing #sSBATCH -t [min] OR -t [days-hh:mm:ss]
> #SBATCH -t 0-01:00:00
> #sending mail
> # mail alert at start, end and abortion of execution
> #SBATCH --mail-type=ALL
> #output file
> #SBATCH -o slurmoutput/JOBID_%j.out
> #errorfile
> #SBATCH -e slurmoutput/JOBID_%j.err
> #SBATCH --nodes=1
> #SBATCH --ntasks=3
> #SBATCH --cpus-per-task=8
> #SBATCH --gres=gpu:t4:3
> #SBATCH --exclusive
> #SBATCH --nodelist=node212
>
>
> script_path=$(pwd)
> conf_file=namd_SimID0145_abf_molDIC_siteID1_run2_copy4.window3.conf
> log_path=$script_path
> log_file=log_SimID0145_abf_molDIC_siteID1_run2_copy4.window3.replica%d.log
>
> NAMD214pathCudaMultiCopy="/p/opt/BioSorb/hiwi/software/namd/namd_binaries_benchmark/NAMD_2.14_Linux-x86_64-verbs-smp-CUDA"
>
> srun $NAMD214pathCudaMultiCopy/namd2 +ppn 8 +replicas 3 \
>     $script_path/$conf_file +ignoresharing +stdout $log_path/$log_file
>
> """
>
>
>
> On 11/25/2020 5:11 PM, Josh Vermaas wrote:
>> Hi Rene,
>>
>> The expedient thing to do is usually just to go with +ignoresharing.
>> It *should* also be possible for this to work if +ppn is set
>> correctly. This is a runscript that I've used in a slurm environment
>> to correctly map GPUs on a 2-socket, 4-GPU-per-node system, where I
>> was oversubscribing the GPUs (64 replicas, only 32 GPUs across 8 nodes):
>>
>> #!/bin/bash
>>
>> #SBATCH --gres=gpu:4
>>
>> #SBATCH --nodes=8
>>
>> #SBATCH --ntasks=64
>>
>> #SBATCH --cpus-per-task=6
>>
>> #SBATCH --gpu-bind=closest
>>
>> #SBATCH --time=4:0:0
>>
>> set -x
>>
>> module load gompi/2020a CUDA
>>
>>
>> cd $SLURM_SUBMIT_DIR
>>
>> # This isn't obvious, but this is a Linux-x86_64-ucx-smp-CUDA build
>> # compiled from source.
>>
>> srun $HOME/NAMD_2.14_Source/Linux-x86_64-g++/namd2 +ppn 6 +replicas 64 \
>>     run0.namd +stdout %d/run0.%d.log
>>
>>
>>
>> It worked out that each replica had 6 dedicated cores, which is where
>> the +ppn 6 came from. Thus, even though each replica saw multiple GPUs
>> (--gpu-bind=closest meant that each replica saw the 2 GPUs closest to
>> the CPU socket its 6 cores came from, rather than all 4 on the node),
>> I didn't need to specify devices or use +ignoresharing.
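>>
>> If you want to double-check what each task actually sees under
>> --gpu-bind, a quick sanity check along these lines can help (assuming
>> your SLURM setup exports CUDA_VISIBLE_DEVICES for GPU allocations, as
>> most gres/gpu configurations do; SLURM_PROCID is set per task):
>>
>> srun bash -c 'echo "task $SLURM_PROCID: GPUs $CUDA_VISIBLE_DEVICES"'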
>>
>>
>> Hope this helps!
>>
>>
>> -Josh
>>
>>
>>
>> On Wed, Nov 25, 2020 at 6:47 AM René Hafner TUK
>> <hamburge_at_physik.uni-kl.de <mailto:hamburge_at_physik.uni-kl.de>> wrote:
>>
>> Update:
>>
>>     I am ONLY able to run both NAMD2.13 and NAMD3alpha7
>> netlrts-smp-CUDA versions with
>>
>>         +p2 +replicas 2, i.e. 1 core per replica.
>>
>> *    But as soon as I use more than 1 core per replica, it fails.*
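>>
>> For concreteness, following the command form from the 2.13 notes
>> quoted below (config file names are placeholders), the two cases look
>> roughly like:
>>
>>     # works: 1 core per replica
>>     charmrun ++local namd2 myconf_file.conf +p2 +replicas 2 +stdout logfile%d.log
>>
>>     # fails: e.g. 8 cores per replica
>>     charmrun ++local namd2 myconf_file.conf +p16 +replicas 2 +stdout logfile%d.log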
>>
>>
>> Anyone ever experienced that?
>>
>> Any hints are appreciated!
>>
>>
>> Kind regards
>>
>> René
>>
>>
>> On 11/23/2020 2:22 PM, René Hafner TUK wrote:
>>> Dear all,
>>>
>>>
>>>  I am trying to get an (e)ABF simulation running with the multi-copy
>>> algorithm on a multi-GPU node.
>>>
>>> I tried as described in
>>> http://www.ks.uiuc.edu/Research/namd/2.13/notes.html :
>>>
>>>         charmrun ++local namd2 myconf_file.conf +p16 +replicas 2 \
>>>             +stdout logfile%d.log
>>>
>>>
>>> I am using the precompiled binaries from the Download page: NAMD
>>> 2.13 Linux-x86_64-netlrts-smp-CUDA (Multi-copy algorithms,
>>> single process per copy)
>>>
>>> And for both NAMD2.13 and NAMD2.14 I get the error:
>>>
>>> FATAL ERROR: Number of devices (2) is not a multiple of number
>>> of processes (8).  Sharing devices between processes is inefficient
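>>>
>>> One sketch of a launch line that gets past this check is the
>>> +ignoresharing variant mentioned elsewhere in this thread (everything
>>> else as in the command from the notes above):
>>>
>>>         charmrun ++local namd2 myconf_file.conf +p16 +replicas 2 \
>>>             +ignoresharing +stdout logfile%d.log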
>>
> --
> Dipl.-Phys. René Hafner
> TU Kaiserslautern
> Germany
