Re: not getting NAMD multicopy simulation started

From: René Hafner TUK (hamburge_at_physik.uni-kl.de)
Date: Wed Nov 25 2020 - 18:20:18 CST

Hi Josh,

    it seems it's working now with a self-compiled version of NAMD 2.14
    (Linux-x86_64-icc-netlrts smp cuda),

    using charmrun to start it together with the ++ppn flag,

    as indicated in a previous post by Jeffrey Comer:
    http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2016-2017/1721.html

    My script is:

        #SBATCH --mem=40g
        #SBATCH --partition=gpuidle
        #SBATCH -t 0-01:00:00
        #SBATCH --nodes=1
        #SBATCH --ntasks-per-node=24 # multicore or normal charmrun
        #SBATCH --nodelist=gpu013

        script_path=$(pwd)
        conf_file=namd_SimID0145_abf_molDIC_siteID1_run2_copy7.window3.conf
        log_path=$script_path
        log_file=log_SimID0145_abf_molDIC_siteID1_run2_copy7.window3.replica%d.log

        ElweNAMD214selfcompiledCudaMulticopy="/home/hamburge/software/NAMD_2.14_Source/Linux-x86_64-icc-netlrts"

        cd $ElweNAMD214selfcompiledCudaMulticopy

        module load nvidia/10.0
        module load intel/2018
        ./charmrun ++verbose ++local ./namd2 ++ppn 6 +p24 \
            +replicas 4 $script_path/$conf_file +stdout $log_path/$log_file

        # the self-compiled version also works without any further flags
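
    The general pattern behind that command (only a sketch for other
    sizes; the rule of thumb is +p = ++ppn x number of replicas, with
    paths and files as in the script above):

        # Sketch: R replicas with C worker threads each on one node.
        # Assumes the self-compiled netlrts-smp-CUDA build used above.
        REPLICAS=4            # value passed to +replicas
        CORES_PER_REPLICA=6   # value passed to ++ppn
        TOTAL=$((REPLICAS * CORES_PER_REPLICA))   # value passed to +p

        ./charmrun ++verbose ++local ./namd2 ++ppn $CORES_PER_REPLICA \
            +p$TOTAL +replicas $REPLICAS $script_path/$conf_file \
            +stdout $log_path/$log_file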

    With the precompiled NAMD3.alpha7 (using the same startup method as
    above) the simulation does start,

    but then a FATAL error occurs in the colvars module, which complains
    that more than one replica is required (the replicas are apparently
    not detected by colvars, for whatever reason).

        colvars: Collective variables initialized, 12 in total.
        colvars:
        ----------------------------------------------------------------------
        colvars:   Initializing a new "abf" instance.
        colvars:   # name = "abf1" [default]
        colvars:   # colvars = { cvdistz }
        colvars:   # outputEnergy = on
        colvars:   # timeStepFactor = 1 [default]
        colvars:   # applyBias = on [default]
        colvars:   # updateBias = on [default]
        colvars:   # hideJacobian = off [default]
        colvars:   Jacobian (geometric) forces will be included in
        reported free energy gradients.
        colvars:   # fullSamples = 5000
        colvars:   # inputPrefix =  [default]
        colvars:   # outputFreq = 10000
        colvars:   # historyFreq = 10000
        colvars:   # shared = on
        colvars: Error: shared ABF requires more than one replica.
        FATAL ERROR: Error in the collective variables module (see above
        for details)
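
    (For context: the abf block behind that output looks roughly like the
    sketch below, reconstructed from the settings echoed in the log; it is
    the "shared" keyword that makes colvars demand more than one replica.)

        abf {
            colvars       cvdistz
            fullSamples   5000
            outputFreq    10000
            historyFreq   10000
            outputEnergy  on
            # shared (multiple-walker) ABF needs the run to be started
            # with +replicas > 1, which colvars does not see here
            shared        on
        }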

    Therefore I will try compiling NAMD3 too, tomorrow.
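
    For anyone who wants to self-compile such a build, a rough sketch of
    the usual recipe from the NAMD release notes (the Charm++/CUDA
    versions, the CUDA prefix and the build directory name here are
    assumptions and may differ from what I actually used):

        # Sketch of a netlrts-smp-CUDA build of NAMD 2.14 with icc.
        tar xzf NAMD_2.14_Source.tar.gz && cd NAMD_2.14_Source
        tar xf charm-6.10.2.tar && cd charm-6.10.2
        ./build charm++ netlrts-linux-x86_64 icc smp --with-production
        cd ..
        # TCL and FFTW must be available as well (e.g. the precompiled
        # libraries from the NAMD download page).
        ./config Linux-x86_64-icc --charm-arch netlrts-linux-x86_64-smp \
            --with-cuda --cuda-prefix /opt/cuda/10.0   # hypothetical CUDA path
        cd Linux-x86_64-icc && make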

I appreciate your help!

Kind regards

René

On 11/25/2020 8:22 PM, Josh Vermaas wrote:
> Segfaults might be because the verbs build depends on a library that
> might not be installed on your system. Are there any error messages
> that come first? Otherwise, the netlrts version with charmrun should
> be similar in behavior, even if it doesn't integrate nicely with srun.
>
> -Josh
>
> On 11/25/20 10:35 AM, René Hafner TUK wrote:
>>
>> Dear Joshua,
>>
>> I tried the same as you showed below on two different clusters at my
>> disposal.
>>
>> In both cases I get a segfault with a precompiled version of
>> NAMD_2.14_Linux-x86_64-verbs-smp-CUDA.
>>
>> I will give it a try with a self-compiled version.
>>
>>
>> See my slurm submission script below:
>>
>> """
>>
>> #!/bin/sh
>> #SBATCH --job-name=SimID0145_w3_molDIC_s1_run2_copy4_testntasks
>> #SBATCH --mem=40g
>> #SBATCH --partition=gpu
>> #timing #sSBATCH -t [min] OR -t [days-hh:mm:ss]
>> #SBATCH -t 0-01:00:00
>> #sending mail
>> # mail alert at start, end and abortion of execution
>> #SBATCH --mail-type=ALL
>> #output file
>> #SBATCH -o slurmoutput/JOBID_%j.out
>> #errorfile
>> #SBATCH -e slurmoutput/JOBID_%j.err
>> #SBATCH --nodes=1
>> #SBATCH --ntasks=3
>> #SBATCH --cpus-per-task=8
>> #SBATCH --gres=gpu:t4:3
>> #SBATCH --exclusive
>> #SBATCH --nodelist=node212
>>
>>
>> script_path=$(pwd)
>> conf_file=namd_SimID0145_abf_molDIC_siteID1_run2_copy4.window3.conf
>> log_path=$script_path
>> log_file=log_SimID0145_abf_molDIC_siteID1_run2_copy4.window3.replica%d.log
>>
>> NAMD214pathCudaMultiCopy="/p/opt/BioSorb/hiwi/software/namd/namd_binaries_benchmark/NAMD_2.14_Linux-x86_64-verbs-smp-CUDA"
>>
>> srun $NAMD214pathCudaMultiCopy/namd2 +ppn 8 +replicas 3 \
>>     $script_path/$conf_file +ignoresharing +stdout $log_path/$log_file
>>
>> """
>>
>>
>>
>> On 11/25/2020 5:11 PM, Josh Vermaas wrote:
>>> Hi Rene,
>>>
>>> The expedient thing to do is usually just to go with +ignoresharing.
>>> It *should* also be possible for this to work if +ppn is set
>>> correctly. This is a runscript that I've used in a slurm environment
>>> to correctly map GPUs on a 2 socket 4-GPU system, where I was
>>> oversubscribing the GPUs (64 replicas, only 32 GPUs):
>>>
>>> #!/bin/bash
>>> #SBATCH --gres=gpu:4
>>> #SBATCH --nodes=8
>>> #SBATCH --ntasks=64
>>> #SBATCH --cpus-per-task=6
>>> #SBATCH --gpu-bind=closest
>>> #SBATCH --time=4:0:0
>>>
>>> set -x
>>> module load gompi/2020a CUDA
>>>
>>> cd $SLURM_SUBMIT_DIR
>>>
>>> # This isn't obvious, but this is a Linux-x86_64-ucx-smp-CUDA build
>>> # compiled from source.
>>> srun $HOME/NAMD_2.14_Source/Linux-x86_64-g++/namd2 \
>>>     +ppn 6 +replicas 64 run0.namd +stdout %d/run0.%d.log
>>>
>>> It worked out that each replica had 6 dedicated cores, which is where
>>> the +ppn 6 came from. Thus, even though each replica saw multiple
>>> GPUs (gpu-bind closest meant that each replica saw the 2 GPUs closest
>>> to the CPU its 6 cores came from, rather than all 4 on the node), I
>>> didn't need to specify devices or +ignoresharing.
>>>
>>>
>>> Hope this helps!
>>>
>>>
>>> -Josh
>>>
>>>
>>>
>>> On Wed, Nov 25, 2020 at 6:47 AM René Hafner TUK
>>> <hamburge_at_physik.uni-kl.de> wrote:
>>>
>>> Update:
>>>
>>>     I am ONLY able to run both NAMD2.13 and NAMD3alpha7
>>> netlrts-smp-CUDA versions with
>>>
>>>         +p2 +replicas 2, i.e. 1 core per replica.
>>>
>>>     But as soon as I use more than 1 core per replica, it fails.
>>>
>>>
>>> Has anyone ever experienced that?
>>>
>>> Any hints are appreciated!
>>>
>>>
>>> Kind regards
>>>
>>> René
>>>
>>>
>>> On 11/23/2020 2:22 PM, René Hafner TUK wrote:
>>>> Dear all,
>>>>
>>>>
>>>> I am trying to get an (e)ABF simulation running with the
>>>> multi-copy algorithm on a multi-GPU node.
>>>>
>>>> I tried as described in
>>>> http://www.ks.uiuc.edu/Research/namd/2.13/notes.html :
>>>>
>>>>         charmrun ++local namd2 myconf_file.conf +p16 +replicas 2 +stdout logfile%d.log
>>>>
>>>>
>>>> I am using the precompiled binaries from the Download page:
>>>> NAMD 2.13 Linux-x86_64-netlrts-smp-CUDA (Multi-copy algorithms,
>>>> single process per copy)
>>>>
>>>> And for both NAMD2.13 and NAMD2.14 I get the error:
>>>>
>>>> FATAL ERROR: Number of devices (2) is not a multiple of number
>>>> of processes (8).  Sharing devices between processes is inefficient
>>>
>> --
>> --
>> Dipl.-Phys. René Hafner
>> TU Kaiserslautern
>> Germany
>

-- 
--
Dipl.-Phys. René Hafner
TU Kaiserslautern
Germany
