Re: How to integrate multiple walkers 2D metadynamics results?

From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Fri Nov 15 2019 - 13:12:00 CST

Hi Sebastian, with 2.13 keep in mind that the PMF written will count the
local replica twice. To get the correct one, you can just download the
precompiled nightly build and run it for zero steps on one processor.

As for launching all replicas simultaneously, this will make I/O much more
fragile. Adding a "sleep 5s" command between launching two replicas could
help.
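
A minimal sketch of such a staggered launch, assuming bash and the same
input/log file naming as in your script. The helper name launch_staggered
is made up for illustration, and the 5-second delay is only a starting
point:

```shell
#!/bin/bash
# Hypothetical helper: start COUNT replicas one after another, pausing
# DELAY seconds between launches, then wait for all of them to finish.
# Usage: launch_staggered DELAY COUNT LAUNCHER [LAUNCHER_ARGS...]
launch_staggered() {
    local delay=$1 count=$2 i=1
    shift 2
    while [ "$i" -le "$count" ]; do
        # Same naming as in the original script: testres.repN.namd -> sN.0.log
        "$@" "testres.rep$i.namd" > "s$i.0.log" &
        # Pause before starting the next replica (not after the last one)
        if [ "$i" -lt "$count" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    wait
}

# Example call, equivalent to the ten mpirun lines with a pause in between:
# launch_staggered 5 10 mpirun -np 2 namd2
```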

If you can't use a more recent version of NAMD, consider increasing
replicaUpdateFrequency even further, but only that. I definitely did not
recommend setting all the output parameters to the same value... Writing
everything every 10000 steps stresses the file system needlessly.
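
If it helps, here is a rough sketch for changing only that keyword in
each replica's Colvars configuration file, leaving every other frequency
alone. The function name, the file names in the example call and the
value 50000 are all placeholders, not recommendations; the match also
assumes the keyword is spelled exactly as in your file and that the
number is the last thing on the line:

```shell
#!/bin/bash
# Hypothetical helper: rewrite the value of replicaUpdateFrequency, and
# nothing else, in each of the given Colvars configuration files.
# Usage: set_replica_update_frequency VALUE FILE [FILE...]
set_replica_update_frequency() {
    local value=$1 f
    shift
    for f in "$@"; do
        # Only lines whose first word is the keyword are modified; the
        # trailing number is replaced and the indentation is preserved.
        awk -v v="$value" '
            $1 == "replicaUpdateFrequency" { sub(/[0-9]+[ \t]*$/, v) }
            { print }
        ' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}

# Example call (file names are assumptions about your setup):
# set_replica_update_frequency 50000 colvars.rep1.in colvars.rep2.in
```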

Lastly, because you are running on 2 processors for each replica, why
don't you just download the "multicore" nightly build yourself and use
that? The "multicore" version is not MPI capable, but it can definitely
make efficient use of all the processors on each node. You just need to ask
the sysadmins for the best way to launch each replica (i.e. each copy of
NAMD) on a different node, and you won't need the added complication of
figuring out the correct MPI options.
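
As a sketch of what each per-node command could look like with the
multicore build: "+p" sets the number of cores used on the node, and no
mpirun is involved. The wrapper name and the NAMD_BIN variable are made
up for illustration (point NAMD_BIN at wherever you unpack the nightly
build), and how you get one such call onto each node remains specific to
your cluster:

```shell
#!/bin/bash
# Hypothetical wrapper: run one replica with a multicore NAMD build on
# the current node, using CORES cores.
# Usage: run_multicore_replica REPLICA_NUMBER CORES
run_multicore_replica() {
    local replica=$1 cores=$2
    # NAMD_BIN is an assumed variable; it defaults to namd2 found in PATH.
    "${NAMD_BIN:-namd2}" "+p$cores" "testres.rep$replica.namd" \
        > "s$replica.0.log"
}

# Example call for replica 3 on a 24-core node:
# run_multicore_replica 3 24
```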

When it comes to MPI, NAMD can be built to use it quite efficiently but it
also becomes tightly integrated with your cluster setup, the details of
which we can't help you with. But pretty much everyone on this list is
familiar with the "multicore" build, which runs over multiple processors of
a single node and is independent of MPI implementation or inter-node
network.

Giacomo

On Fri, Nov 15, 2019 at 1:41 PM Sebastian S <
thecromicusproductions_at_gmail.com> wrote:

> By the way, I'm using version 2.13, as the administrators of my network
> haven't installed the new one yet
>
> On Fri, Nov 15, 2019 at 1:34 PM Sebastian S <
> thecromicusproductions_at_gmail.com> wrote:
>
>> I tried in the same node and I'm getting the same errors. The funny thing
>> is that I can run 4 replicas without problems, but when I try 10 they start
>> failing
>>
>> module load namd
>> mpirun -np 2 namd2 testres.rep1.namd > s1.0.log &
>> mpirun -np 2 namd2 testres.rep2.namd > s2.0.log &
>> mpirun -np 2 namd2 testres.rep3.namd > s3.0.log &
>> mpirun -np 2 namd2 testres.rep4.namd > s4.0.log &
>> mpirun -np 2 namd2 testres.rep5.namd > s5.0.log &
>> mpirun -np 2 namd2 testres.rep6.namd > s6.0.log &
>> mpirun -np 2 namd2 testres.rep7.namd > s7.0.log &
>> mpirun -np 2 namd2 testres.rep8.namd > s8.0.log &
>> mpirun -np 2 namd2 testres.rep9.namd > s9.0.log &
>> mpirun -np 2 namd2 testres.rep10.namd > s10.0.log &
>> wait
>>
>>
>>
>> On Fri, Nov 15, 2019 at 12:54 PM Victor Kwan <vkwan8_at_uwo.ca> wrote:
>>
>>> Try running the 12 replicas on the same node to see if the problem
>>> relates to MPI?
>>>
>>> Victor
>>>
>>> On Fri, Nov 15, 2019 at 12:26 PM Canal de Sebassen <
>>> thecromicusproductions_at_gmail.com> wrote:
>>>
>>>> I have another question about these simulations. I started running some
>>>> yesterday and:
>>>>
>>>> 1) initially some walkers do not start at all. I get messages like
>>>> colvars: Metadynamics bias "metadynamics1": failed to read the file
>>>> "metadynamics1.rep1.files.txt": will try again after 10000 steps.
>>>> and in the same step the walker reads the other replicas and ends with
>>>> colvars: Metadynamics bias "metadynamics1": reading the state of
>>>> replica "rep1" from file "".
>>>> colvars: Error: in reading state configuration for "metadynamics" bias
>>>> "metadynamics1" at position -1 in stream.
>>>>
>>>> 2) others run for a while but then give me the message
>>>> colvars: Error: in reading state configuration for "metadynamics" bias
>>>> "metadynamics1" at position -1 in stream.
>>>> FATAL ERROR: Error in the collective variables module: exiting.
>>>>
>>>> 3) in the end, only 3 walkers keep working; the other 9 that I
>>>> submitted are dead. I'm running these simulations on my local cluster
>>>> with the following script
>>>>
>>>> #!/bin/bash
>>>> #$ -pe mpi-24 288 # Specify parallel environment and legal core size
>>>> #$ -q long # Specify queue
>>>> #$ -N Trial1 # Specify job name
>>>>
>>>> TASK=0
>>>> cat $PE_HOSTFILE | while read -r line; do
>>>> host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
>>>> echo $host >> hostfile
>>>> done
>>>> hostfile="./hostfile"
>>>> while IFS= read -r host
>>>> do
>>>> let "TASK+=1"
>>>> /usr/kerberos/bin/rsh -F $host -n "uname -a; echo $TASK; cd
>>>> XXXXXXXXX; pwd; module load namd; mpirun -np 24 namd2
>>>> testres.rep$TASK.namd > s$TASK.0.log ; exit" &
>>>> done < $hostfile
>>>> wait
>>>> rm ./hostfile
>>>>
>>>> Am I doing something wrong? Currently the frequencies in my colvars
>>>> configuration are
>>>> colvarsTrajFrequency 10000
>>>> metadynamics {
>>>> colvars d1 d2
>>>>
>>>> useGrids on
>>>> hillWeight 0.05
>>>> newHillFrequency 10000
>>>> dumpFreeEnergyFile on
>>>> dumpPartialFreeEnergyFile on
>>>> saveFreeEnergyFile on
>>>> writeHillsTrajectory on
>>>>
>>>> multipleReplicas yes
>>>> replicaID rep9
>>>> replicasRegistry replicas.registry.txt
>>>> replicaUpdateFrequency 10000
>>>>
>>>>
>>>> and my namd outputs are
>>>> numSteps 25000000
>>>> outputEnergies 10000
>>>> outputPressure 10000
>>>> outputTiming 10000
>>>> xstFreq 10000
>>>> dcdFreq 10000
>>>> restartFreq 10000
>>>>
>>>> Thanks,
>>>>
>>>> Sebastian
>>>>
>>>> On Sat, Nov 9, 2019 at 8:03 PM Canal de Sebassen <
>>>> thecromicusproductions_at_gmail.com> wrote:
>>>>
>>>>> Thanks for your reply, Giacomo. I'll take your suggestions into
>>>>> consideration when setting up the system.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Sebastian
>>>>>
>>>>> On Thu, Nov 7, 2019 at 6:37 PM Giacomo Fiorin <
>>>>> giacomo.fiorin_at_gmail.com> wrote:
>>>>>
>>>>>> Hi Canal, first of all try upgrading to the latest NAMD nightly
>>>>>> build. Thanks to Jim's help, I added extra checks that make the
>>>>>> input/output functionality more robust (the same checks are used when
>>>>>> writing the NAMD restart files):
>>>>>> https://github.com/Colvars/colvars/pull/276
>>>>>> There is also an important bugfix in the output of the PMF (the
>>>>>> restart files are fine):
>>>>>> https://github.com/Colvars/colvars/pull/259
>>>>>>
>>>>>> About the exchange rate, on modern hardware optimal performance is
>>>>>> around a few milliseconds/step, so 1000 steps is rather short for a full
>>>>>> cycle with all replicas reading each other's files. Best to increase it by
>>>>>> a factor of 10 or more: I would have made its default value the same as the
>>>>>> restart frequency, but there is no telling how long that would be for each
>>>>>> user's input.
>>>>>>
>>>>>> Regarding the PMFs, nothing special is needed. Each replica will
>>>>>> write PMFs with the same contents (the PMF extracted from the shared bias),
>>>>>> so they will be equal minus the fluctuations arising from synchronization.
>>>>>> You are probably confused by the partial output files, which are triggered
>>>>>> by dumpPartialFreeEnergyFile (a flag that is off by default).
>>>>>>
>>>>>> Lastly, Gaussians 0.01 kcal/mol high added every 100 steps is quite a
>>>>>> bit of bias, and will be further multiplied by the number of replicas.
>>>>>>
>>>>>> Giacomo
>>>>>>
>>>>>> On Thu, Nov 7, 2019 at 6:06 PM Canal de Sebassen <
>>>>>> thecromicusproductions_at_gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Say I run a metadynamics simulation with 10 walkers. I then get 10
>>>>>>> different PMF files. If my simulation was in 2D, how do I get a single
>>>>>>> energy landscape? Do I use abf_integrate?
>>>>>>>
>>>>>>> Also, what are some good practices when running these kinds of
>>>>>>> simulations? I haven't found many examples. This is one of my current
>>>>>>> colvars files. I plan to get about 1-5 microseconds of data. Is a
>>>>>>> replicaUpdateFrequency of 1000 too large? I tried with a smaller one but
>>>>>>> I get problems because some files of a replica cannot be found by another
>>>>>>> one (maybe due to lagging?).
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Sebastian
>>>>>>>
>>>>>>> colvarsTrajFrequency 100
>>>>>>>
>>>>>>> colvar {
>>>>>>>
>>>>>>> name d1
>>>>>>>
>>>>>>> outputAppliedForce on
>>>>>>> width 0.5
>>>>>>>
>>>>>>> lowerBoundary 0.0
>>>>>>> upperBoundary 30.0
>>>>>>>
>>>>>>> upperWallConstant 100.0
>>>>>>>
>>>>>>> distanceZ {
>>>>>>> forceNoPBC yes
>>>>>>> main {
>>>>>>> atomsFile labels.pdb
>>>>>>> atomsCol B
>>>>>>> atomsColValue 1.0
>>>>>>> }
>>>>>>> ref {
>>>>>>> atomsFile labels.pdb
>>>>>>> atomsCol B
>>>>>>> atomsColValue 2.0
>>>>>>> }
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> colvar {
>>>>>>>
>>>>>>> name d2
>>>>>>>
>>>>>>> outputAppliedForce on
>>>>>>> width 1
>>>>>>>
>>>>>>> lowerBoundary 0.0
>>>>>>> upperBoundary 10.0
>>>>>>>
>>>>>>> upperWallConstant 100.0
>>>>>>>
>>>>>>> coordNum {
>>>>>>> cutoff 4.0
>>>>>>>
>>>>>>>
>>>>>>> group1 {
>>>>>>> atomsFile labels.pdb
>>>>>>> atomsCol O
>>>>>>> atomsColValue 1.0
>>>>>>> }
>>>>>>> group2 {
>>>>>>> atomsFile labels.pdb
>>>>>>> atomsCol B
>>>>>>> atomsColValue 2.0
>>>>>>> }
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> metadynamics {
>>>>>>> colvars d1 d2
>>>>>>>
>>>>>>> useGrids on
>>>>>>> hillWeight 0.01
>>>>>>> newHillFrequency 100
>>>>>>> dumpFreeEnergyFile on
>>>>>>> dumpPartialFreeEnergyFile on
>>>>>>> saveFreeEnergyFile on
>>>>>>> writeHillsTrajectory on
>>>>>>>
>>>>>>> multipleReplicas yes
>>>>>>> replicaID rep1
>>>>>>> replicasRegistry replicas.registry.txt
>>>>>>> replicaUpdateFrequency 1000
>>>>>>> }
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Giacomo Fiorin
>>>>>> Associate Professor of Research, Temple University, Philadelphia, PA
>>>>>> Research collaborator, National Institutes of Health, Bethesda, MD
>>>>>> http://goo.gl/Q3TBQU
>>>>>> https://github.com/giacomofiorin
>>>>>>
>>>>>

-- 
Giacomo Fiorin
Associate Professor of Research, Temple University, Philadelphia, PA
Research collaborator, National Institutes of Health, Bethesda, MD
http://goo.gl/Q3TBQU
https://github.com/giacomofiorin
