Re: NAMD hangs with replica option

From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Fri Feb 11 2022 - 07:22:50 CST

Hi there.

Using many replicas for multiple-walker metadynamics makes sense only as
long as it spreads out the bias across the replicas, i.e. like taking the
average of multiple individual PMFs but with the added benefit of sharing
the "learning" of the landscape across replicas.

But you do need to apportion the bias across replicas for that to happen:
adding replicas while keeping the same hillWeight, which also happens to be
rather high in your case, will just pump energy into the system at an
increasing rate, probably faster than the replicas can relax and
decorrelate from each other. Unless you can afford the same simulation
time *per replica*, stick with fewer but longer simulations. A single
microsecond-long simulation can give you fair results, but one million
picosecond-long simulations are *guaranteed* to give you garbage.
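
As a rough sketch only (the numbers below are placeholders, not a
recommendation for your system), one way to apportion the bias with N
walkers is to divide the single-walker hillWeight by N, or equivalently to
deposit hills less often per replica:

metadynamics {
  name                    meta-distance
  colvars                 distance1
  hillWeight              0.0005      # roughly (single-walker hillWeight) / N
  newHillFrequency        1000
  hillWidth               1.0
  multipleReplicas        on
  replicasRegistry        myrep.txt
  replicaUpdateFrequency  50000
}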

Please consider carefully whether using well-tempered is a requirement in
this case. It is meant to smooth out the PMF as simulation time
progresses, but with many replicas you're effectively doing a lot of
averaging already. Two colleagues of mine with more experience in
metadynamics never use well-tempered, and prefer to just average the PMFs
saved at different times after all the basins have been filled. I do see
the value of having an averaging/smoothing method built into the
simulation, but without a fair expectation of the PMF's shape and of the
correlation times, its usefulness is limited.
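
If you take that route, a minimal sketch for averaging 1D PMF snapshots on
identical grids (the file names below are just placeholders) could be:

paste meta.t1.pmf meta.t2.pmf meta.t3.pmf | \
  awk '!/^#/ && NF { n = NF/2; s = 0; for (i = 2; i <= NF; i += 2) s += $i; print $1, s/n }' > meta.avg.pmf

This simply averages the free-energy column across snapshots, skipping
comment and blank lines.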

Giacomo

On Fri, Feb 11, 2022 at 4:21 AM jing liang <jingliang2015_at_gmail.com> wrote:

> Hi,
>
> The replica simulation seems to be working fine now with all your useful
> comments. I explored the possibility of having more than 200 replicas.
> The simulation finished, but the resulting PMF looks worse than with only
> 4 replicas. The simulation ran for 50000000 steps. The part of the Colvars
> input file for metadynamics looks like:
>
> hillWeight 0.1
> newHillFrequency 1000
> writeHillsTrajectory on
> hillwidth 1.0
>
> multipleReplicas on
> replicasRegistry myrep.txt
> replicaUpdateFrequency 50000
> writePartialFreeEnergyFile on
>
> Is there any recommendation for when the number of replicas is larger
> than 4-8? Also, I noticed that the well-tempered metadynamics became
> unstable, as the simulation crashed with replicas. Thanks in advance.
>
> On Wed, Jan 12, 2022 at 3:32 PM, Giacomo Fiorin (<giacomo.fiorin_at_gmail.com>)
> wrote:
>
>> Nope, Colvars already combines them for you into a single PMF, which gets
>> written by all replicas:
>>
>> https://colvars.github.io/colvars-refman-namd/colvars-refman-namd.html#sec:colvarbias_meta_mr
>> each according to its own "outputName" prefix, and the contents will be
>> the same, apart from small deviations between synchronizations.
>>
>> If you need to analyze the contributions of each replica, you can use
>> "writePartialFreeEnergyFile on", as you have.
>>
>> Giacomo
>>
>> On Wed, Jan 12, 2022 at 4:21 AM jing liang <jingliang2015_at_gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> your suggestion of using a different name for the output files worked.
>>> Thanks!
>>>
>>> A question arising from this simulation: in a run with X replicas one
>>> gets X PMFs; how do you combine all of them? Do you use NAMD (somehow)?
>>> Or maybe just take the average with a simple bash script?
>>>
>>> Have a nice day!
>>>
>>>
>>> On Mon, Jan 10, 2022 at 8:27 PM, Giacomo Fiorin (<
>>> giacomo.fiorin_at_gmail.com>) wrote:
>>>
>>>> Hi Jing,
>>>>
>>>>
>>>> On Mon, Jan 10, 2022 at 2:13 PM jing liang <jingliang2015_at_gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> thanks for your comments, outputName is set to just "meta", without the
>>>>> reference to the replica ID that you mentioned.
>>>>>
>>>>
>>>> Please make outputName different for each replica as suggested;
>>>> otherwise they'll overwrite each other's output.
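>>>>
>>>> For example (just an illustration, adapting Josh's suggestion below),
>>>> in your NAMD config you could set something like:
>>>>
>>>> outputName meta.[myReplica]
>>>>
>>>> so that each replica writes out meta.0.*, meta.1.*, and so on.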
>>>>
>>>>
>>>>> May I ask about the Tcl function you mentioned? Where could I find
>>>>> its description? I get the following output files:
>>>>>
>>>>
>>>>
>>>> https://www.ks.uiuc.edu/Research/namd/2.14/ug/node9.html#SECTION00052300000000000000
>>>>
>>>>
>>>>>
>>>>> mymtd-replicas.txt
>>>>> meta-distance.5.files.txt.BAK
>>>>> meta-distance.5.files.txt
>>>>> meta-distance.0.files.txt.BAK
>>>>> meta-distance.0.files.txt
>>>>> meta.xst.BAK
>>>>> meta.restart.xsc.old
>>>>> meta.restart.vel.old
>>>>> meta.restart.coor.old
>>>>> meta.restart.colvars.state.old
>>>>> meta.restart.colvars.state
>>>>> meta.pmf.BAK
>>>>> meta.partial.pmf.BAK
>>>>> meta.dcd.BAK
>>>>> meta.colvars.traj.BAK
>>>>> meta.colvars.traj
>>>>> meta.colvars.state.old
>>>>> meta.colvars.meta-distance.5.state
>>>>> meta.colvars.meta-distance.5.hills.traj
>>>>> meta.colvars.meta-distance.5.hills
>>>>> meta.colvars.meta-distance.0.hills.traj
>>>>> meta.xst
>>>>> meta.restart.xsc
>>>>> meta.restart.vel
>>>>> meta.restart.coor
>>>>> meta.pmf
>>>>> meta.partial.pmf
>>>>> meta.dcd
>>>>> meta.colvars.state
>>>>> meta.colvars.meta-distance.0.state
>>>>> meta.colvars.meta-distance.0.hills
>>>>>
>>>>
>>>> This is consistent with your setup: each of those files is being
>>>> written over multiple times, but those that contain the replica ID are
>>>> different (because Colvars detects the replica ID internally from NAMD
>>>> when you launch NAMD with +replicas).
>>>>
>>>>
>>>>> plus the NAMD log file, which contains the information about the
>>>>> replicas I used here. Because I requested 8 replicas, I expected more
>>>>> output files. The content of mymtd-replicas.txt (written by NAMD, not
>>>>> by me) is:
>>>>>
>>>>> 0 meta-distance.0.files.txt
>>>>> 5 meta-distance.5.files.txt
>>>>>
>>>>> this tells me that somehow NAMD is setting up only 2 replicas although
>>>>> I requested 8: mpirun -np 112 namd2 +replicas 8 script.inp
>>>>>
>>>>
>>>> Not quite: normally that list would be populated by the replicas, one
>>>> by one. You ask for 8, but because the replicas all write *onto the
>>>> same files* at the same time, they end up with I/O errors, the
>>>> simulation doesn't proceed smoothly, and the replicas never get to the
>>>> registration step.
>>>>
>>>>
>>>>>
>>>>> The colvars config file contains the lines:
>>>>>
>>>>> metadynamics {
>>>>> name meta-distance
>>>>> colvars distance1
>>>>> hillWeight 0.1
>>>>> newHillFrequency 1000
>>>>> writeHillsTrajectory on
>>>>> hillwidth 1.0
>>>>>
>>>>> multipleReplicas on
>>>>> replicasRegistry mymtd-replicas.txt
>>>>> replicaUpdateFrequency 50000
>>>>> writePartialFreeEnergyFile on
>>>>> }
>>>>>
>>>>> I am running on a parallel file system for HPC. Any comment will be
>>>>> appreciated. Thanks again.
>>>>>
>>>>
>>>> For now, the problem seems to be that the output prefix was not
>>>> differentiated between replicas. If the problem persists after fixing
>>>> that, please also report what kind of parallel file system you have
>>>> (NFS, GPFS, Lustre, ...).
>>>>
>>>>
>>>>>
>>>>> On Mon, Jan 10, 2022 at 5:22 PM, Giacomo Fiorin (<
>>>>> giacomo.fiorin_at_gmail.com>) wrote:
>>>>>
>>>>>> Jing, you're probably using different values for outputName if you're
>>>>>> using multipleReplicas on (i.e. multiple walkers), but still, please
>>>>>> confirm that that's what you are using.
>>>>>>
>>>>>> Note also that by using file-based communication the replicas don't
>>>>>> need to be launched with the same command, but can also be run as
>>>>>> independent jobs:
>>>>>>
>>>>>> https://colvars.github.io/colvars-refman-namd/colvars-refman-namd.html#sec:colvarbias_meta_mr
>>>>>> In that framework, the main advantage of +replicas is that the value
>>>>>> of replicaID is filled in automatically, so that your Colvars config
>>>>>> file can be identical for all replicas.
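>>>>>>
>>>>>> For example (just a sketch), when launching the replicas as
>>>>>> independent jobs you would set the ID by hand in each walker's
>>>>>> Colvars config:
>>>>>>
>>>>>> multipleReplicas   on
>>>>>> replicaID          walker1        # a unique string for each job
>>>>>> replicasRegistry   mymtd-replicas.txt
>>>>>>
>>>>>> whereas with +replicas the replicaID line can be omitted.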
>>>>>>
>>>>>> If you are experiencing file I/O issues also when launching replicas
>>>>>> independently (i.e. not with a single NAMD run with +replicas), can you
>>>>>> find out what kind of filesystem you have on the compute nodes?
>>>>>>
>>>>>> Thanks
>>>>>> Giacomo
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 10, 2022 at 9:37 AM Josh Vermaas <vermaasj_at_msu.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> There is definitely a bug in the 2.14 MPI version. One of my
>>>>>>> students has noticed that anything that calls NAMD die isn't taking
>>>>>>> down all the replicas, and so the jobs will continue to burn
>>>>>>> resources until they reach their wallclock limit.
>>>>>>>
>>>>>>> However, the key is figuring out *why* you are getting an error.
>>>>>>> I'm less familiar with metadynamics, but at least for umbrella
>>>>>>> sampling, it is pretty typical for each replica to write out its own
>>>>>>> set of files.
>>>>>>> This is usually done with something like:
>>>>>>>
>>>>>>> outputname somename.[myReplica]
>>>>>>>
>>>>>>> Where [myReplica] is a Tcl function that evaluates to the replica
>>>>>>> ID for each semi-independent simulation. For debugging purposes, it
>>>>>>> can be very helpful for each replica to spit out its own log file.
>>>>>>> This is usually done by setting the +stdout option on the command
>>>>>>> line.
>>>>>>>
>>>>>>> mpirun -np 28 namd2 +replicas 2 namd_metadynamics.inp +stdout outputlog.%d.log
>>>>>>>
>>>>>>> -Josh
>>>>>>>
>>>>>>> On 1/9/22 2:34 PM, jing liang wrote:
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > I am running a metadynamics simulation with the NAMD 2.14 MPI
>>>>>>> > version. SLURM is being used for job scheduling; the way to run it
>>>>>>> > using 2 replicas on a 14-core node is as follows:
>>>>>>> >
>>>>>>> > mpirun -np 28 namd2 +replicas 2 namd_metadynamics.inp
>>>>>>> >
>>>>>>> > In fact, I have tried up to 8 replicas and the resulting PMF looks
>>>>>>> > very similar to what I obtain with other methods such as ABF. The
>>>>>>> > problem is that by using the replicas option, the simulation hangs
>>>>>>> > right at the end. I have looked at the output files and it seems
>>>>>>> > that right at the end NAMD wants to access some files (for example,
>>>>>>> > *.xsc, *hills*, ...) that already exist, and NAMD throws an error.
>>>>>>> >
>>>>>>> > My guess is that this could be either a misunderstanding on my side
>>>>>>> > in running NAMD with replicas or a bug in the MPI version.
>>>>>>> >
>>>>>>> > Have you observed this issue previously? Any comment is welcome.
>>>>>>> > Thanks
>>>>>>> >
>>>>>>>
>>>>>>> --
>>>>>>> Josh Vermaas
>>>>>>>
>>>>>>> vermaasj_at_msu.edu
>>>>>>> Assistant Professor, Plant Research Laboratory and Biochemistry and
>>>>>>> Molecular Biology
>>>>>>> Michigan State University
>>>>>>>
>>>>>>> https://prl.natsci.msu.edu/people/faculty/josh-vermaas/
>>>>>>>
>>>>>>>
>>>>>>>
