Re: NAMD hangs with replica option

From: jing liang (jingliang2015_at_gmail.com)
Date: Wed Jan 12 2022 - 03:21:22 CST

Hi,

your suggestion of using a different name for the output files worked.
Thanks!

A follow-up question from this simulation: in a simulation with X replicas
one gets X PMFs; how do you combine all of them? Do you use NAMD (somehow)?
Or maybe just take the average with a simple bash script?
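For example, something like this (just a sketch of the averaging idea,
assuming the per-replica PMFs end up named meta.*.pmf on the same grid once
the output prefixes are different; the file names are only an example):

paste meta.*.pmf | awk '
    !/^#/ && NF {
        s = 0; n = 0
        for (i = 2; i <= NF; i += 2) { s += $i; n++ }   # even columns hold the free energies
        print $1, s / n                                  # column 1 is the colvar value
    }' > average.pmf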

Have a nice day!

On Mon, Jan 10, 2022 at 8:27 PM Giacomo Fiorin (<giacomo.fiorin_at_gmail.com>)
wrote:

> Hi Jing,
>
>
> On Mon, Jan 10, 2022 at 2:13 PM jing liang <jingliang2015_at_gmail.com>
> wrote:
>
>> Hi,
>>
>> thanks for your comments. outputName is set to "meta" only, without the
>> reference to the replica ID that you mentioned.
>>
>
> Please use a different outputName for each replica as suggested;
> otherwise the replicas will overwrite each other's output.
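>
> For example, a sketch based on the [myReplica] Tcl function that Josh
> mentioned (keeping your "meta" prefix):
>
> outputname meta.[myReplica]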
>
>
>> May I ask about the Tcl function you mentioned? Where could I find
>> its description? I get the following output files:
>>
>
>
> https://www.ks.uiuc.edu/Research/namd/2.14/ug/node9.html#SECTION00052300000000000000
>
>
>>
>> mymtd-replicas.txt
>> meta-distance.5.files.txt.BAK
>> meta-distance.5.files.txt
>> meta-distance.0.files.txt.BAK
>> meta-distance.0.files.txt
>> meta.xst.BAK
>> meta.restart.xsc.old
>> meta.restart.vel.old
>> meta.restart.coor.old
>> meta.restart.colvars.state.old
>> meta.restart.colvars.state
>> meta.pmf.BAK
>> meta.partial.pmf.BAK
>> meta.dcd.BAK
>> meta.colvars.traj.BAK
>> meta.colvars.traj
>> meta.colvars.state.old
>> meta.colvars.meta-distance.5.state
>> meta.colvars.meta-distance.5.hills.traj
>> meta.colvars.meta-distance.5.hills
>> meta.colvars.meta-distance.0.hills.traj
>> meta.xst
>> meta.restart.xsc
>> meta.restart.vel
>> meta.restart.coor
>> meta.pmf
>> meta.partial.pmf
>> meta.dcd
>> meta.colvars.state
>> meta.colvars.meta-distance.0.state
>> meta.colvars.meta-distance.0.hills
>>
>
> This is consistent with your setup: each of those files is being written
> over multiple times, but those that contain the replica ID are distinct
> (because Colvars detects the replica ID internally from NAMD when you
> launch NAMD with +replicas).
>
>
>> plus the NAMD log file, which contains the information about the replicas
>> I used here. Because I requested 8 replicas, I expected more output files.
>> The content of mymtd-replicas.txt (written by NAMD, not by me) is:
>>
>> 0 meta-distance.0.files.txt
>> 5 meta-distance.5.files.txt
>>
>> this tells me that somehow NAMD is setting up only 2 replicas, although I
>> requested 8: mpirun -np 112 namd2 +replicas 8 script.inp
>>
>
> Not quite: normally that list would be populated by the replicas, one by
> one. You ask for 8, but because the replicas all write at the same time
> *onto the same files*, they run into I/O errors, the simulation does not
> proceed smoothly, and the replicas never get to the registration step.
>
>
>>
>> The colvars config file contains the lines:
>>
>> metadynamics {
>>     name                       meta-distance
>>     colvars                    distance1
>>     hillWeight                 0.1
>>     newHillFrequency           1000
>>     writeHillsTrajectory       on
>>     hillWidth                  1.0
>>
>>     multipleReplicas           on
>>     replicasRegistry           mymtd-replicas.txt
>>     replicaUpdateFrequency     50000
>>     writePartialFreeEnergyFile on
>> }
>>
>> I am running on a parallel file system for HPC. Any comment will be
>> appreciated. Thanks again.
>>
>
> For now, the problem seems to be that the output prefix was not
> differentiated between replicas. If the problem persists after fixing that,
> please also report what kind of parallel file system you have (NFS, GPFS,
> Lustre, ...).
>
>
>>
>> On Mon, Jan 10, 2022 at 5:22 PM Giacomo Fiorin (<giacomo.fiorin_at_gmail.com>)
>> wrote:
>>
>>> Jing, you are probably already using different values for outputName if
>>> you are using multipleReplicas on (i.e. multiple walkers), but still,
>>> please confirm that that is what you are using.
>>>
>>> Note also that by using file-based communication the replicas don't need
>>> to be launched with the same command, but can also be run as independent
>>> jobs:
>>>
>>> https://colvars.github.io/colvars-refman-namd/colvars-refman-namd.html#sec:colvarbias_meta_mr
>>> In that framework, the main advantage of +replicas is that the value of
>>> replicaID is filled in automatically, so that your Colvars config file
>>> can be identical for all replicas.
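>>>
>>> As a sketch (not taken verbatim from the manual): when the replicas are
>>> launched as independent jobs, each job's Colvars config would instead set
>>> replicaID by hand, with everything else identical, e.g.:
>>>
>>> metadynamics {
>>>     ...
>>>     multipleReplicas on
>>>     replicaID        walker0              # unique label, set manually per job
>>>     replicasRegistry mymtd-replicas.txt   # on a path visible to all jobs
>>> }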
>>>
>>> If you are experiencing file I/O issues also when launching replicas
>>> independently (i.e. not with a single NAMD run with +replicas), can you
>>> find out what kind of filesystem you have on the compute nodes?
>>>
>>> Thanks
>>> Giacomo
>>>
>>>
>>>
>>> On Mon, Jan 10, 2022 at 9:37 AM Josh Vermaas <vermaasj_at_msu.edu> wrote:
>>>
>>>> There is definitely a bug in the 2.14 MPI version. One of my students
>>>> has noticed that anything that calls NAMD's die routine isn't taking down
>>>> all the replicas, so the jobs continue to burn resources until they
>>>> reach their wallclock limit.
>>>>
>>>> However, the key is figuring out *why* you are getting an error. I'm
>>>> less familiar with metadynamics, but at least for umbrella sampling, it
>>>> is pretty typical for each replica to write out its own set of files.
>>>> This is usually done with something like:
>>>>
>>>> outputname somename.[myReplica]
>>>>
>>>> Where [myReplica] is a Tcl function that evaluates to the replica ID for
>>>> each semi-independent simulation. For debugging purposes, it can be very
>>>> helpful for each replica to spit out its own log file. This is usually
>>>> done by setting the +stdout option on the command line.
>>>>
>>>> mpirun -np 28 namd2 +replicas 2 namd_metadynamics.inp +stdout outputlog.%d.log
>>>>
>>>> -Josh
>>>>
>>>> On 1/9/22 2:34 PM, jing liang wrote:
>>>> > Hi,
>>>> >
>>>> > I am running a metadynamics simulation with NAMD 2.14 MPI version.
>>>> > SLURM is being used for job scheduling; the way to run it with 2
>>>> > replicas on a 14-core node is as follows:
>>>> >
>>>> > mpirun -np 28 namd2 +replicas 2 namd_metadynamics.inp
>>>> >
>>>> > In fact, I have tried up to 8 replicas, and the resulting PMF looks
>>>> > very similar to what I obtain with other methods such as ABF. The
>>>> > problem is that by using the replicas option, the simulation hangs
>>>> > right at the end. I have looked at the output files, and it seems that
>>>> > right at the end NAMD wants to access some files (for example, *.xsc,
>>>> > *hills*, ...) that already exist, and NAMD throws an error.
>>>> >
>>>> > My guess is that this could be either a misunderstanding on my side
>>>> > about running NAMD with replicas, or a bug in the MPI version.
>>>> >
>>>> > Have you observed that issue previously? Any comment is welcome.
>>>> > Thanks
>>>> >
>>>>
>>>> --
>>>> Josh Vermaas
>>>>
>>>> vermaasj_at_msu.edu
>>>> Assistant Professor, Plant Research Laboratory and Biochemistry and
>>>> Molecular Biology
>>>> Michigan State University
>>>>
>>>> https://prl.natsci.msu.edu/people/faculty/josh-vermaas/
>>>>
>>>>
>>>>

This archive was generated by hypermail 2.1.6 : Tue Dec 13 2022 - 14:32:44 CST