From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Mon Jan 10 2022 - 13:26:53 CST
Hi Jing,
On Mon, Jan 10, 2022 at 2:13 PM jing liang <jingliang2015_at_gmail.com> wrote:
> Hi,
>
> thanks for your comments, outputname is set to "meta" only without a
> reference to replicas that you mentioned.
>
Please make outputName different for each replica as suggested,
otherwise they'll overwrite each other's output.
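As a minimal sketch of what that could look like in the NAMD configuration file (assuming the job is launched with +replicas, so that the Tcl command myReplica is defined; the variable name "rep" is only illustrative and "meta" is the prefix you already use):

```tcl
# NAMD configuration fragment (Tcl syntax).
# Assumes NAMD was started with "+replicas N", which makes myReplica available.
set rep [myReplica]      ;# integer replica index, 0 .. N-1

# Give each replica its own output prefix: meta.0, meta.1, ...
# so that no two replicas write to the same files.
outputName meta.$rep
```

With this, each replica would write its own meta.<ID>.dcd, meta.<ID>.xsc, etc., instead of all replicas sharing the single "meta" prefix.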
> May I ask you about the tcl function you mentioned, where could I find
> its description? I get the following output files:
>
https://www.ks.uiuc.edu/Research/namd/2.14/ug/node9.html#SECTION00052300000000000000
>
> mymtd-replicas.txt
> meta-distance.5.files.txt.BAK
> meta-distance.5.files.txt
> meta-distance.0.files.txt.BAK
> meta-distance.0.files.txt
> meta.xst.BAK
> meta.restart.xsc.old
> meta.restart.vel.old
> meta.restart.coor.old
> meta.restart.colvars.state.old
> meta.restart.colvars.state
> meta.pmf.BAK
> meta.partial.pmf.BAK
> meta.dcd.BAK
> meta.colvars.traj.BAK
> meta.colvars.traj
> meta.colvars.state.old
> meta.colvars.meta-distance.5.state
> meta.colvars.meta-distance.5.hills.traj
> meta.colvars.meta-distance.5.hills
> meta.colvars.meta-distance.0.hills.traj
> meta.xst
> meta.restart.xsc
> meta.restart.vel
> meta.restart.coor
> meta.pmf
> meta.partial.pmf
> meta.dcd
> meta.colvars.state
> meta.colvars.meta-distance.0.state
> meta.colvars.meta-distance.0.hills
>
This is consistent with your setup: each of those files is being overwritten
multiple times, except those that contain the replica ID, which are distinct
(because Colvars detects the replica ID internally from NAMD when you
launch NAMD with +replicas).
> plus the log file of NAMD which contains the information of the replicas I
> used here. Because I requested 8 replicas I expected more output files. The
> content of mymtd-replicas.txt (written by NAMD not by me) is:
>
> 0 meta-distance.0.files.txt
> 5 meta-distance.5.files.txt
>
> this tells me that somehow NAMD is setting 2 replicas although I requested
> 8: mpirun -np 112 namd2 +replicas 8 script.inp
>
Not quite: normally that list would be populated by the replicas, one by
one. You asked for 8, but because the replicas all write at the same
time *onto the same files*, they run into I/O errors, the simulation
does not proceed smoothly, and some replicas never reach the
registration step.
>
> The colvars config file contains the lines:
>
> metadynamics {
> name meta-distance
> colvars distance1
> hillWeight 0.1
> newHillFrequency 1000
> writeHillsTrajectory on
> hillwidth 1.0
>
> multipleReplicas on
> replicasRegistry mymtd-replicas.txt
> replicaUpdateFrequency 50000
> writePartialFreeEnergyFile on
> }
>
> I am running on a parallel file system for hpc. Any comment will be
> appreciated. Thanks again.
>
For now the problem seems to be that the output prefix is not differentiated
between replicas. If the problem persists after fixing that, please also
report what kind of parallel file system you are using (NFS, GPFS, Lustre, ...).
>
> On Mon, Jan 10, 2022 at 17:22, Giacomo Fiorin (<giacomo.fiorin_at_gmail.com>)
> wrote:
>
>> Jing, you're probably using different values for outputName if you're
>> using multipleReplicas on (i.e. multiple walkers), but still, please
>> confirm that that's what you are using.
>>
>> Note also that by using file-based communication the replicas don't need
>> to be launched with the same command, but can also be run as independent
>> jobs:
>>
>> https://colvars.github.io/colvars-refman-namd/colvars-refman-namd.html#sec:colvarbias_meta_mr
>> In that framework, the main advantage of +replicas is mostly that the
>> value of replicaID is filled automatically, so that your Colvars config
>> file can be identical for all replicas.
>>
>> If you are experiencing file I/O issues also when launching replicas
>> independently (i.e. not with a single NAMD run with +replicas), can you
>> find out what kind of filesystem you have on the compute nodes?
>>
>> Thanks
>> Giacomo
>>
>>
>>
>> On Mon, Jan 10, 2022 at 9:37 AM Josh Vermaas <vermaasj_at_msu.edu> wrote:
>>
>>> There is definitely a bug in the 2.14 MPI version. One of my students
>>> has noticed that anything that calls NAMD die isn't taking down all the
>>> replicas, and so the jobs will continue to burn resources until they
>>> reach their wallclock limit.
>>>
>>> However, the key is figuring out *why* you are getting an error. I'm
>>> less familiar with metadynamics, but at least for umbrella sampling, it
>>> is pretty typical for each replica to write out its own set of files.
>>> This is usually done with something like:
>>>
>>> outputname somename.[myReplica]
>>>
>>> Where [myReplica] is a Tcl function that evaluates to the replica ID for
>>> each semi-independent simulation. For debugging purposes, it can be very
>>> helpful for each replica to spit out its own log file. This is usually
>>> done by setting the +stdout option on the command line.
>>>
>>> mpirun -np 28 namd2 +replicas 2 namd_metadynamics.inp +stdout outputlog.%d.log
>>>
>>> -Josh
>>>
>>> On 1/9/22 2:34 PM, jing liang wrote:
>>> > Hi,
>>> >
>>> > I am running a metadynamics simulation with the MPI version of NAMD 2.14.
>>> > SLURM is used for job scheduling; the way to run it using 2
>>> > replicas on a 14-core node is as follows:
>>> >
>>> > mpirun -np 28 namd2 +replicas 2 namd_metadynamics.inp
>>> >
>>> > In fact, I have tried up to 8 replicas and the resulting PMF looks very
>>> > similar to what I obtain with other methods such as ABF. The problem is
>>> > that, when using the replicas option, the simulation hangs right at the
>>> > end. I have looked at the output files and it seems that right at the end
>>> > NAMD wants to access some files (for example, *.xsc, *hills*, ...) that
>>> > already exist, and NAMD throws an error.
>>> >
>>> > My guess is that this could be either a misunderstanding from my side
>>> > in running NAMD with replicas or a bug in the MPI version.
>>> >
>>> > Have you observed that issue previously? Any comment is welcome. Thanks
>>> >
>>>
>>> --
>>> Josh Vermaas
>>>
>>> vermaasj_at_msu.edu
>>> Assistant Professor, Plant Research Laboratory and Biochemistry and
>>> Molecular Biology
>>> Michigan State University
>>>
>>> https://prl.natsci.msu.edu/people/faculty/josh-vermaas/
>>>
>>>
>>>