Re: NAMD hangs with replica option

From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Mon Jan 10 2022 - 10:22:13 CST

Jing, if you are using multipleReplicas on (i.e. multiple walkers), you are
probably already using a different value of outputName for each replica, but
please confirm that this is indeed your setup.

Note also that because communication between replicas is file-based, the
replicas don't need to be launched with a single command, but can also be run
as independent jobs:
https://colvars.github.io/colvars-refman-namd/colvars-refman-namd.html#sec:colvarbias_meta_mr
In that scheme, the main advantage of +replicas is that the value of
replicaID is filled in automatically, so that your Colvars config file can be
identical for all replicas.
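
To illustrate, a minimal multiple-walkers metadynamics block in the Colvars
config could look like the sketch below; this is only a sketch, and the bias
name, colvar name, hill parameters and file names are placeholders rather
than anything taken from your input:

  metadynamics {
    name                   meta1                  # placeholder bias name
    colvars                dist1                  # placeholder colvar name
    hillWeight             0.1
    newHillFrequency       500
    multipleReplicas       on
    replicasRegistry       replicas.registry.txt  # shared file on a common filesystem
    replicaUpdateFrequency 1000
    # replicaID is filled in automatically when running under +replicas;
    # for independent jobs, set it explicitly, e.g.  replicaID walker1
  }

With independent jobs, each replica's NAMD config would also need its own
outputName, along the lines of what Josh describes below.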

If you experience file I/O issues even when launching the replicas
independently (i.e. not within a single NAMD run with +replicas), could you
find out what kind of filesystem the compute nodes use?
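
For example, on a GNU/Linux compute node something like

  stat -f -c %T .     # prints the filesystem type of the working directory
  df -Th .            # same information plus mount point and usage

(run from the directory where the job writes its output) should tell you
whether it is NFS, Lustre, GPFS, a local disk, etc.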

Thanks
Giacomo

On Mon, Jan 10, 2022 at 9:37 AM Josh Vermaas <vermaasj_at_msu.edu> wrote:

> There is definitely a bug in the 2.14 MPI version. One of my students
> has noticed that anything that calls NAMD's die routine doesn't take down
> all the replicas, so the jobs continue to burn resources until they reach
> their wallclock limit.
>
> However, the key is figuring out *why* you are getting an error. I'm
> less familiar with metadynamics, but at least for umbrella sampling, it
> is pretty typical for each replica to write out its own set of files.
> This is usually done with something like:
>
> outputname somename.[myReplica]
>
> where [myReplica] is a Tcl command that evaluates to the replica ID of
> each semi-independent simulation. For debugging, it can be very helpful
> for each replica to write its own log file, which is done with the
> +stdout option on the command line:
>
> mpirun -np 28 namd2 +replicas 2 namd_metadynamics.inp +stdout
> outputlog.%d.log
>
> -Josh
>
> On 1/9/22 2:34 PM, jing liang wrote:
> > Hi,
> >
> > I am running a metadynamics simulation with the NAMD 2.14 MPI version.
> > SLURM is used for job scheduling, and I run 2 replicas on a 14-core node
> > as follows:
> >
> > mpirun -np 28 namd2 +replicas 2 namd_metadynamics.inp
> >
> > In fact, I have tried up to 8 replicas, and the resulting PMF looks very
> > similar to what I obtain with other methods such as ABF. The problem is
> > that with the replicas option the simulation hangs right at the end.
> > Looking at the output files, it seems that at the very end NAMD tries to
> > access some files (for example, *.xsc, *hills*, ...) that already exist,
> > and throws an error.
> >
> > My guess is that this is either a misunderstanding on my side about
> > running NAMD with replicas, or a bug in the MPI version.
> >
> > Have you observed this issue before? Any comment is welcome. Thanks
> >
>
> --
> Josh Vermaas
>
> vermaasj_at_msu.edu
> Assistant Professor, Plant Research Laboratory and Biochemistry and
> Molecular Biology
> Michigan State University
>
> https://prl.natsci.msu.edu/people/faculty/josh-vermaas/
>
>
>
