Re: Buffer Underflow in Multiple Walker MetaDynamics

From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Thu Jul 05 2012 - 17:25:02 CDT

Hi, Aron.

> So in the case you mentioned, where I run for 1ns or something, and then
> start from the outputs and run another 1ns, I would change the output names
> as I normally would, and that is all. And if I had a simulation crash in
> the middle of a run, I would do much the same, but also change the
> firsttimestep to be whatever it crashed at?
>

Changing firsttimestep wouldn't change the results of the run, but it may
help you keep track of the crashed jobs.

> Also, whenever I've been doing this, I've been deleting the
> Replica_files.txt file that holds the information for where each replica
> is, otherwise all the previous hill information gets loaded at the
> beginning in addition to already loading the restart.state file. Is this
> the right thing to do, or am I causing problems by being overly cautious?
>

I don't think this is a problem, as long as all replicas restart at the
same time, so that they have a chance to write their own entry in the .txt
file asap. If some of the replicas resume running later, there is no way
for the replicas that are already running to know that they even exist.

> And finally, since the old *.hills file isn't being read, but rather, the
> *.restart.colvars, this means that any hills written after the last
> colvarsRestartFrequency are lost, correct? And assuming the previous is
> true, does that mean that ideally one would have the
> colvarsRestartFrequency = the main restartfreq (I suppose this applies to
> more than just multiple replicas situations)?
>

You should have colvarsRestartFrequency equal to restartfreq most of the
time anyway, which is why restartfreq is the default. A good reason to
change it may be only to minimize I/O when the colvars.state file is not
important (e.g. you're applying restraints that don't depend on time).
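
To see what is at stake, here is a minimal Python sketch of the arithmetic.
It assumes, as the question does, that hills deposited after the last state
file was written are not recovered on restart. The function names and the
crash step are hypothetical, chosen only for illustration; the two
frequencies are the values quoted further down in this thread.

```python
def last_state_step(crash_step, colvars_restart_frequency):
    """Step at which the most recent colvars state file was written."""
    return (crash_step // colvars_restart_frequency) * colvars_restart_frequency


def hills_not_in_state(crash_step, colvars_restart_frequency, new_hill_frequency):
    """Hills deposited by one replica after its last state file was written."""
    saved = last_state_step(crash_step, colvars_restart_frequency)
    return crash_step // new_hill_frequency - saved // new_hill_frequency


# Hypothetical crash at step 1,730,000, with a state file every 1,000,000
# steps and a new hill every 500 steps.
print(last_state_step(1_730_000, 1_000_000))
print(hills_not_in_state(1_730_000, 1_000_000, 500))
```

Keeping the two frequencies equal also means the state file and the
coordinate restart files refer to the same step, so nothing has to be
reconciled by hand after a crash.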

> I'm finding the multiple replicas functionality quite ideal for leading to
> better convergence (less hysteresis) without having to do a lot of
> averaging or other things (I suppose the multiple replicas is its own kind
> of averaging, but I feel as though the communication between replicas makes
> it a little better than just that... maybe I'm being overly optimistic).

Well, I don't know about that. You should allow all replicas to
decorrelate, i.e. to cross between different basins. If you initially set 5
replicas in the global minimum, and 10 in a local minimum, and the two
minima are next to each other, you may mistakenly assume that the local
minimum is the global one. Only after some of the 10 replicas migrate to
the global minimum, and vice versa, can you start checking whether the
simulation is converged or not.

A plot of the colvars trajectory of each replica should be able to tell.
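
If a plot is inconvenient, the same check can be done by counting how often
each replica moves from one basin to the other. The following Python sketch
is one possible way to do it, not part of the colvars module. It assumes a
whitespace-separated trajectory with the step number in the first column
and comment lines starting with '#'. The function names, the basin edges
(3.0 and 7.0) and the sample data are hypothetical.

```python
def read_colvar_column(lines, column=1):
    """Return one column of values, skipping blank and '#' comment lines."""
    values = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        values.append(float(line.split()[column]))
    return values


def count_basin_crossings(values, lower_edge, upper_edge):
    """Count transitions between basin A (value < lower_edge) and
    basin B (value > upper_edge); points in between belong to neither."""
    crossings = 0
    current = None
    for v in values:
        basin = "A" if v < lower_edge else "B" if v > upper_edge else None
        if basin is None:
            continue
        if current is not None and basin != current:
            crossings += 1
        current = basin
    return crossings


# Hypothetical trajectory excerpt for a single replica.
example = """# step  distance
0     2.0
500   2.4
1000  5.1
1500  7.9
2000  8.3
2500  4.8
3000  1.9
""".splitlines()

print(count_basin_crossings(read_colvar_column(example), 3.0, 7.0))
```

A replica that reports zero crossings has not yet decorrelated from its
starting basin, whatever the combined free-energy profile looks like.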

Bests
Giacomo

>
>
> ~Aron
>
>
> On Thu, Jun 14, 2012 at 6:25 PM, Giacomo Fiorin <giacomo.fiorin_at_gmail.com> wrote:
>
>> First question: no, keep the same replicaIDs, but change the value of the
>> main NAMD configuration outputName option for each of them. I presume
>> you're already doing this, otherwise the new jobs will overwrite the
>> previous data, and you'll lose all the previous trajectory. In short, just
>> set numsteps to 1000000 and don't change anything else.
>>
>> Yes about the ratio between replicaUpdateFrequency and newHillFrequency.
>>
>> I strongly recommend recompiling, for reasons that have nothing to do
>> with the collective variables module. You're using a CUDA version, for
>> which two components (the CUDA library itself and the NAMD support) are
>> changing rather fast. Of course, 2.9 came out recently, so I don't know
>> how big the changes in CUDA are in the CVS version (I'm sure others can
>> fill in here). Compiling NAMD is a bit tough to learn in the beginning,
>> but it gets easier, not harder, with every version, so the investment
>> in time pays off big time.
>>
>> G.
>>
>> On Thu, Jun 14, 2012 at 6:12 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>
>>> Yes, the restart frequency is 2ns, I just had it as something rather
>>> large because the jobs had generally not been failing.
>>>
>>> In terms of your suggestion about 1,000,000 steps, you mean that a new
>>> job with a new output name should be started at that point? And then it
>>> will be added into the Replica file automatically and start communicating
>>> with everything else (and read in all the hills from the previous set of
>>> jobs)? If I have 16 replicas, does that mean that for this new set of jobs
>>> with the new output names, I'll want the replicaID to go from 17 through
>>> 32, and then keep incrementing for the next round of restarts?
>>>
>>> I'll certainly reduce the replicaUpdateFrequency, as I was concerned
>>> about that. Also, just so that I'm not confused, if the newHillFrequency
>>> is 500 and the replicaUpdateFrequency is 1000, it updates every 2 hills?
>>>
>>> I hadn't yet compiled a version of NAMD with that patch as I had hit a
>>> mental roadblock in terms of compiling (have always just used the
>>> precompiled binaries). I have now started down the compiling path, but it
>>> seems there are many places things can go wrong, but I appreciate that
>>> having those hard boundaries would make things much better. Hopefully I'll
>>> have it compiled in time to do a run if this current one suffers from the
>>> same problems.
>>>
>>> Thanks for all the suggestions.
>>>
>>> ~Aron
>>>
>>> On Thu, Jun 14, 2012 at 4:54 PM, Giacomo Fiorin <
>>> giacomo.fiorin_at_gmail.com> wrote:
>>>
>>>> Hey, that's a very large value for restartFrequency! It's probably 2
>>>> ns, right? At this point, I would suggest stopping a job after 1,000,000
>>>> steps, writing all the restart files, and starting a new job. The other
>>>> replicas will then be forced to do a re-sync and read the new state file,
>>>> which contains the complete list.
>>>>
>>>> Also, at 0.01 s/step you're not super-fast, I think you can afford to
>>>> bring replicaUpdateFrequency down to 1000 and keep more in sync.
>>>>
>>>> Yes, if grids are enabled the grids contain all the hills: the
>>>> analytical hills are only for the event that you leave the grids'
>>>> boundaries.
>>>>
>>>> Are you using the patch that I sent you to specifically eliminate the
>>>> analytical hills if you don't need them?
>>>>
>>>> G.
>>>>
>>>>
>>>> On Thu, Jun 14, 2012 at 4:47 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>>>
>>>>> Yes I was thinking to add larger hills less often and to also increase
>>>>> the replicaUpdateFrequency.
>>>>>
>>>>> I killed the previous runs and restarted, but in a moment of extreme
>>>>> stupidity, I didn't save the log files before clearing out the directories
>>>>> for the restart, so I'm unable to search for that line. I will not make
>>>>> that mistake again (I don't see that error happening in the logs for my 48
>>>>> replica run that completed properly).
>>>>>
>>>>> The values for the replicas were (I've added commas for readability):
>>>>>
>>>>> newHillFrequency 500
>>>>> replicaUpdateFrequency 10,000
>>>>> restartFrequency 1,000,000
>>>>>
>>>>> The run was progressing at 0.01 s/step, so I guess that is ~5 seconds
>>>>> per hill addition, and more importantly ~100 seconds per update (I have no
>>>>> sense of how that compares against the time needed to read a file).
>>>>>
>>>>> I've made the rather minor change of the newHillFrequency going to
>>>>> 1000 (and increased the hill size accordingly) in order to have fewer hills
>>>>> being passed around and saved in that file, and I've increased the grid
>>>>> boundaries substantially such that there are now 10 bin widths between the
>>>>> walls and the boundaries. If this fails I will attempt your recommendation
>>>>> of increasing the value of the hillfrequency and updatefrequency.
>>>>>
>>>>> In terms of the filesystem, the different nodes all share the same
>>>>> filesystem. I'm not sure what filesystem it is though; the OS is
>>>>> CentOS. I can find out about this if it is useful.
>>>>>
>>>>> Thanks for the suggestions, I'll continue to look for that warning as
>>>>> I check on things, hopefully it was just some random hardware glitch.
>>>>>
>>>>> One more question though, for the multiple walker stuff, all the hills
>>>>> are saved analytically for each walker, and then when another walker reads
>>>>> those, it adds that to its own grid? So regardless of grid boundaries,
>>>>> all the hill files grow over time? And I presume it takes longer to access
>>>>> a larger file than it does a small one, so it is best to have the fewest,
>>>>> largest hills that are still tolerable in terms of accuracy?
>>>>>
>>>>> ~Aron
>>>>>
>>>>>
>>>>> On Thu, Jun 14, 2012 at 4:20 PM, Giacomo Fiorin <
>>>>> giacomo.fiorin_at_gmail.com> wrote:
>>>>>
>>>>>> I see.
>>>>>>
>>>>>> What you said in 4) sounds indeed suspicious: are the different nodes
>>>>>> sharing the same filesystem through NFS? The same problems may be
>>>>>> responsible for 5), e.g. you don't get a read error but the filesystem
>>>>>> doesn't catch up as often as it should. I haven't had very good
>>>>>> experiences with NFS filesystems.
>>>>>>
>>>>>> Indeed, one thing you can try is to make the hills larger and add them
>>>>>> less often. A good start would be to make them 16 times larger, and add
>>>>>> them 16 times less often (make it 10 and 10 so as not to mess up the
>>>>>> restart frequency, of course).
>>>>>>
>>>>>> Or even better, keep the hills as they are, but increase instead
>>>>>> replicaUpdateFrequency (so you give the replicas more time to empty their
>>>>>> buffers).
>>>>>>
>>>>>> Btw, are you getting the following warning?
>>>>>>
>>>>>> Warning: in metadynamics bias "metadynamics1": failed to read
>>>>>> completely output files from replica "xxx" ...
>>>>>>
>>>>>> Also, I'd like to know your exact values of
>>>>>> newHillFrequency, replicaUpdateFrequency and restartFreq.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 14, 2012 at 3:56 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>>>>>
>>>>>>> So more details, some of which have me quite puzzled:
>>>>>>>
>>>>>>> 1) The run started with 16 replicas, all initialized fine, and all
>>>>>>> hills were being shared between all replicas (as judged by the output
>>>>>>> claiming to have received a hill from replica X)
>>>>>>>
>>>>>>> 2) At the same time both replicas 4 and 16 failed due to the error
>>>>>>> mentioned. Also at the same time, replica 15 failed due to a different
>>>>>>> error:
>>>>>>>
>>>>>>> colvars: Error: cannot read from file
>>>>>>> "/work/broom/ThreeFoil_Galactose/GB_3D_Production/16_Replica_Run/Run_1/Meta_Galactose_GB_Run_100ns.colvars.metadynamics1.1.hills".
>>>>>>> colvars: If this error message is unclear, try recompiling with
>>>>>>> -DCOLVARS_DEBUG.
>>>>>>> FATAL ERROR: Error in the collective variables module: exiting.
>>>>>>>
>>>>>>> 3) The remaining replicas have continued since then.
>>>>>>>
>>>>>>> 4) I have another 16 replica simulation running in a completely
>>>>>>> different folder, using different nodes, and it also had 3 failures, and
>>>>>>> they appear to be at least within the same minute based on the wallclock.
>>>>>>> Maybe this suggests some kind of hardware problem that occurred at that
>>>>>>> time?
>>>>>>>
>>>>>>> 5) The other thing I'm noticing is that hill updates from some
>>>>>>> replicas that are still running seem to stop occurring for a long time, and
>>>>>>> then a large chunk of them are added, with the message that X hills are
>>>>>>> close to the grid boundaries and are being computed analytically. I see
>>>>>>> the reason for this, but I'm wondering if perhaps that is partially to
>>>>>>> blame in all of this, and I should increase my grid boundaries
>>>>>>> substantially?
>>>>>>>
>>>>>>> 6) One last thing to note is that I recently had a 48 replica run
>>>>>>> complete without trouble, although in terms of communication, each replica
>>>>>>> only needed to get half as many hills, half as often.
>>>>>>>
>>>>>>> ~Aron
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jun 14, 2012 at 2:48 PM, Giacomo Fiorin <
>>>>>>> giacomo.fiorin_at_gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Aron, indeed this is very interesting, I've never seen this
>>>>>>>> low-level error message before.
>>>>>>>>
>>>>>>>> Internally, the code for replica A catches an internal error when
>>>>>>>> it can't correctly read a new block of data from replica B because
>>>>>>>> that block was not completely written yet. Then, replica A moves on
>>>>>>>> but saves the position in the file of replica B, and tries to read
>>>>>>>> again at the next update.
>>>>>>>>
>>>>>>>> If the file has been written partially (i.e. it stopped in the
>>>>>>>> middle of writing a number), you should get a warning that the above has
>>>>>>>> happened and that the simulation continues.
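
The "save the position and try again" pattern described above can be
sketched in Python as follows. This is not the colvars code (which is
C++), only an illustration of the idea: the class name, the file name and
the three-column records are all hypothetical, and a record is treated as
complete once its trailing newline has been written.

```python
import os
import tempfile


class IncrementalReader:
    """Reads complete lines appended to a file, remembering where it stopped."""

    def __init__(self, path):
        self.path = path
        self.position = 0  # byte offset of the first unread record

    def read_new_lines(self):
        """Return the complete lines written since the last call.

        A trailing fragment without a newline is treated as a record that
        is still being written; it is left in place for the next call.
        """
        with open(self.path, "rb") as handle:
            handle.seek(self.position)
            data = handle.read()
        end = data.rfind(b"\n") + 1  # 0 if no complete line is available yet
        self.position += end
        return data[:end].decode().splitlines()


# Simulate replica B writing its hills while replica A reads them.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "replica_B.hills")
    with open(path, "w") as out:
        out.write("500 1.25 0.1\n1000 1.3")  # second record is cut short
    reader = IncrementalReader(path)
    first = reader.read_new_lines()          # only the complete record
    with open(path, "a") as out:
        out.write("0 0.1\n")                 # the writer finishes the record
    second = reader.read_new_lines()         # the rest is picked up now
    print(first, second)
```

The reader never advances past a half-written record, so a slow writer
costs one update of delay rather than a corrupted hill.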
>>>>>>>>
>>>>>>>> It looks like this is a lower-level error (you just can't get
>>>>>>>> characters out of the files).
>>>>>>>>
>>>>>>>> How often has the error shown up? Has it occurred at all times
>>>>>>>> since the beginning of the simulation, or was its occurrence only in
>>>>>>>> certain periods?
>>>>>>>>
>>>>>>>> On Thu, Jun 14, 2012 at 2:26 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'm running multiple walker MetaDynamics, and for a few of the
>>>>>>>>> replicas, after a random period of time, the run crashes with the following
>>>>>>>>> error:
>>>>>>>>>
>>>>>>>>> terminate called after throwing an instance of
>>>>>>>>> 'std::ios_base::failure'
>>>>>>>>> what(): basic_filebuf::underflow error reading the file
>>>>>>>>> /var/spool/torque/mom_priv/jobs/4378.mon240.monk.sharcnet.SC:
>>>>>>>>> line 3: 30588 Aborted
>>>>>>>>> (core dumped)
>>>>>>>>> ../../../../NAMD/NAMD_2.9_Linux-x86_64-multicore-CUDA/namd2 +p4 +idlepoll
>>>>>>>>> +mergegrids Galactose_Meta_Run.namd
>>>>>>>>>
>>>>>>>>> I suspect the last two lines are rather meaningless, but I
>>>>>>>>> included them for completeness. I'm not sure, but I think this results
>>>>>>>>> when replica A is attempting to read the hills from replica B while replica
>>>>>>>>> B is adding new hills, or alternatively when two replicas are trying to
>>>>>>>>> read hills from another replica at the same time. If that is the case,
>>>>>>>>> then I suppose losing some synchronization between the replicas by
>>>>>>>>> increasing the time between updates might help. But I'd ideally like to
>>>>>>>>> avoid that, and was wondering if maybe this is a hardware or operating
>>>>>>>>> system specific problem?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> ~Aron
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Aron Broom M.Sc
>>>>>>>>> PhD Student
>>>>>>>>> Department of Chemistry
>>>>>>>>> University of Waterloo
