Re: Buffer Underflow in Multiple Walker MetaDynamics

From: Aron Broom (broomsday_at_gmail.com)
Date: Thu Jul 05 2012 - 18:01:04 CDT

Thanks for the answers!

One thing I didn't quite understand: you said:

"I don't think this is a problem, as long as all replicas restart at the
same time, so that they have a chance to write their own entry in the .txt
file as soon as possible. If some of the replicas resume running later,
there is no way for the replicas that are already running to know that they
even exist."

But whenever I've run things, some of the replicas do start a bit later
than others, and once they start, they write their information to that
file; from the log file it appears that the replicas that started first
then begin to read their hills. I didn't think there was a requirement for
everything to start/run at the same time, beyond the possibly better
convergence from the synchronization. But overall, does this mean that when
restarting one generally ought not to delete that file, and should instead
leave any previous entries in place?

In regards to the last point about convergence, it does seem as though with
a high degree of synchronization, each replica tends to become trapped in
its own particular region (based on plotting the colvars trajectory as you
say). I have, however, been pre-equilibrating the system by running with a
higher hill height, in order to roughly redistribute the replicas evenly
across the reaction coordinate, rather than biasing for some initial
configuration, and then starting the metadynamics run from scratch but with
the distributed starting coordinates. So even with a high degree of
synchronization, you think a good rule of thumb would be to ensure a
crossing event for each replica (or perhaps at least most of the replicas)?

~Aron

On Thu, Jul 5, 2012 at 6:25 PM, Giacomo Fiorin <giacomo.fiorin_at_gmail.com> wrote:

> Hi, Aron.
>
> So in the case you mentioned, where I run for 1ns or something, and then
>> start from the outputs and run another 1ns, I would change the output names
>> as I normally would, and that is all. And if I had a simulation crash in
>> the middle of a run, I would do much the same, but also change the
>> firsttimestep to be whatever it crashed at?
>>
>
> Changing firsttimestep wouldn't change the results of the run, but it may
> make it easier for you to keep track of the crashed jobs.
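>
> As a minimal sketch of such a continuation (the file names and step counts
> below are purely illustrative, not taken from your setup), the
> restart-related lines of the NAMD configuration for a second segment would
> look something like:
>
>     # segment 2 of the same metadynamics run (illustrative names/values)
>     set prev          Meta_Run_seg1        ;# outputName of the previous segment
>     outputName        Meta_Run_seg2        ;# new prefix so nothing is overwritten
>     binCoordinates    $prev.restart.coor
>     binVelocities     $prev.restart.vel
>     extendedSystem    $prev.restart.xsc
>     colvarsInput      $prev.restart.colvars.state
>     firsttimestep     500000               ;# optional, for bookkeeping only
>     run               500000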
>
>
>> Also, whenever I've been doing this, I've been deleting the
>> Replica_files.txt file that holds the information for where each replica
>> is, otherwise all the previous hill information gets loaded at the
>> beginning in addition to already loading the restart.state file. Is this
>> the right thing to do, or am I causing problems by being overly cautious?
>>
>
> I don't think this is a problem, as long as all replicas restart at the
> same time, so that they have a chance to write their own entry in the .txt
> file as soon as possible. If some of the replicas resume running later,
> there is no way for the replicas that are already running to know that
> they even exist.
>
>
>> And finally, since the old *.hills file isn't being read, but rather, the
>> *.restart.colvars, this means that any hills written after the last
>> colvarsRestartFrequency are lost, correct? And assuming the previous is
>> true, does that mean that ideally one would have the
>> colvarsRestartFrequency = the main restartfreq (I suppose this applies to
>> more than just multiple replicas situations)?
>>
>
> You should have colvarsRestartFrequency equal to restartfreq most of the
> time anyway, which is why restartfreq is the default. A good reason to
> change it may be only to minimize I/O when the colvars.state file is not
> important (e.g. you're applying restraints that don't depend on time).
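>
> For instance (the frequencies here are just placeholders), keeping the two
> consistent would amount to:
>
>     # in the main NAMD configuration file
>     restartfreq             10000
>
>     # in the colvars configuration file (matching restartfreq is also the default)
>     colvarsRestartFrequency 10000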
>
>
>> I'm finding the multiple replicas functionality quite ideal for leading
>> to better convergence (less hysteresis) without having to do a lot of
>> averaging or other things (I suppose the multiple replicas is its own kind
>> of averaging, but I feel as though the communication between replicas makes
>> it a little better than just that... maybe I'm being overly optimistic).
>
>
> Well, I don't know about that. You should allow all replicas to
> decorrelate, i.e. to cross between the different basins. If you initially
> set 5 replicas in the global minimum and 10 in a local minimum, and the two
> minima are next to each other, you may mistakenly assume that the local
> minimum is the global one. Only after some of the 10 replicas migrate to
> the global minimum, and vice versa, can you start checking whether the
> simulation is converged or not.
>
> A plot of the colvars trajectory of each replica should be able to tell.
>
> Bests
> Giacomo
>
>
>>
>>
>> ~Aron
>>
>>
>> On Thu, Jun 14, 2012 at 6:25 PM, Giacomo Fiorin <giacomo.fiorin_at_gmail.com> wrote:
>>
>>> First question: no, keep the same replicaIDs, but change the value of
>>> the main NAMD configuration outputName option for each of them. I presume
>>> you're already doing this, otherwise the new jobs will overwrite the
>>> previous data, and you'll lose all the previous trajectory. In short, just
>>> set numsteps to 1000000 and don't change anything else.
>>>
>>> Yes about the ratio between replicaUpdateFrequency and newHillFrequency.
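>>>
>>> As a minimal sketch of the block in question (the colvar name, hill weight
>>> and frequencies below are placeholders, not recommendations), each walker
>>> would keep its own replicaID unchanged from one job segment to the next:
>>>
>>>     metadynamics {
>>>       name                   metadynamics1
>>>       colvars                dist              # hypothetical colvar name
>>>       hillWeight             0.1
>>>       newHillFrequency       500
>>>       multipleReplicas       on
>>>       replicaID              1                 # fixed for this walker across restarts
>>>       replicasRegistry       Replica_files.txt
>>>       replicaUpdateFrequency 1000              # a multiple of newHillFrequency
>>>     }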
>>>
>>> I strongly recommend recompiling, for reasons that have nothing to do
>>> with the collective variables module. You're using a CUDA version, for
>>> which two components (the CUDA library itself and the NAMD support) are
>>> changing rather fast. Of course, 2.9 came out recently, so I don't know
>>> how big the changes in CUDA are in the CVS version (I'm sure others can
>>> fill in here). Compiling NAMD is a bit tough to learn at first, but it
>>> gets easier, not harder, with every version, so the investment of time
>>> pays off big time.
>>>
>>> G.
>>>
>>> On Thu, Jun 14, 2012 at 6:12 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>>
>>>> Yes, the restart frequency is 2 ns; I just had it set to something rather
>>>> large because the jobs had generally not been failing.
>>>>
>>>> In terms of your suggestion about 1,000,000 steps, you mean that a new
>>>> job with a new output name should be started at that point? And then it
>>>> will be added into the Replica file automatically and start communicating
>>>> with everything else (and read in all the hills from the previous set of
>>>> jobs)? If I have 16 replicas, does that mean that for this new set of jobs
>>>> with the new output names, I'll want the replicaID to go from 17 through
>>>> 32, and then keep incrementing for the next round of restarts?
>>>>
>>>> I'll certainly reduce the replicaUpdateFrequency, as I was concerned
>>>> about that. Also, just so that I'm not confused, if the newHillFrequency
>>>> is 500 and the replicaUpdateFrequency is 1000, it updates every 2 hills?
>>>>
>>>> I hadn't yet compiled a version of NAMD with that patch as I had hit a
>>>> mental roadblock in terms of compiling (I have always just used the
>>>> precompiled binaries). I have now started down the compiling path, and it
>>>> seems there are many places where things can go wrong, but I appreciate
>>>> that having those hard boundaries would make things much better. Hopefully I'll
>>>> have it compiled in time to do a run if this current one suffers from the
>>>> same problems.
>>>>
>>>> Thanks for the all the suggestions.
>>>>
>>>> ~Aron
>>>>
>>>> On Thu, Jun 14, 2012 at 4:54 PM, Giacomo Fiorin <
>>>> giacomo.fiorin_at_gmail.com> wrote:
>>>>
>>>>> Hey, that's a very large value for restartFrequency! It's probably 2
>>>>> ns, right? At this point, I would suggest stopping a job after 1,000,000
>>>>> steps, writing all the restart files, and starting a new job. The other
>>>>> replicas will then be forced to re-sync and read the new state file,
>>>>> which contains the complete list.
>>>>>
>>>>> Also, at 0.01 s/step you're not super-fast, so I think you can afford to
>>>>> bring replicaUpdateFrequency down to 1000 and keep the replicas more in sync.
>>>>>
>>>>> Yes, if grids are enabled the grids contain all the hills: the
>>>>> analytical hills are only used in the event that you leave the grids'
>>>>> boundaries.
>>>>>
>>>>> Are you using the patch that I sent you to specifically eliminate the
>>>>> analytical hills if you don't need them?
>>>>>
>>>>> G.
>>>>>
>>>>>
>>>>> On Thu, Jun 14, 2012 at 4:47 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>>>>
>>>>>> Yes, I was thinking of adding larger hills less often and also
>>>>>> increasing the replicaUpdateFrequency.
>>>>>>
>>>>>> I killed the previous runs and restarted, but in a moment of extreme
>>>>>> stupidity, I didn't save the log files before clearing out the directories
>>>>>> for the restart, so I'm unable to search for that line. I will not make
>>>>>> that mistake again (I don't see that error happening in the logs for my 48
>>>>>> replica run that completed properly).
>>>>>>
>>>>>> The values for the replicas were (I've added commas for readability):
>>>>>>
>>>>>> newHillFrequency 500
>>>>>> replicaUpdateFrequency 10,000
>>>>>> restartFrequency 1,000,000
>>>>>>
>>>>>> The run was progressing at 0.01 s/step, so I guess that is ~5 seconds
>>>>>> per hill addition, and more importantly ~100 seconds per update (I have no
>>>>>> sense of how that compares against the time needed to read a file).
>>>>>>
>>>>>> I've made the rather minor change of the newHillFrequency going to
>>>>>> 1000 (and increased the hill size accordingly) in order to have fewer hills
>>>>>> being passed around and saved in that file, and I've increased the grid
>>>>>> boundaries substantially such that there are now 10 bin widths between the
>>>>>> walls and the boundaries. If this fails I will attempt your recommendation
>>>>>> of further increasing newHillFrequency and replicaUpdateFrequency.
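>>>>>>
>>>>>> As a rough sketch of that layout (the numbers are made up; only the spacing
>>>>>> between the walls and the grid boundaries matters), the colvar definition now
>>>>>> looks something like:
>>>>>>
>>>>>>     colvar {
>>>>>>       name  dist                  # hypothetical distance colvar
>>>>>>       width 0.5                   # grid bin width
>>>>>>       lowerBoundary  0.0          # grid edges placed 10 bins beyond the walls
>>>>>>       upperBoundary 40.0
>>>>>>       lowerWall      5.0
>>>>>>       upperWall     35.0
>>>>>>       lowerWallConstant 10.0
>>>>>>       upperWallConstant 10.0
>>>>>>       distance {
>>>>>>         group1 { atomNumbers 1 }  # placeholder atom selections
>>>>>>         group2 { atomNumbers 2 }
>>>>>>       }
>>>>>>     }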
>>>>>>
>>>>>> In terms of the filesystem, the different nodes all share the same
>>>>>> filesystem. I'm not sure which filesystem it is, though; the OS is
>>>>>> CentOS. I can find out about this if it is useful.
>>>>>>
>>>>>> Thanks for the suggestions. I'll continue to look for that warning as
>>>>>> I check on things; hopefully it was just some random hardware glitch.
>>>>>>
>>>>>> One more question though, for the multiple walker stuff, all the
>>>>>> hills are saved analytically for each walker, and then when another walker
>>>>>> reads those, it adds them to its own grid? So regardless of grid
>>>>>> boundaries, all the hill files grow over time? And I presume it takes
>>>>>> longer to access a larger file than it does a small one, so it is best to
>>>>>> have the fewest, largest hills that are still tolerable in terms of
>>>>>> accuracy?
>>>>>>
>>>>>> ~Aron
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 14, 2012 at 4:20 PM, Giacomo Fiorin <
>>>>>> giacomo.fiorin_at_gmail.com> wrote:
>>>>>>
>>>>>>> I see.
>>>>>>>
>>>>>>> What you said in 4) sounds indeed suspicious: are the different
>>>>>>> nodes sharing the same filesystem through NFS? The same problem may be
>>>>>>> responsible for 5), e.g. you don't get a read error but the filesystem
>>>>>>> doesn't catch up as often as it should. I haven't had very good experiences
>>>>>>> with NFS filesystems.
>>>>>>>
>>>>>>> Indeed one thing you can try is to make the hills larger and add
>>>>>>> them less often. A good start would be to make them 16 times larger, and
>>>>>>> add them 16 times less often (or make it 10 and 10, so as not to mess up
>>>>>>> the restart frequency, of course).
>>>>>>>
>>>>>>> Or even better, keep the hills as they are, but instead increase
>>>>>>> replicaUpdateFrequency (so you give the replicas more time to empty their
>>>>>>> buffers).
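>>>>>>>
>>>>>>> In terms of the keywords, the two options would look something like this
>>>>>>> (the starting values are placeholders, just to show the scaling):
>>>>>>>
>>>>>>>     # option 1: hills 10 times larger, added 10 times less often
>>>>>>>     hillWeight             1.0     # placeholder: 10x a previous 0.1
>>>>>>>     newHillFrequency       1000    # placeholder: 10x a previous 100
>>>>>>>
>>>>>>>     # option 2: keep the hills, but let the replicas sync less often
>>>>>>>     replicaUpdateFrequency 10000   # placeholder: 10x a previous 1000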
>>>>>>>
>>>>>>> Btw, are you getting the following warning?
>>>>>>>
>>>>>>> Warning: in metadynamics bias "metadynamics1": failed to read
>>>>>>> completely output files from replica "xxx" ...
>>>>>>>
>>>>>>> Also, I'd like to know your exact values of
>>>>>>> newHillFrequency, replicaUpdateFrequency and restartFreq.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jun 14, 2012 at 3:56 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>>>>>>
>>>>>>>> So more details, some of which have me quite puzzled:
>>>>>>>>
>>>>>>>> 1) The run started with 16 replicas, all initialized fine, and all
>>>>>>>> hills were being shared between all replicas (as judged by the output
>>>>>>>> claiming to have received a hill from replica X)
>>>>>>>>
>>>>>>>> 2) At the same time both replicas 4 and 16 failed due to the error
>>>>>>>> mentioned. Also at the same time, replica 15 failed due to a different
>>>>>>>> error:
>>>>>>>>
>>>>>>>> colvars: Error: cannot read from file
>>>>>>>> "/work/broom/ThreeFoil_Galactose/GB_3D_Production/16_Replica_Run/Run_1/Meta_Galactose_GB_Run_100ns.colvars.metadynamics1.1.hills".
>>>>>>>> colvars: If this error message is unclear, try recompiling with
>>>>>>>> -DCOLVARS_DEBUG.
>>>>>>>> FATAL ERROR: Error in the collective variables module: exiting.
>>>>>>>>
>>>>>>>> 3) The remaining replicas have continued since then.
>>>>>>>>
>>>>>>>> 4) I have another 16 replica simulation running in a completely
>>>>>>>> different folder, using different nodes, and it also had 3 failures, and
>>>>>>>> they appear to be at least within the same minute based on the wallclock.
>>>>>>>> Maybe this suggests some kind of hardware problem that occurred at that
>>>>>>>> time?
>>>>>>>>
>>>>>>>> 5) The other thing I'm noticing is that hill updates from some
>>>>>>>> replicas that are still running seem to stop occurring for a long time, and
>>>>>>>> then a large chunk of them are added, with the message that X hills are
>>>>>>>> close to the grid boundaries and are being computed analytically. I see
>>>>>>>> the reason for this, but I'm wondering if perhaps that is partially to
>>>>>>>> blame in all of this, and I should increase my grid boundaries
>>>>>>>> substantially?
>>>>>>>>
>>>>>>>> 6) One last thing to note is that I recently had a 48 replica run
>>>>>>>> complete without trouble, although in terms of communication, each replica
>>>>>>>> only needed to get half as many hills, half as often.
>>>>>>>>
>>>>>>>> ~Aron
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jun 14, 2012 at 2:48 PM, Giacomo Fiorin <
>>>>>>>> giacomo.fiorin_at_gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Aron, indeed this is very interesting; I've never seen this
>>>>>>>>> low-level error message before.
>>>>>>>>>
>>>>>>>>> Internally, the code in replica A catches an internal error when it
>>>>>>>>> can't correctly read a new block of data from replica B, because that
>>>>>>>>> block has not been completely written yet. Replica A then moves on, but
>>>>>>>>> saves its position in replica B's file and tries to read again at the
>>>>>>>>> next update.
>>>>>>>>>
>>>>>>>>> If the file has been written partially (i.e. it stopped in the
>>>>>>>>> middle of writing a number), you should get a warning that the above has
>>>>>>>>> happened and that the simulation continues.
>>>>>>>>>
>>>>>>>>> It looks like this is a lower-level error (you just can't get
>>>>>>>>> characters out of the files).
>>>>>>>>>
>>>>>>>>> How often has the error shown up? Has it occurred steadily since the
>>>>>>>>> beginning of the simulation, or only during certain periods?
>>>>>>>>>
>>>>>>>>> On Thu, Jun 14, 2012 at 2:26 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I'm running multiple walker MetaDynamics, and for a few of the
>>>>>>>>>> replicas, after a random period of time, the run crashes with the following
>>>>>>>>>> error:
>>>>>>>>>>
>>>>>>>>>> terminate called after throwing an instance of
>>>>>>>>>> 'std::ios_base::failure'
>>>>>>>>>> what(): basic_filebuf::underflow error reading the file
>>>>>>>>>> /var/spool/torque/mom_priv/jobs/4378.mon240.monk.sharcnet.SC:
>>>>>>>>>> line 3: 30588 Aborted
>>>>>>>>>> (core dumped)
>>>>>>>>>> ../../../../NAMD/NAMD_2.9_Linux-x86_64-multicore-CUDA/namd2 +p4 +idlepoll
>>>>>>>>>> +mergegrids Galactose_Meta_Run.namd
>>>>>>>>>>
>>>>>>>>>> I suspect the last two lines are rather meaningless, but I
>>>>>>>>>> included them for completeness. I'm not sure, but I think this results
>>>>>>>>>> when replica A is attempting to read the hills from replica B while replica
>>>>>>>>>> B is adding new hills, or alternatively when two replicas are trying to
>>>>>>>>>> read hills from another replica at the same time. If that is the case,
>>>>>>>>>> then I suppose losing some synchronization between the replicas by
>>>>>>>>>> increasing the time between updates might help. But I'd ideally like to
>>>>>>>>>> avoid that, and was wondering if maybe this is a hardware or operating
>>>>>>>>>> system specific problem?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> ~Aron
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Aron Broom M.Sc
>>>>>>>>>> PhD Student
>>>>>>>>>> Department of Chemistry
>>>>>>>>>> University of Waterloo
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Aron Broom M.Sc
>>>>>>>> PhD Student
>>>>>>>> Department of Chemistry
>>>>>>>> University of Waterloo
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Aron Broom M.Sc
>>>>>> PhD Student
>>>>>> Department of Chemistry
>>>>>> University of Waterloo
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Aron Broom M.Sc
>>>> PhD Student
>>>> Department of Chemistry
>>>> University of Waterloo
>>>>
>>>>
>>>
>>
>>
>> --
>> Aron Broom M.Sc
>> PhD Student
>> Department of Chemistry
>> University of Waterloo
>>
>>
>

-- 
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo
