Re: Buffer Underflow in Multiple Walker MetaDynamics

From: Aron Broom (broomsday_at_gmail.com)
Date: Thu Jun 14 2012 - 15:47:10 CDT

Yes I was thinking to add larger hills less often and to also increase the
replicaUpdateFrequency.

I killed the previous runs and restarted, but in a moment of extreme
stupidity, I didn't save the log files before clearing out the directories
for the restart, so I'm unable to search for that line. I will not make
that mistake again (I don't see that error happening in the logs for my 48
replica run that completed properly).

The values for the replicas were (I've added commas for readability):

newHillFrequency 500
replicaUpdateFrequency 10,000
restartFrequency 1,000,000

The run was progressing at 0.01 s/step, so I guess that is ~5 seconds per
hill addition, and more importantly ~100 seconds per update (I have no
sense of how that compares against the time needed to read a file).
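
For reference, the arithmetic behind those estimates can be written out as a short Python sketch. It only uses the three values listed above and the observed 0.01 s/step; the names `SECONDS_PER_STEP`, `settings` and `wallclock_seconds` are made up for illustration and are not part of NAMD or Colvars.

```python
# Wall-clock interval between events, using the values quoted above.
SECONDS_PER_STEP = 0.01  # observed speed of the run (s/step)

settings = {
    "newHillFrequency": 500,
    "replicaUpdateFrequency": 10_000,
    "restartFrequency": 1_000_000,
}


def wallclock_seconds(steps, seconds_per_step=SECONDS_PER_STEP):
    """Wall-clock time needed to advance the given number of MD steps."""
    return steps * seconds_per_step


for name, steps in settings.items():
    print(f"{name}: every {wallclock_seconds(steps):,.0f} s")

# Hills that each replica deposits between two consecutive updates,
# i.e. how many new hills a reader finds per replica file at each update.
hills_per_update = (
    settings["replicaUpdateFrequency"] // settings["newHillFrequency"]
)
print(f"hills per replica per update: {hills_per_update}")
```

So with these settings each replica has 20 new hills to pick up from every other replica at each update.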

I've made the rather minor change of the newHillFrequency going to 1000
(and increased the hill size accordingly) in order to have fewer hills
being passed around and saved in that file, and I've increased the grid
boundaries substantially such that there are now 10 bin widths between the
walls and the boundaries. If this fails I will attempt your recommendation
of increasing the values of newHillFrequency and replicaUpdateFrequency.
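
A rough sketch of the rescaling, assuming that "hill size" means the hill height (hillWeight in Colvars) and that the goal is to keep the amount of bias deposited per step unchanged. The starting weight of 0.1 is a placeholder, since the actual value is not given in this thread, and `rescale_hills` is a hypothetical helper.

```python
def rescale_hills(hill_weight, new_hill_frequency, factor):
    """Add hills `factor` times less often and make them `factor` times
    taller, so hill_weight / new_hill_frequency stays the same."""
    return hill_weight * factor, new_hill_frequency * factor


old_weight = 0.1  # placeholder hillWeight; the real value is not in this thread
old_freq = 500    # previous newHillFrequency

new_weight, new_freq = rescale_hills(old_weight, old_freq, factor=2)

replica_update_frequency = 10_000
old_hills_per_update = replica_update_frequency // old_freq
new_hills_per_update = replica_update_frequency // new_freq

print(f"newHillFrequency {new_freq}, hillWeight {new_weight}")
print(f"hills per update: {old_hills_per_update} -> {new_hills_per_update}")
```

The same helper with factor=10 would correspond to the "10 and 10" variant suggested in the quoted reply below.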

In terms of the filesystem, the different nodes all share the same
filesystem. I'm not sure which filesystem it is, though; the OS is
CentOS. I can find out if that would be useful.

Thanks for the suggestions, I'll continue to look for that warning as I
check on things, hopefully it was just some random hardware glitch.

One more question though, for the multiple walker stuff, all the hills are
saved analytically for each walker, and then when another walker reads
those, it adds that to its own grid? So regardless of grid boundaries,
all the hill files grow over time? And I presume it takes longer to access
a larger file than it does a small one, so it is best to have the fewest,
largest hills that are still tolerable in terms of accuracy?
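
To make the question concrete, here is a minimal Python sketch of the read-and-retry behaviour described in the quoted reply further down (the reader saves its position in the other replica's file and tries again at the next update). This is not the Colvars implementation: the one-record-per-line format, the `HillsReader` class and the file name are all assumptions made for illustration.

```python
import os
import tempfile


class HillsReader:
    """Reads new records from another replica's hills file and remembers
    where it stopped, so a partially written record is retried later."""

    def __init__(self, path):
        self.path = path
        self.offset = 0  # byte position of the first unread record

    def read_new(self):
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            data = f.read()
        # Consume only up to the last newline; anything after it is a
        # record that the writer has not finished yet.
        end = data.rfind(b"\n") + 1
        self.offset += end
        return [line.decode() for line in data[:end].splitlines()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "replica_B.hills")  # hypothetical file name

    # Replica B has written one full record and half of a second one.
    with open(path, "wb") as f:
        f.write(b"1000 0.5 0.2\n2000 0.7 0.")

    reader = HillsReader(path)
    first = reader.read_new()   # only the complete record is returned

    # Replica B finishes the second record before the next update.
    with open(path, "ab") as f:
        f.write(b"2\n")
    second = reader.read_new()  # the retried record, now complete

    print(first, second)
```

In this sketch the file only ever grows, and each update reads just the part that was appended since the last one, so the cost per update depends on the number of new hills rather than on the total file size. Whether the real code behaves the same way is exactly what I'm asking.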

~Aron

On Thu, Jun 14, 2012 at 4:20 PM, Giacomo Fiorin <giacomo.fiorin_at_gmail.com> wrote:

> I see.
>
> What you said in 4) sounds indeed suspicious: are the different nodes
> sharing the same filesystem through NFS? The same problems may be
> responsible for 5), e.g. you don't get a read error but the filesystem
> doesn't catch up as often as it should. I haven't had very good
> experiences with NFS filesystems.
>
> Indeed one thing you can try is to make the hills larger and add them less
> often. A good start would be to make them 16 times larger, and add them 16
> times less often (make it 10 and 10 so as not to mess up the restart
> frequency, of course).
>
> Or even better, keep the hills as they are, but increase instead
> replicaUpdateFrequency (so you give the replicas more time to empty their
> buffers).
>
> Btw, are you getting the following warning?
>
> Warning: in metadynamics bias "metadynamics1": failed to read completely
> output files from replica "xxx" ...
>
> Also, I'd like to know your exact values of
> newHillFrequency, replicaUpdateFrequency and restartFreq.
>
>
>
> On Thu, Jun 14, 2012 at 3:56 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>
>> So more details, some of which have me quite puzzled:
>>
>> 1) The run started with 16 replicas, all initialized fine, and all hills
>> were being shared between all replicas (as judged by the output claiming to
>> have received a hill from replica X)
>>
>> 2) At the same time both replicas 4 and 16 failed due to the error
>> mentioned. Also at the same time, replica 15 failed due to a different
>> error:
>>
>> colvars: Error: cannot read from file
>> "/work/broom/ThreeFoil_Galactose/GB_3D_Production/16_Replica_Run/Run_1/Meta_Galactose_GB_Run_100ns.colvars.metadynamics1.1.hills".
>> colvars: If this error message is unclear, try recompiling with
>> -DCOLVARS_DEBUG.
>> FATAL ERROR: Error in the collective variables module: exiting.
>>
>> 3) The remaining replicas have continued since then.
>>
>> 4) I have another 16 replica simulation running in a completely different
>> folder, using different nodes, and it also had 3 failures, and they appear
>> to be at least within the same minute based on the wallclock. Maybe this
>> suggests some kind of hardware problem that occurred at that time?
>>
>> 5) The other thing I'm noticing is that hill updates from some replicas
>> that are still running seem to stop occurring for a long time, and then a
>> large chunk of them are added, with the message that X hills are close to
>> the grid boundaries and are being computed analytically. I see the reason
>> for this, but I'm wondering if perhaps that is partially to blame in all of
>> this, and I should increase my grid boundaries substantially?
>>
>> 6) One last thing to note is that I recently had a 48 replica run
>> complete without trouble, although in terms of communication, each replica
>> only needed to get half as many hills, half as often.
>>
>> ~Aron
>>
>>
>>
>> On Thu, Jun 14, 2012 at 2:48 PM, Giacomo Fiorin <giacomo.fiorin_at_gmail.com> wrote:
>>
>>> Hi Aron, indeed this is very interesting, I've never seen this low-level
>>> error message before.
>>>
>>> Internally, the code in replica A catches an internal error when it
>>> can't correctly read a new block of data from replica B because it was not
>>> completely written yet. Then, replica A moves on but saves the position in
>>> the file of replica B, and tries to read again at the next update.
>>>
>>> If the file has been written partially (i.e. it stopped in the middle of
>>> writing a number), you should get a warning that the above has happened and
>>> that the simulation continues.
>>>
>>> It looks like this is a lower-level error (you just can't get
>>> characters out of the files).
>>>
>>> How often has the error shown up? Has it occurred at all times since
>>> the beginning of the simulation, or was its occurrence only in certain
>>> periods?
>>>
>>> On Thu, Jun 14, 2012 at 2:26 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>>
>>>> I'm running multiple walker MetaDynamics, and for a few of the
>>>> replicas, after a random period of time, the run crashes with the following
>>>> error:
>>>>
>>>> terminate called after throwing an instance of 'std::ios_base::failure'
>>>> what(): basic_filebuf::underflow error reading the file
>>>> /var/spool/torque/mom_priv/jobs/4378.mon240.monk.sharcnet.SC: line 3:
>>>> 30588 Aborted
>>>> (core dumped)
>>>> ../../../../NAMD/NAMD_2.9_Linux-x86_64-multicore-CUDA/namd2 +p4 +idlepoll
>>>> +mergegrids Galactose_Meta_Run.namd
>>>>
>>>> I suspect the last two lines are rather meaningless, but I included
>>>> them for completeness. I'm not sure, but I think this results when replica
>>>> A is attempting to read the hills from replica B while replica B is adding
>>>> new hills, or alternatively when two replicas are trying to read hills from
>>>> another replica at the same time. If that is the case, then I suppose
>>>> losing some synchronization between the replicas by increasing the time
>>>> between updates might help. But I'd ideally like to avoid that, and was
>>>> wondering if maybe this is a hardware or operating system specific
>>>> problem?
>>>>
>>>> Thanks,
>>>>
>>>> ~Aron
>>>>
>>>> --
>>>> Aron Broom M.Sc
>>>> PhD Student
>>>> Department of Chemistry
>>>> University of Waterloo
>>>>
>>>>
>>>
>>
>>
>> --
>> Aron Broom M.Sc
>> PhD Student
>> Department of Chemistry
>> University of Waterloo
>>
>>
>

-- 
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo
