Re: Buffer Underflow in Multiple Walker MetaDynamics

From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Thu Jun 14 2012 - 17:25:26 CDT

First question: no, keep the same replicaIDs, but change the value of the
main NAMD configuration outputName option for each of them. I presume
you're already doing this, otherwise the new jobs will overwrite the
previous data, and you'll lose all the previous trajectory. In short, just
set numsteps to 1000000 and don't change anything else.
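
As a rough illustration of that bookkeeping (a sketch only: the helper name
segment_plan and the "_rep/_seg" naming pattern are made up for this example;
only outputName, numsteps and replicaID are names that appear in this thread),
each restart segment keeps the replica's ID and gets a fresh output name:

```python
def segment_plan(base_name, replica_ids, n_segments, steps_per_segment=1_000_000):
    """Return one settings dict per (segment, replica).

    The replicaID stays fixed across segments; only outputName changes,
    so a restarted job never overwrites the previous segment's files.
    The "<base>_rep<ID>_seg<N>" pattern is a hypothetical convention.
    """
    plan = []
    for segment in range(1, n_segments + 1):
        for replica_id in replica_ids:
            plan.append({
                "replicaID": replica_id,
                "outputName": f"{base_name}_rep{replica_id}_seg{segment}",
                "numsteps": steps_per_segment,
            })
    return plan


if __name__ == "__main__":
    # 16 replicas, two consecutive segments of 1,000,000 steps each
    for entry in segment_plan("meta_run", range(1, 17), n_segments=2):
        print(entry["replicaID"], entry["outputName"], entry["numsteps"])
```

So with 16 replicas the IDs stay 1 through 16 in every round; they do not
continue on to 17 through 32.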

Yes about the ratio between replicaUpdateFrequency and newHillFrequency.
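
To make the ratio concrete, here is a small back-of-the-envelope calculation
using the numbers quoted in this thread (newHillFrequency 500,
replicaUpdateFrequency 10,000 or 1000, and 0.01 s/step). The function name is
just for illustration:

```python
def update_stats(new_hill_frequency, replica_update_frequency, seconds_per_step):
    """Hills added and wall-clock time elapsed between two replica updates."""
    return {
        "hills_per_update": replica_update_frequency / new_hill_frequency,
        "seconds_per_hill": new_hill_frequency * seconds_per_step,
        "seconds_per_update": replica_update_frequency * seconds_per_step,
    }


if __name__ == "__main__":
    # Settings originally reported in the thread
    print(update_stats(500, 10_000, 0.01))
    # Settings with the reduced replicaUpdateFrequency
    print(update_stats(500, 1_000, 0.01))
```

With the reduced value, each replica adds 2 hills between updates, roughly
every 10 seconds of wall-clock time, versus 20 hills roughly every 100 seconds
before.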

I strongly recommend recompiling, for reasons that have nothing to do with
the collective variables module. You're using a CUDA version, for which
two components (the CUDA library itself and the NAMD support) are changing
rather fast. Of course, 2.9 came out recently, so I don't know how big the
changes in CUDA are in the CVS version (I'm sure others can fill in here).
Compiling NAMD is a bit tough to learn in the beginning, but it gets
easier, not harder, with every version, so the investment in time pays
off big time.

G.

On Thu, Jun 14, 2012 at 6:12 PM, Aron Broom <broomsday_at_gmail.com> wrote:

> Yes, the restart frequency is 2ns, I just had it as something rather large
> because the jobs had generally not been failing.
>
> In terms of your suggestion about 1,000,000 steps, you mean that a new job
> with a new output name should be started at that point? And then it will
> be added into the Replica file automatically and start communicating with
> everything else (and read in all the hills from the previous set of jobs)?
> If I have 16 replicas, does that mean that for this new set of jobs with
> the new output names, I'll want the replicaID to go from 17 through 32, and
> then keep incrementing for the next round of restarts?
>
> I'll certainly reduce the replicaUpdateFrequency, as I was concerned about
> that. Also, just so that I'm not confused, if the newHillFrequency is 500
> and the replicaUpdateFrequency is 1000, it updates every 2 hills?
>
> I hadn't yet compiled a version of NAMD with that patch as I had hit a
> mental roadblock in terms of compiling (have always just used the
> precompiled binaries). I have now started down the compiling path, but it
> seems there are many places things can go wrong, but I appreciate that
> having those hard boundaries would make things much better. Hopefully I'll
> have it compiled in time to do a run if this current one suffers from the
> same problems.
>
> Thanks for all the suggestions.
>
> ~Aron
>
> On Thu, Jun 14, 2012 at 4:54 PM, Giacomo Fiorin <giacomo.fiorin_at_gmail.com> wrote:
>
>> Hey, that's a very large value for restartFrequency! It's probably 2 ns,
>> right? At this point, I would suggest stopping a job after 1,000,000
>> steps, writing all the restart files, and starting a new job. The other
>> replicas will then be forced to do a re-sync and read the new state file,
>> which contains the complete list.
>>
>> Also, at 0.01 s/step you're not super-fast, I think you can afford to
>> bring replicaUpdateFrequency down to 1000 and keep more in sync.
>>
>> Yes, if grids are enabled the grids contain all the hills: the analytical
>> hills are only for the event that you leave the grids' boundaries.
>>
>> Are you using the patch that I sent you to specifically eliminate the
>> analytical hills if you don't need them?
>>
>> G.
>>
>>
>> On Thu, Jun 14, 2012 at 4:47 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>
>>> Yes I was thinking to add larger hills less often and to also increase
>>> the replicaUpdateFrequency.
>>>
>>> I killed the previous runs and restarted, but in a moment of extreme
>>> stupidity, I didn't save the log files before clearing out the directories
>>> for the restart, so I'm unable to search for that line. I will not make
>>> that mistake again (I don't see that error happening in the logs for my 48
>>> replica run that completed properly).
>>>
>>> The values for the replicas were (I've added commas for readability):
>>>
>>> newHillFrequency 500
>>> replicaUpdateFrequency 10,000
>>> restartFrequency 1,000,000
>>>
>>> The run was progressing at 0.01 s/step, so I guess that is ~5 seconds
>>> per hill addition, and more importantly ~100 seconds per update (I have no
>>> sense of how that compares against the time needed to read a file).
>>>
>>> I've made the rather minor change of the newHillFrequency going to 1000
>>> (and increased the hill size accordingly) in order to have fewer hills
>>> being passed around and saved in that file, and I've increased the grid
>>> boundaries substantially such that there are now 10 bin widths between the
>>> walls and the boundaries. If this fails I will attempt your recommendation
>>> of increasing the value of the hillfrequency and updatefrequency.
>>>
>>> In terms of the filesystem, the different nodes all share the same
>>> filesystem. I'm not sure what filesystem it is, though; the OS is
>>> CentOS. I can find out about this if it is useful.
>>>
>>> Thanks for the suggestions, I'll continue to look for that warning as I
>>> check on things, hopefully it was just some random hardware glitch.
>>>
>>> One more question though, for the multiple walker stuff, all the hills
>>> are saved analytically for each walker, and then when another walker reads
>>> those, it adds that to its own grid? So regardless of grid boundaries,
>>> all the hill files grow over time? And I presume it takes longer to access
>>> a larger file than it does a small one, so it is best to have the fewest,
>>> largest hills that are still tolerable in terms of accuracy?
>>>
>>> ~Aron
>>>
>>>
>>> On Thu, Jun 14, 2012 at 4:20 PM, Giacomo Fiorin <
>>> giacomo.fiorin_at_gmail.com> wrote:
>>>
>>>> I see.
>>>>
>>>> What you said in 4) sounds indeed suspicious: are the different nodes
>>>> sharing the same filesystem through NFS? The same problems may be responsible
>>>> for 5), e.g. you don't get a read error but the filesystem doesn't catch up
>>>> as often as it should. I didn't have very good experience with NFS
>>>> filesystems.
>>>>
>>>> Indeed one thing you can try is to make the hills larger and add them
>>>> less often. A good start would be to make them 16 times larger, and add
>>>> them 16 times less often (make it 10 and 10 to not mess up with the restart
>>>> frequency, of course).
>>>>
>>>> Or even better, keep the hills as they are, but increase instead
>>>> replicaUpdateFrequency (so you give the replicas more time to empty their
>>>> buffers).
>>>>
>>>> Btw, are you getting the following warning?
>>>>
>>>> Warning: in metadynamics bias "metadynamics1": failed to read
>>>> completely output files from replica "xxx" ...
>>>>
>>>> Also, I'd like to know your exact values of
>>>> newHillFrequency, replicaUpdateFrequency and restartFreq.
>>>>
>>>>
>>>>
>>>> On Thu, Jun 14, 2012 at 3:56 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>>>
>>>>> So more details, some of which have me quite puzzled:
>>>>>
>>>>> 1) The run started with 16 replicas, all initialized fine, and all
>>>>> hills were being shared between all replicas (as judged by the output
>>>>> claiming to have received a hill from replica X)
>>>>>
>>>>> 2) At the same time both replicas 4 and 16 failed due to the error
>>>>> mentioned. Also at the same time, replica 15 failed due to a different
>>>>> error:
>>>>>
>>>>> colvars: Error: cannot read from file
>>>>> "/work/broom/ThreeFoil_Galactose/GB_3D_Production/16_Replica_Run/Run_1/Meta_Galactose_GB_Run_100ns.colvars.metadynamics1.1.hills".
>>>>> colvars: If this error message is unclear, try recompiling with
>>>>> -DCOLVARS_DEBUG.
>>>>> FATAL ERROR: Error in the collective variables module: exiting.
>>>>>
>>>>> 3) The remaining replicas have continued since then.
>>>>>
>>>>> 4) I have another 16 replica simulation running in a completely
>>>>> different folder, using different nodes, and it also had 3 failures, and
>>>>> they appear to be at least within the same minute based on the wallclock.
>>>>> Maybe this suggests some kind of hardware problem that occurred at that
>>>>> time?
>>>>>
>>>>> 5) The other thing I'm noticing is that hill updates from some
>>>>> replicas that are still running seem to stop occurring for a long time, and
>>>>> then a large chunk of them are added, with the message that X hills are
>>>>> close to the grid boundaries and are being computed analytically. I see
>>>>> the reason for this, but I'm wondering if perhaps that is partially to
>>>>> blame in all of this, and I should increase my grid boundaries
>>>>> substantially?
>>>>>
>>>>> 6) One last thing to note is that I recently had a 48 replica run
>>>>> complete without trouble, although in terms of communication, each replica
>>>>> only needed to get half as many hills, half as often.
>>>>>
>>>>> ~Aron
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jun 14, 2012 at 2:48 PM, Giacomo Fiorin <
>>>>> giacomo.fiorin_at_gmail.com> wrote:
>>>>>
>>>>>> Hi Aron, indeed this is very interesting, I've never seen this
>>>>>> low-level error message yet.
>>>>>>
>>>>>> Internally, the code for replica A catches an internal error when it
>>>>>> can't correctly read a new block of data from replica B because it was not
>>>>>> completely written yet. Then, replica A moves on but saves the position in
>>>>>> the file of replica B, and tries to read again at the next update.
>>>>>>
>>>>>> If the file has been written partially (i.e. it stopped in the middle
>>>>>> of writing a number), you should get a warning that the above has happened
>>>>>> and that the simulation continues.
>>>>>>
>>>>>> It looks like this is a lower-level error (you just can't get
>>>>>> characters out of the files).
>>>>>>
>>>>>> How often has the error shown up? Has it occurred at all times since
>>>>>> the beginning of the simulation, or was its occurrence only in certain
>>>>>> periods?
>>>>>>
>>>>>> On Thu, Jun 14, 2012 at 2:26 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>>>>>
>>>>>>> I'm running multiple walker MetaDynamics, and for a few of the
>>>>>>> replicas, after a random period of time, the run crashes with the following
>>>>>>> error:
>>>>>>>
>>>>>>> terminate called after throwing an instance of
>>>>>>> 'std::ios_base::failure'
>>>>>>> what(): basic_filebuf::underflow error reading the file
>>>>>>> /var/spool/torque/mom_priv/jobs/4378.mon240.monk.sharcnet.SC: line
>>>>>>> 3: 30588 Aborted
>>>>>>> (core dumped)
>>>>>>> ../../../../NAMD/NAMD_2.9_Linux-x86_64-multicore-CUDA/namd2 +p4 +idlepoll
>>>>>>> +mergegrids Galactose_Meta_Run.namd
>>>>>>>
>>>>>>> I suspect the last two lines are rather meaningless, but I included
>>>>>>> them for completeness. I'm not sure, but I think this results when replica
>>>>>>> A is attempting to read the hills from replica B while replica B is adding
>>>>>>> new hills, or alternatively when two replicas are trying to read hills from
>>>>>>> another replica at the same time. If that is the case, then I suppose
>>>>>>> losing some synchronization between the replicas by increasing the time
>>>>>>> between updates might help. But I'd ideally like to avoid that, and was
>>>>>>> wondering if maybe this is a hardware or operating system specific
>>>>>>> problem?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> ~Aron
>>>>>>>
>>>>>>> --
>>>>>>> Aron Broom M.Sc
>>>>>>> PhD Student
>>>>>>> Department of Chemistry
>>>>>>> University of Waterloo
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Aron Broom M.Sc
>>>>> PhD Student
>>>>> Department of Chemistry
>>>>> University of Waterloo
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Aron Broom M.Sc
>>> PhD Student
>>> Department of Chemistry
>>> University of Waterloo
>>>
>>>
>>
>
>
> --
> Aron Broom M.Sc
> PhD Student
> Department of Chemistry
> University of Waterloo
>
>

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:22:08 CST