From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Thu Jun 14 2012 - 15:54:17 CDT
Hey, that's a very large value for restartFrequency!  It's probably 2 ns,
right?  At this point, I would suggest stopping a job after 1,000,000
steps, writing all the restart files, and starting a new job.  The other
replicas will then be forced to re-sync and read the new state file,
which contains the complete list.
Also, at 0.01 s/step you're not super-fast, so I think you can afford to
bring replicaUpdateFrequency down to 1000 and keep the replicas more in sync.
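
To put rough numbers on this, here is a minimal Python sketch of the arithmetic. The 0.01 s/step speed, the 16 replicas, and the frequencies are the values quoted in this thread; the function names are made up for illustration and are not part of NAMD or the colvars module.

```python
def wallclock_seconds(steps, seconds_per_step):
    """Wall-clock time needed to advance `steps` MD steps."""
    return steps * seconds_per_step


def hills_read_per_update(n_replicas, new_hill_frequency, replica_update_frequency):
    """Hills one replica has to read from all the others at a single update."""
    hills_per_replica = replica_update_frequency // new_hill_frequency
    return (n_replicas - 1) * hills_per_replica


SECONDS_PER_STEP = 0.01    # speed reported in the thread
N_REPLICAS = 16            # size of the failing run
NEW_HILL_FREQUENCY = 500   # value reported in the thread

# Compare the reported update frequency with the suggested one.
for update_frequency in (10_000, 1_000):
    seconds = wallclock_seconds(update_frequency, SECONDS_PER_STEP)
    hills = hills_read_per_update(N_REPLICAS, NEW_HILL_FREQUENCY, update_frequency)
    print(f"replicaUpdateFrequency {update_frequency}: "
          f"about {seconds:.0f} s between updates, {hills} hills read per update")
```

With the values from the thread, updating every 10,000 steps means roughly 100 s of wall-clock time between syncs and 300 hills to read each time; updating every 1,000 steps means roughly 10 s and 30 hills.
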
Yes, if grids are enabled the grids contain all the hills: the analytical
hills are only used in the event that you leave the grids' boundaries.
Are you using the patch that I sent you to specifically eliminate the
analytical hills if you don't need them?
G.
On Thu, Jun 14, 2012 at 4:47 PM, Aron Broom <broomsday_at_gmail.com> wrote:
> Yes, I was thinking of adding larger hills less often and also increasing
> the replicaUpdateFrequency.
>
> I killed the previous runs and restarted, but in a moment of extreme
> stupidity, I didn't save the log files before clearing out the directories
> for the restart, so I'm unable to search for that line.  I will not make
> that mistake again (I don't see that error happening in the logs for my 48
> replica run that completed properly).
>
> The values for the replicas were (I've added commas for readability):
>
> newHillFrequency 500
> replicaUpdateFrequency 10,000
> restartFrequency 1,000,000
>
> The run was progressing at 0.01 s/step, so I guess that is ~5 seconds per
> hill addition, and more importantly ~100 seconds per update (I have no
> sense of how that compares against the time needed to read a file).
>
> I've made the rather minor change of the newHillFrequency going to 1000
> (and increased the hill size accordingly) in order to have fewer hills
> being passed around and saved in that file, and I've increased the grid
> boundaries substantially such that there are now 10 bin widths between the
> walls and the boundaries.  If this fails I will attempt your recommendation
> of increasing the value of the hillfrequency and updatefrequency.
>
> In terms of the filesystem, the different nodes all share the same
> filesystem.  I'm not sure what filesystem it is, though; the OS is
> CentOS.  I can find out about this if it is useful.
>
> Thanks for the suggestions, I'll continue to look for that warning as I
> check on things, hopefully it was just some random hardware glitch.
>
> One more question though, for the multiple walker stuff, all the hills are
> saved analytically for each walker, and then when another walker reads
> those, it adds them to its own grid?  So regardless of grid boundaries,
> all the hill files grow over time?  And I presume it takes longer to access
> a larger file than it does a small one, so it is best to have the fewest,
> largest hills that are still tolerable in terms of accuracy?
>
> ~Aron
>
>
> On Thu, Jun 14, 2012 at 4:20 PM, Giacomo Fiorin <giacomo.fiorin_at_gmail.com> wrote:
>
>> I see.
>>
>> What you said in 4) sounds indeed suspicious: are the different nodes
>> sharing the same filesystem through NFS?  The same problem may be
>> responsible for 5), e.g. you don't get a read error but the filesystem
>> doesn't catch up as often as it should.  I haven't had very good
>> experiences with NFS filesystems.
>>
>> Indeed, one thing you can try is to make the hills larger and add them
>> less often.  A good start would be to make them 16 times larger, and add
>> them 16 times less often (make it 10 and 10 so as not to mess up the
>> restart frequency, of course).
>>
>> Or even better, keep the hills as they are, but increase instead
>> replicaUpdateFrequency (so you give the replicas more time to empty their
>> buffers).
>>
>> Btw, are you getting the following warning?
>>
>> Warning: in metadynamics bias "metadynamics1": failed to read completely
>> output files from replica "xxx" ...
>>
>> Also, I'd like to know your exact values of
>> newHillFrequency, replicaUpdateFrequency and restartFreq.
>>
>>
>>
>> On Thu, Jun 14, 2012 at 3:56 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>
>>> So more details, some of which have me quite puzzled:
>>>
>>> 1) The run started with 16 replicas, all initialized fine, and all hills
>>> were being shared between all replicas (as judged by the output claiming to
>>> have received a hill from replica X)
>>>
>>> 2) At the same time both replicas 4 and 16 failed due to the error
>>> mentioned.  Also at the same time, replica 15 failed due to a different
>>> error:
>>>
>>> colvars:   Error: cannot read from file
>>> "/work/broom/ThreeFoil_Galactose/GB_3D_Production/16_Replica_Run/Run_1/Meta_Galactose_GB_Run_100ns.colvars.metadynamics1.1.hills".
>>> colvars:   If this error message is unclear, try recompiling with
>>> -DCOLVARS_DEBUG.
>>> FATAL ERROR: Error in the collective variables module: exiting.
>>>
>>> 3) The remaining replicas have continued since then.
>>>
>>> 4) I have another 16 replica simulation running in a completely
>>> different folder, using different nodes, and it also had 3 failures, and
>>> they appear to be at least within the same minute based on the wallclock.
>>> Maybe this suggests some kind of hardware problem that occurred at that
>>> time?
>>>
>>> 5) The other thing I'm noticing is that hill updates from some replicas
>>> that are still running seem to stop occurring for a long time, and then a
>>> large chunk of them are added, with the message that X hills are close to
>>> the grid boundaries and are being computed analytically.  I see the reason
>>> for this, but I'm wondering if perhaps that is partially to blame in all of
>>> this, and I should increase my grid boundaries substantially?
>>>
>>> 6) One last thing to note is that I recently had a 48 replica run
>>> complete without trouble, although in terms of communication, each replica
>>> only needed to get half as many hills, half as often.
>>>
>>> ~Aron
>>>
>>>
>>>
>>> On Thu, Jun 14, 2012 at 2:48 PM, Giacomo Fiorin <
>>> giacomo.fiorin_at_gmail.com> wrote:
>>>
>>>> Hi Aron, indeed this is very interesting, I've never seen this
>>>> low-level error message before.
>>>>
>>>> Internally, the code for replica A catches an internal error when it
>>>> can't correctly read a new block of data from replica B because that
>>>> block was not completely written yet.  Replica A then moves on, but
>>>> saves its position in the file of replica B and tries to read again at
>>>> the next update.
>>>>
>>>> If the file has been written partially (i.e. it stopped in the middle
>>>> of writing a number), you should get a warning that the above has happened
>>>> and that the simulation continues.
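
The retry logic described above can be sketched in Python as follows. This is only an illustration of the idea, not the actual colvars code: the one-line-per-hill record format, the file name, and the function names are all assumptions made for the example.

```python
import os
import tempfile


def read_new_records(path, offset):
    """Read the complete records that follow `offset` in the file at `path`.

    Hypothetical format: one whitespace-separated record per line.  A last
    line with no trailing newline is assumed to be still in the process of
    being written, so it is skipped and the returned offset points at its
    start, which makes the next call retry it.
    """
    records = []
    with open(path, "rb") as f:
        f.seek(offset)
        for raw in f:
            if not raw.endswith(b"\n"):
                break  # incomplete record: try again at the next update
            records.append(raw.decode().split())
            offset += len(raw)
    return records, offset


def demo():
    """Simulate replica B writing while replica A reads."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "replica_B.hills")  # hypothetical file name

        # Replica B has written one full record and half of a second one.
        with open(path, "w") as f:
            f.write("500 1.25 0.10\n1000 1.3")
        first, offset = read_new_records(path, 0)

        # Replica B finishes the second record before the next update.
        with open(path, "a") as f:
            f.write("0 0.10\n")
        second, offset = read_new_records(path, offset)

    return first, second
```

Calling `demo()` returns only the first record from the first read and the completed second record from the next one, which mirrors replica A moving on and retrying at the next update.
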
>>>>
>>>> It looks like this is a lower-level error (you just can't get
>>>> characters out of the files).
>>>>
>>>> How often has the error shown up?  Has it occurred at all times since
>>>> the beginning of the simulation, or was its occurrence only in certain
>>>> periods?
>>>>
>>>> On Thu, Jun 14, 2012 at 2:26 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>>>
>>>>> I'm running multiple walker MetaDynamics, and for a few of the
>>>>> replicas, after a random period of time, the run crashes with the following
>>>>> error:
>>>>>
>>>>> terminate called after throwing an instance of 'std::ios_base::failure'
>>>>>   what():  basic_filebuf::underflow error reading the file
>>>>> /var/spool/torque/mom_priv/jobs/4378.mon240.monk.sharcnet.SC: line 3:
>>>>> 30588 Aborted
>>>>> (core dumped)
>>>>> ../../../../NAMD/NAMD_2.9_Linux-x86_64-multicore-CUDA/namd2 +p4 +idlepoll
>>>>> +mergegrids Galactose_Meta_Run.namd
>>>>>
>>>>> I suspect the last two lines are rather meaningless, but I included
>>>>> them for completeness.  I'm not sure, but I think this results when replica
>>>>> A is attempting to read the hills from replica B while replica B is adding
>>>>> new hills, or alternatively when two replicas are trying to read hills from
>>>>> another replica at the same time.  If that is the case, then I suppose
>>>>> losing some synchronization between the replicas by increasing the time
>>>>> between updates might help.  But I'd ideally like to avoid that, and was
>>>>> wondering if maybe this is a hardware or operating system specific
>>>>> problem?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> ~Aron
>>>>>
>>>>> --
>>>>> Aron Broom M.Sc
>>>>> PhD Student
>>>>> Department of Chemistry
>>>>> University of Waterloo
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Aron Broom M.Sc
>>> PhD Student
>>> Department of Chemistry
>>> University of Waterloo
>>>
>>>
>>
>
>
> --
> Aron Broom M.Sc
> PhD Student
> Department of Chemistry
> University of Waterloo
>
>