From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Thu Jun 14 2012 - 15:54:17 CDT
Hey, that's a very large value for restartFrequency! It's probably 2 ns,
right? At this point, I would suggest stopping the job after 1,000,000
steps, writing all the restart files, and starting a new job. The other
replicas will then be forced to re-sync and read the new state file,
which contains the complete list.
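The re-sync matters because each replica reads the others' files incrementally. As described further down in this thread, a replica that hits an incomplete block remembers its position in the other replica's file and tries again at the next update. A minimal Python sketch of that idea follows. It is not the colvars code: the file name, the one-hill-per-line record layout, and the function read_new_records are all hypothetical.

```python
import os
import tempfile


def read_new_records(path, offset):
    """Return (records, new_offset), consuming only newline-terminated lines.

    A trailing line without a newline is assumed to be still in the process
    of being written: it is left unread and retried on the next call.
    """
    records = []
    with open(path, "rb") as handle:
        handle.seek(offset)
        for raw in handle:
            if not raw.endswith(b"\n"):
                break  # partial record: stop and keep the offset before it
            records.append(raw.decode().split())
            offset += len(raw)
    return records, offset


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "replica_B.hills")  # hypothetical file name
        with open(path, "wb") as out:
            out.write(b"500 1.25 0.1\n1000 1.3")  # second hill cut off mid-number
        first, pos = read_new_records(path, 0)
        with open(path, "ab") as out:
            out.write(b"0 0.1\n")  # the writer finishes the record
        second, pos = read_new_records(path, pos)
        print(first)
        print(second)
```

The first call returns only the complete hill and leaves the offset in front of the truncated one; the second call picks that record up once it has been finished.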
Also, at 0.01 s/step you're not super-fast, so I think you can afford to
go down to 1000 and keep things more in sync.
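To put the numbers in this thread side by side, the short Python sketch below converts the step-based frequencies into wall-clock time. The 2 fs timestep is an assumption inferred from "1,000,000 steps is probably 2 ns", the function names are illustrative, and the hill count assumes that all replicas advance at the same speed.

```python
SECONDS_PER_STEP = 0.01  # wall-clock speed reported in the thread
TIMESTEP_FS = 2.0        # assumption: implied by 1,000,000 steps ~ 2 ns


def wall_seconds(steps, seconds_per_step=SECONDS_PER_STEP):
    """Wall-clock time needed to run the given number of MD steps."""
    return steps * seconds_per_step


def simulated_ns(steps, timestep_fs=TIMESTEP_FS):
    """Simulated time in nanoseconds covered by the given number of steps."""
    return steps * timestep_fs * 1e-6


def hills_read_per_update(update_freq, hill_freq, n_replicas):
    """Hills one replica receives from all other replicas at each update."""
    return (update_freq // hill_freq) * (n_replicas - 1)


if __name__ == "__main__":
    settings = [
        ("newHillFrequency", 500),
        ("replicaUpdateFrequency", 10_000),
        ("restartFrequency", 1_000_000),
    ]
    for name, steps in settings:
        print(f"{name}: every {wall_seconds(steps):.0f} s of wall clock, "
              f"{simulated_ns(steps):.3f} ns simulated")
    print("hills read per update:", hills_read_per_update(10_000, 500, 16))
```

With the values quoted below (newHillFrequency 500, replicaUpdateFrequency 10,000, 16 replicas), each update brings in about 20 hills from each of the 15 other replicas.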
Yes, if grids are enabled the grids contain all the hills: the analytical
hills are only for the event that you leave the grids' boundaries.
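A toy 1D Python sketch of that split, under the assumption that every hill is projected onto the grid and that hills close to a boundary are additionally kept as analytical Gaussians for use outside the grid. The class ToyHillBias, its parameters, and the three-sigma margin are invented for illustration; this is not how colvars is written.

```python
import math


class ToyHillBias:
    """1D toy bias: grid lookup inside [lo, hi), analytical hills outside."""

    def __init__(self, lo, hi, n_bins, width, height, margin_sigmas=3.0):
        self.lo, self.hi, self.n_bins = lo, hi, n_bins
        self.width, self.height = width, height
        self.margin = margin_sigmas * width  # hypothetical "near edge" cutoff
        self.bin_width = (hi - lo) / n_bins
        self.grid = [0.0] * n_bins           # bias sampled at bin centres
        self.off_grid = []                   # centres of hills kept analytically

    def _gaussian(self, x, centre):
        return self.height * math.exp(-0.5 * ((x - centre) / self.width) ** 2)

    def add_hill(self, centre):
        # Every hill is accumulated on the grid.
        for i in range(self.n_bins):
            x = self.lo + (i + 0.5) * self.bin_width
            self.grid[i] += self._gaussian(x, centre)
        # Hills near a boundary are also remembered in analytical form.
        if centre < self.lo + self.margin or centre > self.hi - self.margin:
            self.off_grid.append(centre)

    def energy(self, x):
        if self.lo <= x < self.hi:
            return self.grid[int((x - self.lo) / self.bin_width)]
        # Outside the grid only the analytical hills contribute.
        return sum(self._gaussian(x, c) for c in self.off_grid)


if __name__ == "__main__":
    bias = ToyHillBias(lo=0.0, hi=10.0, n_bins=100, width=0.5, height=1.0)
    bias.add_hill(5.0)  # well inside: grid only
    bias.add_hill(9.5)  # near the upper wall: grid and analytical list
    print(bias.energy(5.05), bias.energy(10.5), len(bias.off_grid))
```

In this picture, widening the grid so that the walls sit several hill widths inside the boundaries keeps the analytical list short.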
Are you using the patch that I sent you to specifically eliminate the
analytical hills if you don't need them?
On Thu, Jun 14, 2012 at 4:47 PM, Aron Broom <broomsday_at_gmail.com> wrote:
> Yes I was thinking to add larger hills less often and to also increase the
> I killed the previous runs and restarted, but in a moment of extreme
> stupidity, I didn't save the log files before clearing out the directories
> for the restart, so I'm unable to search for that line. I will not make
> that mistake again (I don't see that error happening in the logs for my 48
> replica run that completed properly).
> The values for the replicas were (I've added commas for readability):
> newHillFrequency 500
> replicaUpdateFrequency 10,000
> restartFrequency 1,000,000
> The run was progressing at 0.01 s/step, so I guess that is ~5 seconds per
> hill addition, and more importantly ~100 seconds per update (I have no
> sense of how that compares against the time needed to read a file).
> I've made the rather minor change of the newHillFrequency going to 1000
> (and increased the hill size accordingly) in order to have fewer hills
> being passed around and saved in that file, and I've increased the grid
> boundaries substantially such that there are now 10 bin widths between the
> walls and the boundaries. If this fails I will attempt your recommendation
> of increasing the value of the hillfrequency and updatefrequency.
> In terms of the filesystem, the different nodes all share the same
> filesystem. I'm not sure what filesystem it is, though; the OS is
> CentOS. I can find out about this if it is useful.
> Thanks for the suggestions, I'll continue to look for that warning as I
> check on things, hopefully it was just some random hardware glitch.
> One more question though, for the multiple walker stuff, all the hills are
> saved analytically for each walker, and then when another walker reads
> those, it adds that to its own grid? So regardless of grid boundaries,
> all the hill files grow over time? And I presume it takes longer to access
> a larger file than it does a small one, so it is best to have the fewest,
> largest hills that are still tolerable in terms of accuracy?
> On Thu, Jun 14, 2012 at 4:20 PM, Giacomo Fiorin <giacomo.fiorin_at_gmail.com> wrote:
>> I see.
>> What you said in 4) sounds indeed suspicious: are the different nodes
>> sharing the same filesystem through NFS? The same problems may be responsible
>> for 5), e.g. you don't get a read error but the filesystem doesn't catch up
>> as often as it should. I didn't have very good experience with NFS
>> Indeed one thing you can try is to make the hills larger and add them
>> less often. A good start would be to make them 16 times larger, and add
>> them 16 times less often (make it 10 and 10 so as not to mess with the restart
>> frequency, of course).
>> Or even better, keep the hills as they are, but increase instead
>> replicaUpdateFrequency (so you give the replicas more time to empty their
>> Btw, are you getting the following warning?
>> Warning: in metadynamics bias "metadynamics1": failed to read completely
>> output files from replica "xxx" ...
>> Also, I'd like to know your exact values of
>> newHillFrequency, replicaUpdateFrequency and restartFreq.
>> On Thu, Jun 14, 2012 at 3:56 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>> So more details, some of which have me quite puzzled:
>>> 1) The run started with 16 replicas, all initialized fine, and all hills
>>> were being shared between all replicas (as judged by the output claiming to
>>> have received a hill from replica X)
>>> 2) At the same time both replicas 4 and 16 failed due to the error
>>> mentioned. Also at the same time, replica 15 failed due to a different
>>> colvars: Error: cannot read from file
>>> colvars: If this error message is unclear, try recompiling with
>>> FATAL ERROR: Error in the collective variables module: exiting.
>>> 3) The remaining replicas have continued since then.
>>> 4) I have another 16 replica simulation running in a completely
>>> different folder, using different nodes, and it also had 3 failures, and
>>> they appear to be at least within the same minute based on the wallclock.
>>> Maybe this suggests some kind of hardware problem that occurred at that
>>> 5) The other thing I'm noticing is that hill updates from some replicas
>>> that are still running seem to stop occurring for a long time, and then a
>>> large chunk of them are added, with the message that X hills are close to
>>> the grid boundaries and are being computed analytically. I see the reason
>>> for this, but I'm wondering if perhaps that is partially to blame in all of
>>> this, and I should increase my grid boundaries substantially?
>>> 6) One last thing to note is that I recently had a 48 replica run
>>> complete without trouble, although in terms of communication, each replica
>>> only needed to get half as many hills, half as often.
>>> On Thu, Jun 14, 2012 at 2:48 PM, Giacomo Fiorin <
>>> giacomo.fiorin_at_gmail.com> wrote:
>>>> Hi Aron, indeed this is very interesting, I've never seen this
>>>> low-level error message yet.
>>>> Internally, replica A catches an internal error when it
>>>> can't correctly read a new block of data from replica B because it was not
>>>> completely written yet. Then, replica A moves on but saves the position in
>>>> the file of replica B, and tries to read again at the next update.
>>>> If the file has been written partially (i.e. it stopped in the middle
>>>> of writing a number), you should get a warning that the above has happened
>>>> and that the simulation continues.
>>>> It looks like this is a lower-level error (you just can't get
>>>> characters out of the files).
>>>> How often has the error shown up? Has it occurred at all times since
>>>> the beginning of the simulation, or was its occurrence only in certain
>>>> On Thu, Jun 14, 2012 at 2:26 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>>>> I'm running multiple walker MetaDynamics, and for a few of the
>>>>> replicas, after a random period of time, the run crashes with the following
>>>>> terminate called after throwing an instance of 'std::ios_base::failure'
>>>>> what(): basic_filebuf::underflow error reading the file
>>>>> /var/spool/torque/mom_priv/jobs/4378.mon240.monk.sharcnet.SC: line 3:
>>>>> 30588 Aborted
>>>>> (core dumped)
>>>>> ../../../../NAMD/NAMD_2.9_Linux-x86_64-multicore-CUDA/namd2 +p4 +idlepoll
>>>>> +mergegrids Galactose_Meta_Run.namd
>>>>> I suspect the last two lines are rather meaningless, but I included
>>>>> them for completeness. I'm not sure, but I think this results when replica
>>>>> A is attempting to read the hills from replica B while replica B is adding
>>>>> new hills, or alternatively when two replicas are trying to read hills from
>>>>> another replica at the same time. If that is the case, then I suppose
>>>>> losing some synchronization between the replicas by increasing the time
>>>>> between updates might help. But I'd ideally like to avoid that, and was
>>>>> wondering if maybe this is a hardware or operating system specific
>>>>> Aron Broom M.Sc
>>>>> PhD Student
>>>>> Department of Chemistry
>>>>> University of Waterloo