From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Thu Jun 14 2012 - 15:20:04 CDT
I see.
What you said in 4) indeed sounds suspicious: are the different nodes
sharing the same filesystem through NFS? The same problem may be responsible
for 5), e.g. you don't get a read error, but the filesystem doesn't catch up
as often as it should. I haven't had very good experiences with NFS
filesystems.
Indeed, one thing you can try is to make the hills larger and add them less
often. A good start would be to make them 16 times larger and add them 16
times less often (make it 10 and 10 so as not to interfere with the restart
frequency, of course).
Or even better, keep the hills as they are, but instead increase
replicaUpdateFrequency (so you give the replicas more time to empty their
buffers).
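For example, the relevant part of the metadynamics block could look like the
following (the colvar name and all numbers are only placeholders; scale your
own values accordingly):

  metadynamics {
    name                    metadynamics1
    colvars                 dist      # your collective variable(s)
    hillWeight              1.0       # e.g. 10 times the old value
    newHillFrequency        1000      # e.g. 10 times the old number of steps
    multipleReplicas        on
    replicaUpdateFrequency  50000     # or raise only this one and keep
                                      # the hills unchanged
  }
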
Btw, are you getting the following warning?
Warning: in metadynamics bias "metadynamics1": failed to read completely
output files from replica "xxx" ...
Also, I'd like to know your exact values of
newHillFrequency, replicaUpdateFrequency and restartFreq.
On Thu, Jun 14, 2012 at 3:56 PM, Aron Broom <broomsday_at_gmail.com> wrote:
> So more details, some of which have me quite puzzled:
>
> 1) The run started with 16 replicas, all initialized fine, and all hills
> were being shared between all replicas (as judged by the output claiming to
> have received a hill from replica X)
>
> 2) At the same time both replicas 4 and 16 failed due to the error
> mentioned. Also at the same time, replica 15 failed due to a different
> error:
>
> colvars: Error: cannot read from file
> "/work/broom/ThreeFoil_Galactose/GB_3D_Production/16_Replica_Run/Run_1/Meta_Galactose_GB_Run_100ns.colvars.metadynamics1.1.hills".
> colvars: If this error message is unclear, try recompiling with
> -DCOLVARS_DEBUG.
> FATAL ERROR: Error in the collective variables module: exiting.
>
> 3) The remaining replicas have continued since then.
>
> 4) I have another 16 replica simulation running in a completely different
> folder, using different nodes, and it also had 3 failures, and they appear
> to be at least within the same minute based on the wallclock. Maybe this
> suggests some kind of hardware problem that occurred at that time?
>
> 5) The other thing I'm noticing is that hill updates from some replicas
> that are still running seem to stop occurring for a long time, and then a
> large chunk of them are added, with the message that X hills are close to
> the grid boundaries and are being computed analytically. I see the reason
> for this, but I'm wondering if perhaps that is partially to blame in all of
> this, and I should increase my grid boundaries substantially?
>
> 6) One last thing to note is that I recently had a 48 replica run complete
> without trouble, although in terms of communication, each replica only
> needed to get half as many hills, half as often.
>
> ~Aron
>
>
>
> On Thu, Jun 14, 2012 at 2:48 PM, Giacomo Fiorin <giacomo.fiorin_at_gmail.com> wrote:
>
>> Hi Aron, indeed this is very interesting; I've never seen this low-level
>> error message before.
>>
>> Internally, replica A catches an internal error when it can't correctly
>> read a new block of data from replica B, because it has not been completely
>> written yet. Replica A then moves on, but saves its position in replica B's
>> file and tries to read again at the next update.
>>
>> If the file has been written only partially (i.e. the writer stopped in the
>> middle of a number), you should get a warning that the above has happened
>> and that the simulation continues.
>>
>> It looks like this is a lower-level error (you just can't get characters
>> out of the files).
>>
>> How often has the error shown up? Has it been occurring ever since the
>> beginning of the simulation, or only in certain periods?
>>
>> On Thu, Jun 14, 2012 at 2:26 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>>
>>> I'm running multiple walker MetaDynamics, and for a few of the replicas,
>>> after a random period of time, the run crashes with the following error:
>>>
>>> terminate called after throwing an instance of 'std::ios_base::failure'
>>> what(): basic_filebuf::underflow error reading the file
>>> /var/spool/torque/mom_priv/jobs/4378.mon240.monk.sharcnet.SC: line 3:
>>> 30588 Aborted
>>> (core dumped)
>>> ../../../../NAMD/NAMD_2.9_Linux-x86_64-multicore-CUDA/namd2 +p4 +idlepoll
>>> +mergegrids Galactose_Meta_Run.namd
>>>
>>> I suspect the last two lines are rather meaningless, but I included them
>>> for completeness. I'm not sure, but I think this results when replica A is
>>> attempting to read the hills from replica B while replica B is adding new
>>> hills, or alternatively when two replicas are trying to read hills from
>>> another replica at the same time. If that is the case, then I suppose
>>> losing some synchronization between the replicas by increasing the time
>>> between updates might help. But I'd ideally like to avoid that, and was
>>> wondering if maybe this is a hardware or operating system specific
>>> problem?
>>>
>>> Thanks,
>>>
>>> ~Aron
>>>
>>> --
>>> Aron Broom M.Sc
>>> PhD Student
>>> Department of Chemistry
>>> University of Waterloo
>>>
>>>
>>
>
>
> --
> Aron Broom M.Sc
> PhD Student
> Department of Chemistry
> University of Waterloo
>
>