From: Aron Broom (broomsday_at_gmail.com)
Date: Thu Jun 14 2012 - 14:56:30 CDT
So more details, some of which have me quite puzzled:
1) The run started with 16 replicas, all initialized fine, and all hills
were being shared between all replicas (as judged by the output claiming to
have received a hill from replica X)
2) At the same time, both replicas 4 and 16 failed due to the error
mentioned. Also at the same time, replica 15 failed due to a different error:
colvars: Error: cannot read from file
colvars: If this error message is unclear, try recompiling with
FATAL ERROR: Error in the collective variables module: exiting.
3) The remaining replicas have continued since then.
4) I have another 16 replica simulation running in a completely different
folder, using different nodes, and it also had 3 failures, and they appear
to be at least within the same minute based on the wallclock. Maybe this
suggests some kind of hardware problem that occurred at that time?
5) The other thing I'm noticing is that hill updates from some replicas
that are still running seem to stop occurring for a long time, and then a
large chunk of them are added, with the message that X hills are close to
the grid boundaries and are being computed analytically. I see the reason
for this, but I'm wondering if perhaps that is partially to blame in all of
this, and I should increase my grid boundaries substantially?
6) One last thing to note is that I recently had a 48 replica run complete
without trouble, although in terms of communication, each replica only
needed to get half as many hills, half as often.
On Thu, Jun 14, 2012 at 2:48 PM, Giacomo Fiorin <giacomo.fiorin_at_gmail.com> wrote:
> Hi Aron, indeed this is very interesting, I've never seen this low-level
> error message before.
> Internally, replica A catches an internal error when it can't correctly
> read a new block of data from replica B because that block has not been
> completely written yet. Replica A then moves on, but saves its position in
> replica B's file and tries to read again at the next update.
> If the file has been written partially (i.e. it stopped in the middle of
> writing a number), you should get a warning that the above has happened and
> that the simulation continues.
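
To make the retry behaviour described above concrete, here is a minimal Python sketch of the idea. It is not the colvars implementation (which is C++); the class name `HillsReader`, the one-hill-per-line text format, and the rule "a record is complete only if it ends in a newline" are all assumptions made for illustration.

```python
class HillsReader:
    """Hypothetical incremental reader for another replica's hills file.

    Assumption: one hill per line, whitespace-separated numbers, and a
    record counts as complete only once its trailing newline is on disk.
    """

    def __init__(self, path):
        self.path = path
        self.offset = 0  # byte position of the first record not yet consumed

    def read_new(self):
        """Return the hills written since the last call.

        If the last record is only partially written, stop there and keep
        the saved offset in front of it, so the next call retries it.
        """
        hills = []
        try:
            with open(self.path, "rb") as f:
                f.seek(self.offset)
                while True:
                    line = f.readline()
                    if not line.endswith(b"\n"):
                        break  # end of file, or a record still being written
                    text = line.decode("ascii", errors="replace")
                    try:
                        hill = [float(x) for x in text.split()]
                    except ValueError:
                        # Unparsable record: do not advance past it.
                        # (A real reader would also need a way to give up.)
                        break
                    if hill:
                        hills.append(hill)
                    self.offset = f.tell()  # advance only past complete records
        except OSError:
            # Low-level read failure: keep what was parsed, retry next update.
            pass
        return hills
```

Each replica would hold one such reader per other replica and call `read_new()` at every update. The crash reported in this thread looks like the case the `except OSError` branch stands in for: the failure happens below the parsing level, so it has to be caught separately from the "incomplete record" case.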
> It looks like this is a lower-level error (you just can't get characters
> out of the files).
> How often has the error shown up? Has it occurred at all times since the
> beginning of the simulation, or was its occurrence only in certain periods?
> On Thu, Jun 14, 2012 at 2:26 PM, Aron Broom <broomsday_at_gmail.com> wrote:
>> I'm running multiple walker MetaDynamics, and for a few of the replicas,
>> after a random period of time, the run crashes with the following error:
>> terminate called after throwing an instance of 'std::ios_base::failure'
>> what(): basic_filebuf::underflow error reading the file
>> /var/spool/torque/mom_priv/jobs/4378.mon240.monk.sharcnet.SC: line 3:
>> 30588 Aborted
>> (core dumped) ../../../../NAMD/NAMD_2.9_Linux-x86_64-multicore-CUDA/namd2
>> +p4 +idlepoll +mergegrids Galactose_Meta_Run.namd
>> I suspect the last two lines are rather meaningless, but I included them
>> for completeness. I'm not sure, but I think this results when replica A is
>> attempting to read the hills from replica B while replica B is adding new
>> hills, or alternatively when two replicas are trying to read hills from
>> another replica at the same time. If that is the case, then I suppose
>> losing some synchronization between the replicas by increasing the time
>> between updates might help. But I'd ideally like to avoid that, and was
>> wondering if maybe this is a hardware or operating system specific
>> Aron Broom M.Sc
>> PhD Student
>> Department of Chemistry
>> University of Waterloo
--
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo