From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Thu Jun 14 2012 - 13:48:07 CDT
Hi Aron, this is indeed very interesting: I haven't seen this low-level
error message before.
Internally, the code running replica A catches an internal error when it
can't correctly read a new block of data from replica B because the block
has not been completely written yet. Replica A then moves on, but saves its
position in replica B's file and tries to read again at the next update.
If the file has only been partially written (i.e. the write stopped in the
middle of a number), you should get a warning saying that this has happened
and that the simulation is continuing.
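
To make the idea concrete, here is a minimal sketch of that "save the position and retry at the next update" logic. It is written in Python for brevity and is not the actual NAMD code: the function name read_new_records, the one-record-per-line format, and the return values are all assumptions made for illustration.

```python
def read_new_records(path, pos):
    """Read complete records appended to `path` since byte offset `pos`.

    Hypothetical illustration: each record is one line of whitespace-separated
    numbers. Returns (records, new_pos, incomplete), where `new_pos` is the
    offset to resume from at the next update.
    """
    records = []
    incomplete = False
    with open(path, "rb") as f:
        f.seek(pos)
        while True:
            start = f.tell()
            line = f.readline()
            if not line:
                break  # reached the end of what has been written so far
            if not line.endswith(b"\n"):
                # The writer stopped mid-record: remember where this record
                # starts so it can be re-read in full at the next update.
                incomplete = True
                pos = start
                break
            fields = line.split()
            try:
                values = [float(x) for x in fields]
            except ValueError:
                # Unparseable record: treat it like an incomplete one.
                incomplete = True
                pos = start
                break
            if values:
                records.append(values)
            pos = f.tell()  # everything up to here has been consumed
    return records, pos, incomplete
```

The caller would keep the returned position for each other replica's file, print a warning when incomplete is true, and simply call the function again at the next update.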
What you are seeing instead looks like a lower-level error: the characters
can't be read from the file at all.
How often has the error shown up? Has it occurred throughout the simulation
since the beginning, or only during certain periods?
On Thu, Jun 14, 2012 at 2:26 PM, Aron Broom <broomsday_at_gmail.com> wrote:
> I'm running multiple walker MetaDynamics, and for a few of the replicas,
> after a random period of time, the run crashes with the following error:
>
> terminate called after throwing an instance of 'std::ios_base::failure'
> what(): basic_filebuf::underflow error reading the file
> /var/spool/torque/mom_priv/jobs/4378.mon240.monk.sharcnet.SC: line 3:
> 30588 Aborted
> (core dumped) ../../../../NAMD/NAMD_2.9_Linux-x86_64-multicore-CUDA/namd2
> +p4 +idlepoll +mergegrids Galactose_Meta_Run.namd
>
> I suspect the last two lines are rather meaningless, but I included them
> for completeness. I'm not sure, but I think this results when replica A is
> attempting to read the hills from replica B while replica B is adding new
> hills, or alternatively when two replicas are trying to read hills from
> another replica at the same time. If that is the case, then I suppose
> losing some synchronization between the replicas by increasing the time
> between updates might help. But I'd ideally like to avoid that, and was
> wondering if maybe this is a hardware or operating system specific
> problem?
>
> Thanks,
>
> ~Aron
>
> --
> Aron Broom M.Sc
> PhD Student
> Department of Chemistry
> University of Waterloo
>
>
This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:21:39 CST