Re: unusual, periodic crash in Linux FC3/GM/MPI/CHARM/NAMD

From: Dan Strahs (dstrahs_at_pace.edu)
Date: Thu Feb 09 2006 - 20:09:02 CST

Hi Jonathan:

I think the ~360 hours really is approximate, not exactly 360.0. The
simulation writes output every 10 steps, so while the total number of steps
is approximate, it is between 2179310 and 2179320. Thus, the total time is
between 1288787 seconds and 1288793 seconds, or 357.996 and 357.998 hours.
My NAMD/CHARM reports an average TIMING every 10 steps; the number I used is
the average of those 10-step averages.
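
To make the arithmetic above easy to check, here is a minimal Python sketch.
It uses only the step and second counts quoted in this message. The per-step
figure it prints is back-computed from those totals as an illustration, not a
value copied from the NAMD log, and the variable names are illustrative.

```python
# Bounds quoted in the message above.
STEPS_LOW, STEPS_HIGH = 2179310, 2179320
SECONDS_LOW, SECONDS_HIGH = 1288787, 1288793

SECONDS_PER_HOUR = 3600


def seconds_to_hours(seconds):
    """Convert a wall-clock duration in seconds to hours."""
    return seconds / SECONDS_PER_HOUR


hours_low = seconds_to_hours(SECONDS_LOW)
hours_high = seconds_to_hours(SECONDS_HIGH)

# Assumption: back out the average time per step from the lower bounds.
implied_sec_per_step = SECONDS_LOW / STEPS_LOW

print(f"run length: {hours_low:.3f} to {hours_high:.3f} hours")
print(f"implied average: {implied_sec_per_step:.4f} s/step")
print(f"short of 360 h by about {360 - hours_high:.1f} hours")
```

Either bound lands roughly two hours short of 360, which is why a scheduler
limit of exactly 360.0 hours looks unlikely here.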

The log file ends with the previously reported MPI error messages, followed
by cleanup output from shutting down the simulation (MPI mapping info, ssh
reaper messages and such). There is nothing about the crash other than that
message and a couple of comparable messages from other nodes reporting the
failure to contact node 0.

My disks are pretty empty; / (where /tmp and /opt and the simulations are)
is only 13% in use, with ~200 Gbyte free. Thanks for asking 8~).
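
For anyone wanting to run the same check on each node, a minimal sketch using
Python's standard `shutil.disk_usage` follows. The helper name and the choice
of `/` are assumptions for illustration, and the percentage may differ
slightly from what `df` shows because `df` accounts for reserved blocks.

```python
import shutil


def percent_used(path="/"):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


usage = shutil.disk_usage("/")
print(f"/ is {percent_used('/'):.0f}% in use, "
      f"{usage.free / 1e9:.0f} GB free")
```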

Dan

On Thu, 9 Feb 2006 jonathan_at_ibt.unam.mx wrote:

> I don't know... those "~360 hours" in length might be exactly 360.0 hours. If I
> recall correctly, NAMD reports the simulation speed only a few times at the
> beginning of the output, so the 358.0 hours that show up after doing the
> numbers might have an uncertainty of a couple of hours.
>
> Does your cluster write a log file at the end of a job? When I found
> that my local cluster insisted on killing my runs I checked the logs and
> they all reported a duration of 20 hours plus/minus a few seconds, so it
> would be pretty obvious from that data.
>
> I suppose that when the second set of crash files was created the previous
> ones were still on the disk, so this wouldn't be an out-of-space issue,
> right? (Had to ask).
>
> Good luck.
>
> J. Valencia
>
> ----------------------------------------------------------------
> This message was sent from the Webmail server of the Instituto de Biotecnologia.
>
