unusual, periodic crash in Linux FC3/GM/MPI/CHARM/NAMD

From: Dan Strahs (dstrahs_at_pace.edu)
Date: Thu Feb 09 2006 - 17:01:30 CST


NAMD recently crashed for the second time while running a 10-ns simulation.
What has raised the hackles on the back of my neck is that this crash
occurred after precisely the same number of steps as the previous crash.

Installation details:
Linux Fedora Core 3, kernel 2.6.11-1.35
GM 2.0.21 with MPICH-GM 1.2.6..14b
NAMD 2.6b1 with CHARM 5.9, built for Linux-amd64-MPI

Simulation Details:
Total time of this simulation is expected to be 10.5 ns, with a 1 fs step size.
Initial chunk was from time 0 to 0.5 ns - no problem.
2nd chunk was from time 0.5 to 2.5 ns - no problem.
3rd chunk was from time 2.5 ns to 10.5 ns - crash near 4.67931 ns (step
2179310); restarted from time 4.5 ns.
4th chunk was from time 4.5 ns to 10.5 ns - crash near 6.67931 ns (step
2179310); currently restarting from time 6.5 ns.
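
As a sanity check on the numbers above, here is a minimal sketch of the
arithmetic. It assumes the reported step number is counted from the start
of each chunk (which is what the crash times imply) and uses the 1 fs step
size. The names crash_time_fs, chunks and CRASH_STEP are made up for this
illustration and are not part of NAMD.

```python
FS_PER_NS = 1_000_000  # femtoseconds per nanosecond
TIMESTEP_FS = 1        # step size from the simulation details (1 fs)
CRASH_STEP = 2_179_310 # step number reported at both crashes


def crash_time_fs(chunk_start_fs: int, steps_into_chunk: int) -> int:
    """Simulation time (in fs) reached a given number of steps into a chunk."""
    return chunk_start_fs + steps_into_chunk * TIMESTEP_FS


# (chunk start in fs, reported crash time in fs), taken from the list above
chunks = [
    (2_500_000, 4_679_310),  # 3rd chunk: started at 2.5 ns, crashed near 4.67931 ns
    (4_500_000, 6_679_310),  # 4th chunk: started at 4.5 ns, crashed near 6.67931 ns
]

for start_fs, reported_fs in chunks:
    computed_fs = crash_time_fs(start_fs, CRASH_STEP)
    print(
        f"start {start_fs / FS_PER_NS} ns: computed crash at "
        f"{computed_fs / FS_PER_NS} ns, matches report: {computed_fs == reported_fs}"
    )
```

Integer femtoseconds are used so the comparison is exact. Under the stated
assumption, both crashes fall 2179310 steps after the start of their chunk.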

It thus appears that the crash is periodic (although there has been only
one full cycle at most).

The only error messages I've been able to locate occur at the end of
NAMD's output:

0: signal 11 received, exiting..
0: Signal sent from unknown source.
FATAL ERROR on MPI node 9 (Camper4): GM send to MPI node 0 (HolidayCamp
[00:60:dd:49:36:e5]) failed: status 17 (target port was closed) the peer
process has not started, has exited or is dead

I haven't been able to locate any other messages. Given that the crash
appears to have a period, I'm leaning towards a software issue, rather
than hardware. Any ideas where to begin?

Dan Strahs
