From: Dan Strahs (dstrahs_at_pace.edu)
Date: Thu Feb 09 2006 - 17:01:30 CST
NAMD crashed for the 2nd time recently while running a 10-ns simulation.
What has raised the "hackles" on the back of my neck is that this crash
occurred after precisely the same number of steps as the previous crash.
Linux Fedora Core 3, kernel 2.6.11-1.35
GM 2.0.21 with MPICH-GM 1.2.6..14b
NAMD 2.6b1 with CHARM 5.9, built for Linux-amd64-MPI
Total time of this simulation is expected to be 10.5 ns, with a 1 fs step size.
Initial chunk was from time 0 to 0.5 ns - no problem.
2nd chunk was from time 0.5 to 2.5 ns - no problem.
3rd chunk was from time 2.5 ns to 10.5 ns - crash near 4.67931 ns (step
2179310); restarted from time 4.5 ns.
4th chunk was from time 4.5 ns to 10.5 ns - crash near 6.67931 ns (step
2179310); currently restarting from time 6.5 ns.
It thus appears that the crash recurs periodically, after the same number
of steps following each restart (although there has been only one full
cycle at most).
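As a quick check of the arithmetic behind that observation, the sketch below converts the reported step count into simulation time for each restart. It assumes the step counter restarts at 0 for every chunk, which is what the identical step number in both crashes suggests; the function and constant names are illustrative only and are not part of NAMD.

```python
# Values below come from the post; names are hypothetical, for illustration only.
TIMESTEP_FS = 1.0        # 1 fs step size, as stated in the post
FS_PER_NS = 1_000_000    # femtoseconds per nanosecond
CRASH_STEP = 2179310     # step at which both crashes were reported


def crash_time_ns(restart_ns, steps_completed, timestep_fs=TIMESTEP_FS):
    """Return the absolute simulation time (ns) reached after
    `steps_completed` steps, counted from a restart at `restart_ns`.

    Assumes the step counter starts at 0 for each chunk.
    """
    return restart_ns + steps_completed * timestep_fs / FS_PER_NS


if __name__ == "__main__":
    for label, restart_ns in (("3rd chunk", 2.5), ("4th chunk", 4.5)):
        t = crash_time_ns(restart_ns, CRASH_STEP)
        print(f"{label}: restarted at {restart_ns} ns, crash near {t:.5f} ns")
```

Both chunks fail 2.17931 ns after their own starting point, so the crash tracks the step count within a run rather than the absolute simulation time.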
The only error messages I've been able to locate occur at the end of
0: signal 11 received, exiting..
0: Signal sent from unknown source.
FATAL ERROR on MPI node 9 (Camper4): GM send to MPI node 0 (HolidayCamp
[00:60:dd:49:36:e5]) failed: status 17 (target port was closed) the peer
process has not started, has exited or is dead
I haven't been able to locate any other messages. Given that the crash
appears to have a period, I'm leaning towards a software issue, rather
than hardware. Any ideas where to begin?