Strange termination - possible load balancing issues

From: Richard Swenson (swenson_at_hec.utah.edu)
Date: Tue Nov 06 2007 - 10:13:02 CST

Hi NAMD community,

NAMD crashes at random points while I am trying to minimize and
equilibrate a ~300,000-atom bilayer system on one of our local machines.
The log file simply ends after a varying number of steps, without giving
any indication of the problem. I have attached an example of the PBS
error file. When I run the same system on PSC's Big Ben, NAMD is stable,
but I find the following warning in my log file: "Warning: 1 processors
are overloaded due to high background load." We suspect that the problem
has something to do with load balancing on our local machine. Systems
that are already equilibrated run fine. Has anyone resolved this problem
before?

We are using the NAMD 2.6b3 release in conjunction with mpirun
(MPIVERSION="InfiniPath Release2.1 of Fri Jul 20 15:17:27 PDT 2007") on
our local machine, and "NAMD 2.6 for XT3" on Big Ben in conjunction with
pbsyod (a PBS wrapper for yod). I am hoping that the different behavior
is not due solely to the difference in architecture, so that we can
resolve this problem on our local machine.

Thanks for the help,

Richard


namd2:18170 terminated with signal 11 at PC=6d4344 SP=402838d0. Backtrace:
/uufs/telluride.arches/sys/pkg/namd/std/namd2(_ZN12PmeRealSpace12fill_chargesEPPdPcS2_P11PmeParticle+0x2f6)[0x6d4344]
/uufs/telluride.arches/sys/pkg/namd/std/namd2(_ZN10ComputePme6doWorkEv+0x138b)[0x5f8f55]
/uufs/telluride.arches/sys/pkg/namd/std/namd2(_ZN11WorkDistrib18messageEnqueueWorkEP7Compute+0xa1)[0x7130ab]
/uufs/telluride.arches/sys/pkg/namd/std/namd2(_ZN7Compute10patchReadyEiii+0x92)[0x4abeac]
/uufs/telluride.arches/sys/pkg/namd/std/namd2(_ZN5Patch14positionsReadyEi+0x8f8)[0x6bd2e6]
/uufs/telluride.arches/sys/pkg/namd/std/namd2(_ZN9HomePatch14positionsReadyEi+0x13d8)[0x66a9c2]
/uufs/telluride.arches/sys/pkg/namd/std/namd2(_ZN9Sequencer17runComputeObjectsEii+0xaf)[0x6edc69]
/uufs/telluride.arches/sys/pkg/namd/std/namd2(_ZN9Sequencer9integrateEv+0xdf9)[0x6f1e07]
/uufs/telluride.arches/sys/pkg/namd/std/namd2(_ZN9Sequencer9threadRunEPS_+0x856)[0x6fe350]
/uufs/telluride.arches/sys/pkg/namd/std/namd2[0x73c4f1]
/lib64/tls/libpthread.so.0[0x2a95fca137]
/lib64/tls/libc.so.6(__clone+0x73)[0x2a95e56113]
MPIRUN.tr018: 111 ranks have not yet exited 60 seconds after rank 69 (node tr040) exited without reaching MPI_Finalize().
MPIRUN.tr018: Waiting at most another 60 seconds for the remaining ranks to do a clean shutdown before terminating 111 node processes
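
The mangled C++ symbols in the backtrace above are easier to read once decoded. The following is a minimal sketch, assuming every frame of interest uses a simple Itanium-ABI nested name (`_ZN<len><name>...E`). The helper names `nested_name` and `decode_frames` are hypothetical, and argument types are deliberately ignored.

```python
import re

# Matches the mangled symbol in a frame such as "namd2(_ZN...+0x2f6)[0x6d4344]".
FRAME = re.compile(r"\((_Z\w+)\+0x[0-9a-fA-F]+\)")


def nested_name(symbol):
    """Return 'Class::method' for a simple _ZN<len><name>...E symbol.

    Assumption: only plain nested names are handled; argument types are
    ignored and any other symbol is returned unchanged.
    """
    if not symbol.startswith("_ZN"):
        return symbol
    pos, parts = 3, []
    # Each component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        parts.append(symbol[end:end + length])
        pos = end + length
    return "::".join(parts) if parts else symbol


def decode_frames(backtrace):
    """Decode every mangled C++ frame found in a backtrace string."""
    return [nested_name(m.group(1)) for m in FRAME.finditer(backtrace)]
```

Applied to the top two frames above, this yields `PmeRealSpace::fill_charges` called from `ComputePme::doWork`, so the signal 11 was raised inside the PME code path. A full demangler such as binutils' `c++filt` also recovers the argument types.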

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:45:29 CST