Re: NAMD fails on Altix

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Fri Jan 13 2006 - 02:15:56 CST

Hi,

From the stack trace, it looks like a corrupted message was received:
CWeb_Reduce() in frame #5 should never be called in NAMD when the web
performance data collection feature is not enabled.
If your Charm++ was not compiled with CMK_OPTIMIZE, run namd2 again
with the "+checksum" option, like:

mpirun -np 32 ./namd2 conf +checksum

If it is accepted, you will see at the very beginning of the screen output:

Charm++: CheckSum checking enabled!

otherwise you will see:

Charm++: +checksum ignored in optimized version!
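
In case it helps, a minimal sketch of wrapping that test so the banner is
easy to check afterwards (the config file name, processor count, and log
file name are just placeholders taken from the command above):

# save the screen output, then look for the Charm++ checksum banner
mpirun -np 32 ./namd2 conf +checksum 2>&1 | tee checksum_run.log
grep "Charm++:" checksum_run.log | head -5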

If the checksum test does not find anything, it could be a memory corruption
bug somewhere in either Charm++ or NAMD.

Gengbin

Margaret Kahn wrote:

> One of our users on the SGI Altix (Itanium processors, using MPT) is
> seeing his NAMD jobs fail after about 4 hours on 8 processors and the
> traceback from the failure is as follows:
>
> received signal SIGSEGV(11)
>
>
> MPI: --------stack traceback-------
> Internal Error: Can't read/write file "/dev/mmtimer", (errno = 22)
> MPI: Intel(R) Debugger for Itanium(R) -based Applications, Version 8.1-14, Build 20051006
> MPI: Reading symbolic information from /opt/namd-2.5/bin/namd2...done
> MPI: Attached to process id 29931 ....
> MPI: stopped at [0xa000000000010641]
> MPI: >0 0xa000000000010641
> MPI: #1 0x200000000418ccc0 in __libc_waitpid(...) in /lib/tls/libc.so.6.1
> MPI: #2 0x20000000001ba700 in MPI_SGI_stacktraceback(...) in /opt/mpt-1.12/lib/libmpi.so
> MPI: #3 0x20000000001bb3e0 in slave_sig_handler(...) in /opt/mpt-1.12/lib/libmpi.so
> MPI: #4 0xa0000000000107e0
> MPI: #5 0x40000000008e5e70 in _Z11CWeb_ReducePv(...) in /opt/namd-2.5/bin/namd2
> MPI: #6 0x4000000000892390 in CmiHandleMessage(...) in /opt/namd-2.5/bin/namd2
> MPI: #7 0x4000000000892a60 in CsdScheduleForever(...) in /opt/namd-2.5/bin/namd2
> MPI: #8 0x4000000000892960 in CsdScheduler(...) in /opt/namd-2.5/bin/namd2
> MPI: #9 0x40000000000b7ff0 in _ZN7BackEnd7suspendEv(...) in /opt/namd-2.5/bin/namd2
> MPI: #10 0x40000000004c3660 in _ZN9ScriptTcl7suspendEv(...) in /opt/namd-2.5/bin/namd2
> MPI: #11 0x40000000004c39e0 in _ZN9ScriptTcl13runControllerEi(...) in /opt/namd-2.5/bin/namd2
> MPI: #12 0x40000000004c6e50 in _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc(...) in /opt/namd-2.5/bin/namd2
> MPI: #13 0x2000000003c19d20 in TclInvokeStringCommand(...) in /usr/lib/libtcl8.4.so
> MPI: #14 0x2000000003ca0650 in TclEvalObjvInternal(...) in /usr/lib/libtcl8.4.so
> MPI: #15 0x2000000003cc9e10 in TclExecuteByteCode(...) in /usr/lib/libtcl8.4.so
> MPI: #16 0x2000000003cd4510 in TclCompEvalObj(...) in /usr/lib/libtcl8.4.so
> MPI: #17 0x2000000003ca20a0 in Tcl_EvalObjEx(...) in /usr/lib/libtcl8.4.so
> MPI: #18 0x2000000003ca8860 in Tcl_ForeachObjCmd(...) in /usr/lib/libtcl8.4.so
> MPI: #19 0x2000000003ca0650 in TclEvalObjvInternal(...) in /usr/lib/libtcl8.4.so
> MPI: #20 0x2000000003ca0eb0 in Tcl_EvalEx(...) in /usr/lib/libtcl8.4.so
> MPI: #21 0x2000000003c385f0 in Tcl_FSEvalFile(...) in /usr/lib/libtcl8.4.so
> MPI: #22 0x2000000003c7a950 in Tcl_EvalFile(...) in /usr/lib/libtcl8.4.so
> MPI: #23 0x40000000004c3330 in _ZN9ScriptTcl3runEPc(...) in /opt/namd-2.5/bin/namd2
> MPI: #24 0x40000000000ac930 in main(...) in /opt/namd-2.5/bin/namd2
> MPI: #25 0x20000000040ad850 in __libc_start_main(...) in /lib/tls/libc.so.6.1
> MPI: #26 0x40000000000a6bc0 in _start(...) in /opt/namd-2.5/bin/namd2
>
> This was first seen with namd-2.5. We have since installed namd-2.6b1 and
> built a separate tcl8.3 library, as we only had tcl8.4, but the job
> still fails at the same place.
>
> We would appreciate any suggestions as to how to set about solving
> this problem.
>
> Thanks in advance,
>
>
> Margaret Kahn
>
