Re: namd crash: Signal: segmentation violation

From: Brian Bennion (bennion1_at_llnl.gov)
Date: Thu May 17 2007 - 17:06:15 CDT

Hi Leandro,
I am still a bit confused about why
/lib/libc.so.6 is being called rather than /lib64/libc.so.6.

Maybe it doesn't matter. Jim Phillips might know.

Can you post the output from your build of
NAMD? I just want to see which libraries are being chosen.
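
One way to see which libraries the binary resolves at run time is to run ldd on it. Here is a minimal Python sketch, assuming ldd is installed and prints the usual "name => path (address)" lines. The helper names (parse_ldd, libs_outside_lib64, ldd) are made up for illustration and are not part of NAMD or Charm++.

```python
import subprocess


def parse_ldd(text):
    """Map each library name to its resolved path from ldd-style output."""
    libs = {}
    for line in text.splitlines():
        line = line.strip()
        if "=>" not in line:
            continue  # entries without "=>" carry no name-to-path mapping
        name, _, rest = line.partition("=>")
        fields = rest.split()
        # Keep only real paths; "not found" or a bare address is skipped.
        if fields and fields[0].startswith("/"):
            libs[name.strip()] = fields[0]
    return libs


def libs_outside_lib64(libs):
    """Return the libraries whose resolved path is not under a lib64 directory."""
    return {n: p for n, p in libs.items() if "/lib64/" not in p}


def ldd(binary):
    """Run ldd on a binary (assumes ldd is on PATH) and parse its output."""
    out = subprocess.run(["ldd", binary], capture_output=True,
                         text=True, check=True).stdout
    return parse_ldd(out)
```

Calling libs_outside_lib64(ldd("./namd2")) would list anything resolved outside lib64. Treat the result only as a hint: on some distributions /lib itself holds the 64-bit libraries.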

When you say one node, do you mean that the NAMD
job is running on a single computer with more
than one CPU? If so, then the networking
problem is probably a red herring.
Would you mind sending the appropriate files so I can test on our Opteron clusters?

The debugger could be gdb if you are running on one node.
We typically use TotalView.
There are others that might be friendlier.

I would also suggest a NAMD build with the -g
switch turned on to collect more debugging information.
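
With a -g build, the raw addresses in a traceback can be mapped back to source lines using addr2line from binutils. Below is a rough Python sketch, assuming the traceback has been saved as plain text with any ">" quote markers stripped, and that the executable is loaded at low addresses as in the traces below. The names parse_frames and addr2line_command and the max_addr cutoff are assumptions for illustration only.

```python
import re

# One traceback frame: "[n] <symbol+offset or module path> [0xADDRESS]"
FRAME = re.compile(r"\[(\d+)\]\s+(\S+)\s+\[(0x[0-9a-fA-F]+)\]")


def parse_frames(trace):
    """Return (frame number, location, address) for each frame in the text."""
    return [(int(n), loc, addr) for n, loc, addr in FRAME.findall(trace)]


def addr2line_command(binary, frames, max_addr=0x10000000):
    """Build an addr2line call for the frames that lie inside the executable.

    Addresses at or above max_addr are assumed to belong to shared
    libraries (such as the 0x2b... frames) and are left out.
    """
    addrs = [a for _, _, a in frames if int(a, 16) < max_addr]
    # -e names the executable, -f prints function names, -C demangles C++.
    return ["addr2line", "-e", binary, "-f", "-C"] + addrs
```

The resulting list can be passed to subprocess.run. File and line numbers will only show up if the binary actually carries debug information.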

At 01:39 PM 5/17/2007, Leandro Martínez wrote:
>Hi Brian, thank you very much for the answer. The stack traces I sent
>before were not obtained on the same hardware. The ones that I'm
>sending below are all from the same mini-cluster (one node), but
>from different namd binaries, which I specify. As you will see, they are
>the same kind of error, but each one has a different stack trace.
>
>I also think there is some corruption of the
>data in transit, but I have no idea how to track or solve it. In some
>of our previous attempts we obtained errors in which "an atom moving too
>fast was detected", but the velocity was absurd in only one of the three
>components, thus suggesting that it was corrupted data rather than
>an actual simulation issue. In our present configuration we have not
>seen this problem anymore, only these "libc.so.6" issues.
>
>We have already changed the network cards once, and this problem
>appeared in two different clusters, so I wouldn't really bet that it is
>a hardware problem.
>
>Can you suggest a debugger? I'm not familiar with those.
>I have run the simulation using ++debug but I couldn't get any
>meaningful information.
>
>Thanks,
>Leandro.
>
>These are the stack traces of three runs in the same cluster, same
>nodes, different binaries:
>
>1) Using home-compiled namd binary with no fftw:
>
>ENERGY: 10600 19987.4183 13118.5784 1321.5175
>124.6337 -252623.9701 24015.9942 0.0000
>0.0000 52674.4657 -141381.3625 296.7927
>-140902.1798 -140903.3821 298.0487 -1714.5778
>-1687.4130 636056.0000 -1638.2391 -1637.8758
>
>------------- Processor 0 Exiting: Caught Signal ------------
>Signal: segmentation violation
>Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>Stack Traceback:
> [0] /lib/libc.so.6 [0x2b770d18a5c0]
> [1] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE10loadTuplesEv+0x597 [0x4fc3b7]
> [2] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE6doWorkEv+0x12ab [0x50a78b]
> [3] _ZN19CkIndex_WorkDistrib31_call_enqueueBonds_LocalWorkMsgEPvP11WorkDistrib+0xd [0x6986bd]
> [4] CkDeliverMessageFree+0x30 [0x6eb1c0]
> [5] /home/lmartinez/namd-nofftw/./namd2 [0x6eb21f]
> [6] /home/lmartinez/namd-nofftw/./namd2 [0x6ec501]
> [7] /home/lmartinez/namd-nofftw/./namd2 [0x6eea9a]
> [8] /home/lmartinez/namd-nofftw/./namd2 [0x6eed48]
> [9] _Z15_processHandlerPvP11CkCoreState+0x130 [0x6efd68]
> [10] CmiHandleMessage+0xa5 [0x756e11]
> [11] CsdScheduleForever+0x75 [0x7571d2]
> [12] CsdScheduler+0x16 [0x757135]
> [13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x153 [0x676ec3]
> [14] TclInvokeStringCommand+0x64 [0x2b770cb626a4]
> [15] TclEvalObjvInternal+0x1aa [0x2b770cb63d9a]
> [16] Tcl_EvalEx+0x397 [0x2b770cb64367]
> [17] Tcl_FSEvalFile+0x1ed [0x2b770cba5f2d]
> [18] Tcl_EvalFile+0x2e [0x2b770cba5fee]
> [19] _ZN9ScriptTcl3runEPc+0x24 [0x677104]
> [20] main+0x201 [0x4c2671]
> [21] __libc_start_main+0xf4 [0x2b770d178134]
> [22] __gxx_personality_v0+0x109 [0x4bf739]
>Fatal error on PE 0> segmentation violation
>
>2) Using provided binary amd64-TCP:
>
>ENERGY: 9800 20072.8949 13201.1162 1343.0570
>131.8154 -253024.4316 24209.4947 0.0000
>0.0000 53016.0637 -141049.9898 298.7174
>-140572.1009 -140570.0247 299.2188 -1871.3003
>-1682.6774 636056.0000 -1712.7694 -1711.5329
>
>------------- Processor 0 Exiting: Caught Signal ------------
>Signal: segmentation violation
>Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>Stack Traceback:
> [0] /lib/libc.so.6 [0x2ae403d345c0]
> [1] _int_malloc+0xb6 [0x778758]
> [2] mm_malloc+0x53 [0x7785f7]
> [3] malloc+0x16 [0x77c5fc]
> [4] _Znwm+0x1d [0x2ae403bc0b4d]
> [5] _ZN11ResizeArrayI6VectorEC1Ev+0x28 [0x6767e0]
> [6] __cxa_vec_ctor+0x46 [0x2ae403bc22a6]
> [7] _ZN14ProxyResultMsg6unpackEPv+0x62 [0x6e83ca]
> [8] _Z15CkUnpackMessagePP8envelope+0x28 [0x787036]
> [9] _Z15_processHandlerPvP11CkCoreState+0x412 [0x785db2]
> [10] CsdScheduleForever+0xa2 [0x7f2492]
> [11] CsdScheduler+0x1c [0x7f2090]
> [12] _ZN7BackEnd7suspendEv+0xb [0x4ba881]
> [13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fbfe0]
> [14] TclInvokeStringCommand+0x91 [0x80d9b8]
> [15] /home/lmartinez/teste-namd/./namd2 [0x843808]
> [16] Tcl_EvalEx+0x176 [0x843e4b]
> [17] Tcl_EvalFile+0x134 [0x83b854]
> [18] _ZN9ScriptTcl3runEPc+0x14 [0x6fb71e]
> [19] main+0x21b [0x4b6743]
> [20] __libc_start_main+0xf4 [0x2ae403d22134]
> [21] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b28da]
>Fatal error on PE 0> segmentation violation
>
>3) Using provided binary amd64 (no-TCP):
>
>------------- Processor 0 Exiting: Caught Signal ------------
>Signal: segmentation violation
>Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>Stack Traceback:
> [0] /lib/libc.so.6 [0x2af45d3b35c0]
> [1] _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded+0x29f0 [0x5377b0]
> [2] _ZN20ComputeNonbondedPair7doForceEPP8CompAtomPP7Results+0x5da [0x52fc9e]
> [3] _ZN16ComputePatchPair6doWorkEv+0x85 [0x615dd1]
> [4] _ZN11WorkDistrib12enqueueWorkBEP12LocalWorkMsg+0x16 [0x728d16]
> [5] _ZN19CkIndex_WorkDistrib31_call_enqueueWorkB_LocalWorkMsgEPvP11WorkDistrib+0xf [0x728cfd]
> [6] CkDeliverMessageFree+0x21 [0x786a6b]
> [7] _Z15_processHandlerPvP11CkCoreState+0x455 [0x786075]
> [8] CsdScheduleForever+0xa2 [0x7f18a2]
> [9] CsdScheduler+0x1c [0x7f14a0]
> [10] _ZN7BackEnd7suspendEv+0xb [0x4bab01]
> [11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fc260]
> [12] TclInvokeStringCommand+0x91 [0x80cc78]
> [13] /home/lmartinez/teste-namd/./NAMD_2.6_Linux-amd64/namd2 [0x842ac8]
> [14] Tcl_EvalEx+0x176 [0x84310b]
> [15] Tcl_EvalFile+0x134 [0x83ab14]
> [16] _ZN9ScriptTcl3runEPc+0x14 [0x6fb99e]
> [17] main+0x21b [0x4b69c3]
> [18] __libc_start_main+0xf4 [0x2af45d3a1134]
> [19] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b2b5a]
>Fatal error on PE 0> segmentation violation
