Re: namd crash: Signal: segmentation violation

From: Leandro Martínez (leandromartinez98_at_gmail.com)
Date: Thu May 17 2007 - 15:39:56 CDT

Hi Brian, thank you very much for the answer. The stack traces I sent
before were not obtained on the same hardware. The ones I'm sending
below all come from the same mini-cluster (one node), but from
different namd binaries, which I specify. As you will see, they are the
same kind of error, but each one has a different stack trace.

I also think there is some corruption of the data in transit, but I
have no idea how to track or solve it. In some of our previous attempts
we got errors reporting that an atom moving too fast was detected, but
the velocity was absurd in only one of the three components, suggesting
corrupted data rather than an actual simulation issue. In our present
configuration we no longer see that problem, only these "libc.so.6"
issues.
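
To make that reasoning concrete, here is a minimal sketch of the check I
have in mind. It is not NAMD code: the function names and the speed
limit are assumptions chosen only for illustration. The idea is that a
truly unstable atom usually has a large velocity overall, while a single
out-of-range component next to two sane ones looks more like a damaged
value.

```python
import math


def suspicious_components(velocity, limit):
    """Return the indices of components that are non-finite or exceed `limit`.

    `velocity` is an (x, y, z) sequence; `limit` is a hypothetical sanity
    threshold in the same units as the velocity.
    """
    return [
        i
        for i, v in enumerate(velocity)
        if not math.isfinite(v) or abs(v) > limit
    ]


def looks_like_corruption(velocity, limit):
    """True when exactly one component is out of range and the rest are sane."""
    return len(suspicious_components(velocity, limit)) == 1


# Example with made-up numbers: only the z component is absurd.
print(looks_like_corruption((1.2, -0.8, 3.4e12), limit=100.0))
```

This is only a heuristic: it cannot prove corruption, it just separates
the "one bad component" pattern from a velocity that is large in every
direction.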

We have already replaced the network cards once, and this problem has
appeared on two different clusters, so I wouldn't really bet that it is
a hardware problem.

Can you suggest a debugger? I'm not familiar with any.
I have run the simulation using ++debug, but I couldn't get any
meaningful information.

Thanks,
Leandro.

These are the stack traces of three runs in the same cluster, same
nodes, different binaries:

1) Using home-compiled namd binary with no fftw:

ENERGY: 10600 19987.4183 13118.5784 1321.5175
124.6337 -252623.9701 24015.9942 0.0000
0.0000 52674.4657 -141381.3625 296.7927
-140902.1798 -140903.3821 298.0487 -1714.5778
-1687.4130 636056.0000 -1638.2391 -1637.8758

------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
Stack Traceback:
  [0] /lib/libc.so.6 [0x2b770d18a5c0]
  [1] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE10loadTuplesEv+0x597 [0x4fc3b7]
  [2] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE6doWorkEv+0x12ab [0x50a78b]
  [3] _ZN19CkIndex_WorkDistrib31_call_enqueueBonds_LocalWorkMsgEPvP11WorkDistrib+0xd [0x6986bd]
  [4] CkDeliverMessageFree+0x30 [0x6eb1c0]
  [5] /home/lmartinez/namd-nofftw/./namd2 [0x6eb21f]
  [6] /home/lmartinez/namd-nofftw/./namd2 [0x6ec501]
  [7] /home/lmartinez/namd-nofftw/./namd2 [0x6eea9a]
  [8] /home/lmartinez/namd-nofftw/./namd2 [0x6eed48]
  [9] _Z15_processHandlerPvP11CkCoreState+0x130 [0x6efd68]
  [10] CmiHandleMessage+0xa5 [0x756e11]
  [11] CsdScheduleForever+0x75 [0x7571d2]
  [12] CsdScheduler+0x16 [0x757135]
  [13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x153 [0x676ec3]
  [14] TclInvokeStringCommand+0x64 [0x2b770cb626a4]
  [15] TclEvalObjvInternal+0x1aa [0x2b770cb63d9a]
  [16] Tcl_EvalEx+0x397 [0x2b770cb64367]
  [17] Tcl_FSEvalFile+0x1ed [0x2b770cba5f2d]
  [18] Tcl_EvalFile+0x2e [0x2b770cba5fee]
  [19] _ZN9ScriptTcl3runEPc+0x24 [0x677104]
  [20] main+0x201 [0x4c2671]
  [21] __libc_start_main+0xf4 [0x2b770d178134]
  [22] __gxx_personality_v0+0x109 [0x4bf739]
Fatal error on PE 0> segmentation violation

2) Using provided binary amd64-TCP:

ENERGY: 9800 20072.8949 13201.1162 1343.0570
131.8154 -253024.4316 24209.4947 0.0000
0.0000 53016.0637 -141049.9898 298.7174
-140572.1009 -140570.0247 299.2188 -1871.3003
-1682.6774 636056.0000 -1712.7694 -1711.5329

------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
Stack Traceback:
  [0] /lib/libc.so.6 [0x2ae403d345c0]
  [1] _int_malloc+0xb6 [0x778758]
  [2] mm_malloc+0x53 [0x7785f7]
  [3] malloc+0x16 [0x77c5fc]
  [4] _Znwm+0x1d [0x2ae403bc0b4d]
  [5] _ZN11ResizeArrayI6VectorEC1Ev+0x28 [0x6767e0]
  [6] __cxa_vec_ctor+0x46 [0x2ae403bc22a6]
  [7] _ZN14ProxyResultMsg6unpackEPv+0x62 [0x6e83ca]
  [8] _Z15CkUnpackMessagePP8envelope+0x28 [0x787036]
  [9] _Z15_processHandlerPvP11CkCoreState+0x412 [0x785db2]
  [10] CsdScheduleForever+0xa2 [0x7f2492]
  [11] CsdScheduler+0x1c [0x7f2090]
  [12] _ZN7BackEnd7suspendEv+0xb [0x4ba881]
  [13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fbfe0]
  [14] TclInvokeStringCommand+0x91 [0x80d9b8]
  [15] /home/lmartinez/teste-namd/./namd2 [0x843808]
  [16] Tcl_EvalEx+0x176 [0x843e4b]
  [17] Tcl_EvalFile+0x134 [0x83b854]
  [18] _ZN9ScriptTcl3runEPc+0x14 [0x6fb71e]
  [19] main+0x21b [0x4b6743]
  [20] __libc_start_main+0xf4 [0x2ae403d22134]
  [21] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b28da]
Fatal error on PE 0> segmentation violation

3) Using amd64 (no-TCP):

------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
Stack Traceback:
  [0] /lib/libc.so.6 [0x2af45d3b35c0]
  [1] _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded+0x29f0 [0x5377b0]
  [2] _ZN20ComputeNonbondedPair7doForceEPP8CompAtomPP7Results+0x5da [0x52fc9e]
  [3] _ZN16ComputePatchPair6doWorkEv+0x85 [0x615dd1]
  [4] _ZN11WorkDistrib12enqueueWorkBEP12LocalWorkMsg+0x16 [0x728d16]
  [5] _ZN19CkIndex_WorkDistrib31_call_enqueueWorkB_LocalWorkMsgEPvP11WorkDistrib+0xf [0x728cfd]
  [6] CkDeliverMessageFree+0x21 [0x786a6b]
  [7] _Z15_processHandlerPvP11CkCoreState+0x455 [0x786075]
  [8] CsdScheduleForever+0xa2 [0x7f18a2]
  [9] CsdScheduler+0x1c [0x7f14a0]
  [10] _ZN7BackEnd7suspendEv+0xb [0x4bab01]
  [11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fc260]
  [12] TclInvokeStringCommand+0x91 [0x80cc78]
  [13] /home/lmartinez/teste-namd/./NAMD_2.6_Linux-amd64/namd2 [0x842ac8]
  [14] Tcl_EvalEx+0x176 [0x84310b]
  [15] Tcl_EvalFile+0x134 [0x83ab14]
  [16] _ZN9ScriptTcl3runEPc+0x14 [0x6fb99e]
  [17] main+0x21b [0x4b69c3]
  [18] __libc_start_main+0xf4 [0x2af45d3a1134]
  [19] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b2b5a]
Fatal error on PE 0> segmentation violation
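
To compare the traces more easily, a small script along these lines
could pull out the first named frame from each one, that is, the first
frame that is a symbol rather than a bare library or binary path. This
is just a sketch: the function names are mine, and the pattern assumes
the "[n] symbol+0xoffset [address]" layout shown in the traces above,
including the case where the address is wrapped onto its own line.

```python
import re

# A frame line starts with "[n]" followed by a symbol or a file path.
# Wrapped address lines such as " [0x4fc3b7]" do not match, because the
# bracket must contain only decimal digits.
FRAME = re.compile(r"^\s*\[(\d+)\]\s+(\S+)")


def frames(trace_text):
    """Return (index, symbol) pairs, with any '+0x...' offset stripped."""
    found = []
    for line in trace_text.splitlines():
        match = FRAME.match(line)
        if match:
            symbol = match.group(2).split("+0x")[0]
            found.append((int(match.group(1)), symbol))
    return found


def first_named_frame(trace_text):
    """Return the first frame that is not a bare path, or None."""
    for _, symbol in frames(trace_text):
        if not symbol.startswith("/"):
            return symbol
    return None


# The first frames of trace 2, copied from above.
sample = """\
  [0] /lib/libc.so.6 [0x2ae403d345c0]
  [1] _int_malloc+0xb6 [0x778758]
  [2] mm_malloc+0x53 [0x7785f7]
"""
print(first_named_frame(sample))
```

Run on the three traces, this would make it quick to see that each crash
is reported from a different function even though frame [0] is always
inside libc.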
