Re: namd crash: Signal: segmentation violation

From: Leandro Martínez (leandromartinez98_at_gmail.com)
Date: Fri May 18 2007 - 07:25:02 CDT

Hi Brian,

> /lib/libc.so.6 is being called and not /lib64/libc.so.6

That's a good point. I have no idea why this library is being called
instead of the 64-bit one; this happens with all the NAMD 2.6 binaries
we tested.
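
One way to check whether /lib/libc.so.6 really is a 32-bit library is to
read the class byte of its ELF header. On some 64-bit distributions /lib
itself holds 64-bit libraries, so the path alone may not settle the
question. The following is a minimal sketch, not part of NAMD or Charm++;
the function name elf_class is hypothetical, and the default paths are
simply the two libraries mentioned above.

```python
import os
import sys

ELF_MAGIC = b"\x7fELF"
# EI_CLASS values from the ELF specification (byte 4 of the file).
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}


def elf_class(path):
    """Return "32-bit" or "64-bit" for an ELF file, or None otherwise."""
    with open(path, "rb") as handle:
        header = handle.read(5)
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None  # not an ELF object (for example a text linker script)
    return ELF_CLASSES.get(header[4])


if __name__ == "__main__":
    # Defaults are the two libraries discussed above (an assumption about
    # where they live); pass other files, such as ./namd2, as arguments.
    paths = sys.argv[1:] or ["/lib/libc.so.6", "/lib64/libc.so.6"]
    for path in paths:
        if os.path.isfile(path):
            print(path, "->", elf_class(path))
        else:
            print(path, "-> not found")
```

Running it on both the master and the diskless node would show whether the
two machines resolve libc to files of the same class.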

- The output of the namd build is at
http://limes.iqm.unicamp.br/~lmartinez/compilation.log

- When I said that we are using a single node, I meant the
master plus one compute node. Both are Athlon 64 dual-core machines
connected through a gigabit network, so the simulations run
on 4 CPUs. When the simulation runs only on the
master (locally), it doesn't crash. The node is diskless.

- The full set of files for our test is available at
http://limes.iqm.unicamp.br/~lmartinez/namd-test.tar.gz
Uncompressing it gives a directory named namd-test.
Inside it there is a "test.namd" file, which is the input file for
NAMD. The binaries are not included; I can send you some of
them if you like, but most tests were actually done with
the binaries available at the NAMD site. We are using, for
example,
charmrun +p4 ++nodelist ./nodelist ++remote-shell ssh ./namd2 test.namd
to run the simulation.
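
For anyone reproducing the run, the nodelist file and the command line can
be generated as in the sketch below. It assumes the usual Charm++ nodelist
layout ("group main" followed by one "host" line per machine); the host
names "master" and "node1" and both function names are hypothetical, not
taken from our cluster.

```python
import shlex


def nodelist_text(hosts, group="main"):
    """Build the contents of a Charm++ nodelist file for the given hosts."""
    lines = ["group " + group] + ["host " + host for host in hosts]
    return "\n".join(lines) + "\n"


def charmrun_command(n_procs, nodelist_path, config, binary="./namd2"):
    """Return the charmrun invocation quoted above as an argument list."""
    return [
        "charmrun", "+p%d" % n_procs,
        "++nodelist", nodelist_path,
        "++remote-shell", "ssh",
        binary, config,
    ]


if __name__ == "__main__":
    # Hypothetical host names: two dual-core machines give 4 processes.
    print(nodelist_text(["master", "node1"]), end="")
    print(shlex.join(charmrun_command(4, "./nodelist", "test.namd")))
```

Keeping the command as a list avoids quoting problems if it is later handed
to subprocess.run.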

I will try debugging one of the runs that crash.

One relevant piece of information: I am now running the same test
with a 32-bit binary of NAMD 2.5 (actually NAMD 2.5 for Linux-i686-TCP)
on our Opteron cluster with Fedora 6.0 (15 nodes, 30 CPUs),
and it has been running for 30 hours. A few more days and we will be
happy on that front at least. Note, however, that we have tried running
the same simulation with the NAMD 2.6 for Linux-amd64 binary on the same
cluster, and we got the segmentation violation error. I will try
running the old binary on our new Athlon 64 cluster to see if we get
a more stable run.

Thanks again,
Leandro.

>
> When you say one node do you mean that the namd
> job is running on a single computer with more
> than 1 cpu? If this is true then the networking
> problem is probably a red herring.
> Do you mind sending appropriate files for me to test on our opteron clusters?
>
> The debugger could be gdb if running on one node.
> We typically use TotalView.
> There are others that might be more friendly.
>
> I would also suggest a NAMD build that has the -g
> switch turned on to collect more debug information.
>
> At 01:39 PM 5/17/2007, Leandro Martínez wrote:
> >Hi Brian, thank you very much for the answer. The stack traces I sent
> >before were not obtained on the same hardware. The ones that I'm
> >sending below are both from the same mini-cluster (one node), but
> >with different namd binaries, which I specify. As you will see, they are
> >the same kind of error, but each one has a different stack trace.
> >
> >I also think there is some corruption of the
> >data in transit, but I have no idea how to track or solve it. In some
> >of our previous attempts we obtained errors in which "an atom moving too
> >fast was detected", but the velocity was absurd in only one of the three
> >components, thus suggesting that it was corrupted data rather than
> >an actual simulation issue. In our present configuration we have not
> >seen this problem anymore, but only these "libc.so.6" issues.
> >
> >We have already changed the network cards once, and this problem
> >appeared in two different clusters, so I wouldn't really bet that it is
> >a hardware problem.
> >
> >Can you suggest a debugger? I'm not familiar with those.
> >I have run the simulation using ++debug but I couldn't get any
> >meaningful information.
> >
> >Thanks,
> >Leandro.
> >
> >These are the stack traces of three runs in the same cluster, same
> >nodes, different binaries:
> >
> >1) Using home-compiled namd binary with no fftw:
> >
> >ENERGY: 10600 19987.4183 13118.5784 1321.5175
> >124.6337 -252623.9701 24015.9942 0.0000
> >0.0000 52674.4657 -141381.3625 296.7927
> >-140902.1798 -140903.3821 298.0487 -1714.5778
> >-1687.4130 636056.0000 -1638.2391 -1637.8758
> >
> >------------- Processor 0 Exiting: Caught Signal ------------
> >Signal: segmentation violation
> >Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> >Stack Traceback:
> > [0] /lib/libc.so.6 [0x2b770d18a5c0]
> > [1] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE10loadTuplesEv+0x597 [0x4fc3b7]
> > [2] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE6doWorkEv+0x12ab [0x50a78b]
> > [3] _ZN19CkIndex_WorkDistrib31_call_enqueueBonds_LocalWorkMsgEPvP11WorkDistrib+0xd [0x6986bd]
> > [4] CkDeliverMessageFree+0x30 [0x6eb1c0]
> > [5] /home/lmartinez/namd-nofftw/./namd2 [0x6eb21f]
> > [6] /home/lmartinez/namd-nofftw/./namd2 [0x6ec501]
> > [7] /home/lmartinez/namd-nofftw/./namd2 [0x6eea9a]
> > [8] /home/lmartinez/namd-nofftw/./namd2 [0x6eed48]
> > [9] _Z15_processHandlerPvP11CkCoreState+0x130 [0x6efd68]
> > [10] CmiHandleMessage+0xa5 [0x756e11]
> > [11] CsdScheduleForever+0x75 [0x7571d2]
> > [12] CsdScheduler+0x16 [0x757135]
> > [13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x153 [0x676ec3]
> > [14] TclInvokeStringCommand+0x64 [0x2b770cb626a4]
> > [15] TclEvalObjvInternal+0x1aa [0x2b770cb63d9a]
> > [16] Tcl_EvalEx+0x397 [0x2b770cb64367]
> > [17] Tcl_FSEvalFile+0x1ed [0x2b770cba5f2d]
> > [18] Tcl_EvalFile+0x2e [0x2b770cba5fee]
> > [19] _ZN9ScriptTcl3runEPc+0x24 [0x677104]
> > [20] main+0x201 [0x4c2671]
> > [21] __libc_start_main+0xf4 [0x2b770d178134]
> > [22] __gxx_personality_v0+0x109 [0x4bf739]
> >Fatal error on PE 0> segmentation violation
> >
> >2) Using provided binary amd64-TCP:
> >
> >ENERGY: 9800 20072.8949 13201.1162 1343.0570
> >131.8154 -253024.4316 24209.4947 0.0000
> >0.0000 53016.0637 -141049.9898 298.7174
> >-140572.1009 -140570.0247 299.2188 -1871.3003
> >-1682.6774 636056.0000 -1712.7694 -1711.5329
> >
> >------------- Processor 0 Exiting: Caught Signal ------------
> >Signal: segmentation violation
> >Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> >Stack Traceback:
> > [0] /lib/libc.so.6 [0x2ae403d345c0]
> > [1] _int_malloc+0xb6 [0x778758]
> > [2] mm_malloc+0x53 [0x7785f7]
> > [3] malloc+0x16 [0x77c5fc]
> > [4] _Znwm+0x1d [0x2ae403bc0b4d]
> > [5] _ZN11ResizeArrayI6VectorEC1Ev+0x28 [0x6767e0]
> > [6] __cxa_vec_ctor+0x46 [0x2ae403bc22a6]
> > [7] _ZN14ProxyResultMsg6unpackEPv+0x62 [0x6e83ca]
> > [8] _Z15CkUnpackMessagePP8envelope+0x28 [0x787036]
> > [9] _Z15_processHandlerPvP11CkCoreState+0x412 [0x785db2]
> > [10] CsdScheduleForever+0xa2 [0x7f2492]
> > [11] CsdScheduler+0x1c [0x7f2090]
> > [12] _ZN7BackEnd7suspendEv+0xb [0x4ba881]
> > [13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fbfe0]
> > [14] TclInvokeStringCommand+0x91 [0x80d9b8]
> > [15] /home/lmartinez/teste-namd/./namd2 [0x843808]
> > [16] Tcl_EvalEx+0x176 [0x843e4b]
> > [17] Tcl_EvalFile+0x134 [0x83b854]
> > [18] _ZN9ScriptTcl3runEPc+0x14 [0x6fb71e]
> > [19] main+0x21b [0x4b6743]
> > [20] __libc_start_main+0xf4 [0x2ae403d22134]
> > [21] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b28da]
> >Fatal error on PE 0> segmentation violation
> >
> >3) Using amd64 (no-TCP)
> >
> >------------- Processor 0 Exiting: Caught Signal ------------
> >Signal: segmentation violation
> >Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> >Stack Traceback:
> > [0] /lib/libc.so.6 [0x2af45d3b35c0]
> > [1] _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded+0x29f0 [0x5377b0]
> > [2] _ZN20ComputeNonbondedPair7doForceEPP8CompAtomPP7Results+0x5da [0x52fc9e]
> > [3] _ZN16ComputePatchPair6doWorkEv+0x85 [0x615dd1]
> > [4] _ZN11WorkDistrib12enqueueWorkBEP12LocalWorkMsg+0x16 [0x728d16]
> > [5] _ZN19CkIndex_WorkDistrib31_call_enqueueWorkB_LocalWorkMsgEPvP11WorkDistrib+0xf [0x728cfd]
> > [6] CkDeliverMessageFree+0x21 [0x786a6b]
> > [7] _Z15_processHandlerPvP11CkCoreState+0x455 [0x786075]
> > [8] CsdScheduleForever+0xa2 [0x7f18a2]
> > [9] CsdScheduler+0x1c [0x7f14a0]
> > [10] _ZN7BackEnd7suspendEv+0xb [0x4bab01]
> > [11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fc260]
> > [12] TclInvokeStringCommand+0x91 [0x80cc78]
> > [13] /home/lmartinez/teste-namd/./NAMD_2.6_Linux-amd64/namd2 [0x842ac8]
> > [14] Tcl_EvalEx+0x176 [0x84310b]
> > [15] Tcl_EvalFile+0x134 [0x83ab14]
> > [16] _ZN9ScriptTcl3runEPc+0x14 [0x6fb99e]
> > [17] main+0x21b [0x4b69c3]
> > [18] __libc_start_main+0xf4 [0x2af45d3a1134]
> > [19] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b2b5a]
> >Fatal error on PE 0> segmentation violation
>

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:44:41 CST