Re: namd crash: Signal: segmentation violation

From: Leandro Martínez (leandromartinez98_at_gmail.com)
Date: Fri May 18 2007 - 12:41:24 CDT

Hi Brian,
Actually, I have now checked, and the /lib directory
is only a symbolic link to /lib64, so /lib/libc.so.6
is /lib64/libc.so.6. The use of the wrong library
is therefore not the problem.
Sorry about this one.
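
For anyone repeating this check on another machine, here is a minimal Python sketch of the comparison described above. It is not part of the original exchange, and the helper name `same_library` is hypothetical. It only assumes that the two paths of interest are /lib/libc.so.6 and /lib64/libc.so.6, as in this thread.

```python
import os


def same_library(path_a, path_b):
    """Return True if both paths resolve to the same file on disk."""
    # os.path.realpath follows symbolic links in every path component,
    # so /lib/libc.so.6 is resolved through a /lib -> /lib64 link.
    return os.path.realpath(path_a) == os.path.realpath(path_b)


if __name__ == "__main__":
    for p in ("/lib/libc.so.6", "/lib64/libc.so.6"):
        print(p, "->", os.path.realpath(p))
    print("same file:", same_library("/lib/libc.so.6", "/lib64/libc.so.6"))
```

If the last line reports that the two paths are the same file, the "/lib/libc.so.6" entry in a stack trace says nothing about a 32 bit library being loaded.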

Curiously, however, the 32 bit binary of NAMD 2.5 we are
trying now uses 64 bit libraries on our Fedora 6.0 Opteron cluster
(running for 36 hours now...), but uses 32 bit libraries on
the Gentoo Athlon 64 cluster (running for four hours now),
as shown by "ldd ./namd2".
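
Since the question here is whether a binary and the libraries that ldd reports are 32 or 64 bit, a small sketch like the following can confirm the word size of each file directly. It relies only on the ELF header layout (magic bytes followed by the EI_CLASS byte). The function name `elf_class` is hypothetical and not something from NAMD or Charm++.

```python
def elf_class(path):
    """Return 32 or 64 for an ELF file, or None if the file is not ELF."""
    with open(path, "rb") as f:
        header = f.read(5)
    # ELF files start with the magic bytes 0x7f 'E' 'L' 'F'.
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return None
    # Byte 4 (EI_CLASS) is 1 for 32 bit objects and 2 for 64 bit objects.
    return {1: 32, 2: 64}.get(header[4])
```

Calling `elf_class("./namd2")` and then `elf_class` on each library path listed by ldd would show whether the binary and its libraries agree.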

If these runs with the NAMD 2.5 32 bit binary do not crash,
then the problem must be related to the new NAMD version
and some library, I guess, in some very particular way that
most people do not see.

I am going to write again once we are finally confident that these
simulations with "old" binaries are stable.

Thanks,
Leandro.

On 5/18/07, Brian Bennion <bennion1_at_llnl.gov> wrote:
> Hi Leandro,
> My guess is that the amd64 binary was built on a
> machine where lib64/libc was the default
> path. Now when running it on your Athlon64 which
> appears to be capable of both 32 and 64bit modes
> it is finding /lib/libc as the default
> path. Just conjecture on my part though.
> looking at the link line in your compile output I
> can't find where it calls the system libraries
> like libc.so.6. Can you send the makefile?
>
> Confirm that /lib64/libc.so.6 exists on the new Athlon64 cluster.
>
> My next guess would be to try unloading
> /lib/libc from the loader config file
> "ld_config.in". It should be in your /etc dir in
> Red Hat on the Athlon64 machines.
>
> Brian
>
> At 05:25 AM 5/18/2007, Leandro Martínez wrote:
> >Hi Brian,
> >
> >>/lib/libc.so.6 is being called and not /lib64/libc.so.6
> >
> >That's a good point. I have no idea why this library is being called
> >instead of the 64 bit one, this happens with all namd 2.6 binaries
> >we tested.
> >
> >- The output of the namd build is at
> >http://limes.iqm.unicamp.br/~lmartinez/compilation.log
> >
> >- When I say that we are using a single node I meant the
> >master and one node. Both Athlon 64 dual core machines, connected
> >through a gigabit network. Thus the simulations are running
> >with 4 cpus. When the simulation runs only on the
> >master (locally) it doesn't crash. The node is diskless.
> >
> >- The full set of files of our test is available at
> >http://limes.iqm.unicamp.br/~lmartinez/namd-test.tar.gz
> >Uncompressing it you will get a directory named namd-test.
> >Inside it there is a "test.namd" file, which is the input file for
> >namd. The binaries are not there, I can send you some of
> >them if you like, but actually most tests were done with
> >the binaries available at the NAMD site. We are using, for
> >example,
> >charmrun +p4 ++nodelist ./nodelist ++remote-shell ssh ./namd2 test-namd
> >to run the simulation.
> >
> >I will try debugging some run which crashes.
> >
> >One relevant piece of information is that I am now running the same test
> >with a 32 bit binary of NAMD 2.5 (Actually NAMD 2.5 for Linux-i686-TCP),
> >in our opteron cluster with Fedora 6.0 (15 nodes, 30 cpus)
> >and it has been running for 30 hours. A few more days and we will be happy
> >about that at least. But note that we have tried running the same simulation
> >with the NAMD 2.6 for Linux-amd64 binary, in the same
> >cluster, and we got the segmentation violation error. I will be trying
> >to run the old binary in our new Athlon 64 cluster to see if we get
> >a more stable run.
> >
> >Thanks again,
> >Leandro.
> >
> >>
> >>When you say one node do you mean that the namd
> >>job is running on a single computer with more
> >>than 1 cpu? If this is true then the networking
> >>problem is probably a red herring.
> >>Do you mind sending appropriate files for me to test on our opteron clusters?
> >>
> >>The debugger could be gdb if running on one node.
> >>We typically use TotalView.
> >>There are others that might be more friendly.
> >>
> >>I would also suggest a NAMD build that has the -g
> >>switch turned on to collect more debug information.
> >>
> >>At 01:39 PM 5/17/2007, Leandro Martínez wrote:
> >> >Hi Brian, thank you very much for the answer. The stack traces I sent
> >> >before were not obtained on the same hardware. The ones that I'm
> >> >sending below are both from the same mini-cluster (one node), but
> >> >different namd binaries, which I specify. As you will see, they are also
> >> >the same kind of errors, but each one with a different stack trace.
> >> >
> >> >I also think there is some corruption of the
> >> >data in transit, but I have no idea how to
> >> >track or solve it. We have in some
> >> >of our previous attempts obtained errors in which "an atom moving too
> >> >fast was detected", but the velocity was absurd in only one of the three
> >> >components, thus suggesting that it was corrupted data rather than
> >> >an actual simulation issue. In our present configuration we have not
> >> >seen this problem anymore, but only these "libc.so.6" issues.
> >> >
> >> >The network cards we have already changed once, and this problem
> >> >appeared in two different clusters, so I wouldn't really bet that it is
> >> >a hardware problem.
> >> >
> >> >Can you suggest some debugger? I'm not familiar with those.
> >> >I have run the simulation using ++debug but I couldn't get any
> >> >meaningful information.
> >> >
> >> >Thanks,
> >> >Leandro.
> >> >
> >> >These are the stack traces of three runs in the same cluster, same
> >> >nodes, different binaries:
> >> >
> >> >1) Using home-compiled namd binary with no fftw:
> >> >
> >> >ENERGY: 10600 19987.4183 13118.5784 1321.5175
> >> >124.6337 -252623.9701 24015.9942 0.0000
> >> >0.0000 52674.4657 -141381.3625 296.7927
> >> >-140902.1798 -140903.3821 298.0487 -1714.5778
> >> >-1687.4130 636056.0000 -1638.2391 -1637.8758
> >> >
> >> >------------- Processor 0 Exiting: Caught Signal ------------
> >> >Signal: segmentation violation
> >> >Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> >> >Stack Traceback:
> >> > [0] /lib/libc.so.6 [0x2b770d18a5c0]
> >> > [1] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE10loadTuplesEv+0x597 [0x4fc3b7]
> >> > [2] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE6doWorkEv+0x12ab [0x50a78b]
> >> > [3] _ZN19CkIndex_WorkDistrib31_call_enqueueBonds_LocalWorkMsgEPvP11WorkDistrib+0xd [0x6986bd]
> >> > [4] CkDeliverMessageFree+0x30 [0x6eb1c0]
> >> > [5] /home/lmartinez/namd-nofftw/./namd2 [0x6eb21f]
> >> > [6] /home/lmartinez/namd-nofftw/./namd2 [0x6ec501]
> >> > [7] /home/lmartinez/namd-nofftw/./namd2 [0x6eea9a]
> >> > [8] /home/lmartinez/namd-nofftw/./namd2 [0x6eed48]
> >> > [9] _Z15_processHandlerPvP11CkCoreState+0x130 [0x6efd68]
> >> > [10] CmiHandleMessage+0xa5 [0x756e11]
> >> > [11] CsdScheduleForever+0x75 [0x7571d2]
> >> > [12] CsdScheduler+0x16 [0x757135]
> >> > [13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x153 [0x676ec3]
> >> > [14] TclInvokeStringCommand+0x64 [0x2b770cb626a4]
> >> > [15] TclEvalObjvInternal+0x1aa [0x2b770cb63d9a]
> >> > [16] Tcl_EvalEx+0x397 [0x2b770cb64367]
> >> > [17] Tcl_FSEvalFile+0x1ed [0x2b770cba5f2d]
> >> > [18] Tcl_EvalFile+0x2e [0x2b770cba5fee]
> >> > [19] _ZN9ScriptTcl3runEPc+0x24 [0x677104]
> >> > [20] main+0x201 [0x4c2671]
> >> > [21] __libc_start_main+0xf4 [0x2b770d178134]
> >> > [22] __gxx_personality_v0+0x109 [0x4bf739]
> >> >Fatal error on PE 0> segmentation violation
> >> >
> >> >2) Using provided binary amd64-TCP:
> >> >
> >> >ENERGY: 9800 20072.8949 13201.1162 1343.0570
> >> >131.8154 -253024.4316 24209.4947 0.0000
> >> >0.0000 53016.0637 -141049.9898 298.7174
> >> >-140572.1009 -140570.0247 299.2188 -1871.3003
> >> >-1682.6774 636056.0000 -1712.7694 -1711.5329
> >> >
> >> >------------- Processor 0 Exiting: Caught Signal ------------
> >> >Signal: segmentation violation
> >> >Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> >> >Stack Traceback:
> >> > [0] /lib/libc.so.6 [0x2ae403d345c0]
> >> > [1] _int_malloc+0xb6 [0x778758]
> >> > [2] mm_malloc+0x53 [0x7785f7]
> >> > [3] malloc+0x16 [0x77c5fc]
> >> > [4] _Znwm+0x1d [0x2ae403bc0b4d]
> >> > [5] _ZN11ResizeArrayI6VectorEC1Ev+0x28 [0x6767e0]
> >> > [6] __cxa_vec_ctor+0x46 [0x2ae403bc22a6]
> >> > [7] _ZN14ProxyResultMsg6unpackEPv+0x62 [0x6e83ca]
> >> > [8] _Z15CkUnpackMessagePP8envelope+0x28 [0x787036]
> >> > [9] _Z15_processHandlerPvP11CkCoreState+0x412 [0x785db2]
> >> > [10] CsdScheduleForever+0xa2 [0x7f2492]
> >> > [11] CsdScheduler+0x1c [0x7f2090]
> >> > [12] _ZN7BackEnd7suspendEv+0xb [0x4ba881]
> >> > [13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fbfe0]
> >> > [14] TclInvokeStringCommand+0x91 [0x80d9b8]
> >> > [15] /home/lmartinez/teste-namd/./namd2 [0x843808]
> >> > [16] Tcl_EvalEx+0x176 [0x843e4b]
> >> > [17] Tcl_EvalFile+0x134 [0x83b854]
> >> > [18] _ZN9ScriptTcl3runEPc+0x14 [0x6fb71e]
> >> > [19] main+0x21b [0x4b6743]
> >> > [20] __libc_start_main+0xf4 [0x2ae403d22134]
> >> > [21] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b28da]
> >> >Fatal error on PE 0> segmentation violation
> >> >
> >> >3) Using amd64 (no-TCP)
> >> >
> >> >------------- Processor 0 Exiting: Caught Signal ------------
> >> >Signal: segmentation violation
> >> >Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> >> >Stack Traceback:
> >> > [0] /lib/libc.so.6 [0x2af45d3b35c0]
> >> > [1] _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded+0x29f0 [0x5377b0]
> >> > [2] _ZN20ComputeNonbondedPair7doForceEPP8CompAtomPP7Results+0x5da [0x52fc9e]
> >> > [3] _ZN16ComputePatchPair6doWorkEv+0x85 [0x615dd1]
> >> > [4] _ZN11WorkDistrib12enqueueWorkBEP12LocalWorkMsg+0x16 [0x728d16]
> >> > [5] _ZN19CkIndex_WorkDistrib31_call_enqueueWorkB_LocalWorkMsgEPvP11WorkDistrib+0xf [0x728cfd]
> >> > [6] CkDeliverMessageFree+0x21 [0x786a6b]
> >> > [7] _Z15_processHandlerPvP11CkCoreState+0x455 [0x786075]
> >> > [8] CsdScheduleForever+0xa2 [0x7f18a2]
> >> > [9] CsdScheduler+0x1c [0x7f14a0]
> >> > [10] _ZN7BackEnd7suspendEv+0xb [0x4bab01]
> >> > [11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fc260]
> >> > [12] TclInvokeStringCommand+0x91 [0x80cc78]
> >> > [13] /home/lmartinez/teste-namd/./NAMD_2.6_Linux-amd64/namd2 [0x842ac8]
> >> > [14] Tcl_EvalEx+0x176 [0x84310b]
> >> > [15] Tcl_EvalFile+0x134 [0x83ab14]
> >> > [16] _ZN9ScriptTcl3runEPc+0x14 [0x6fb99e]
> >> > [17] main+0x21b [0x4b69c3]
> >> > [18] __libc_start_main+0xf4 [0x2af45d3a1134]
> >> > [19] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b2b5a]
> >> >Fatal error on PE 0> segmentation violation
>
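
The three tracebacks quoted above fail in different places (bond tuple loading, malloc, and the nonbonded pair computation). When comparing many such traces, it can help to extract the frames mechanically. The following is a minimal sketch, assuming the frame format seen in this thread: `[index] symbol+0xoffset [0xaddress]`, with the offset optional. The names `FRAME_RE` and `parse_frame` are hypothetical and are not part of NAMD or Charm++.

```python
import re

# One frame looks like "[4] CkDeliverMessageFree+0x30 [0x6eb1c0]".
# Frames without a symbol offset, such as "[0] /lib/libc.so.6 [0x...]",
# are also accepted.
FRAME_RE = re.compile(
    r"\[(?P<index>\d+)\]\s+"
    r"(?P<symbol>\S+?)"
    r"(?:\+0x(?P<offset>[0-9a-fA-F]+))?"
    r"\s+\[0x(?P<address>[0-9a-fA-F]+)\]"
)


def parse_frame(line):
    """Parse one traceback line; return a dict, or None if it is not a frame."""
    # search() rather than match() so that mail quote prefixes are skipped.
    m = FRAME_RE.search(line)
    if m is None:
        return None
    offset = m.group("offset")
    return {
        "index": int(m.group("index")),
        "symbol": m.group("symbol"),
        "offset": int(offset, 16) if offset else None,
        "address": int(m.group("address"), 16),
    }
```

The sketch expects each frame on a single line, so frames wrapped by a mail client need to be rejoined first. The mangled C++ symbols it returns can then be passed to a demangler such as c++filt to make them readable.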
