Re: namd crash: Signal: segmentation violation

From: Leandro Martínez (leandromartinez98_at_gmail.com)
Date: Fri May 18 2007 - 15:13:00 CDT

Adding more information:

The binaries compiled with -memory paranoid crash on all
dual-core CPUs we have tried when using more than
one process, even on the machines that otherwise run
simulations fine. However, on our Opteron 242 machines,
which are not dual-core (but have two independent
processors), the simulation does not crash. It seems
that the -memory paranoid option is sensitive to the
dual-core architecture.
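
To tell a dual-core machine apart from one with two independent single-core processors, the socket and core layout can be read from /proc/cpuinfo. The following is only a minimal sketch, assuming a Linux kernel that reports the "physical id" and "core id" fields; the function names cpu_topology and describe are made up for illustration.

```python
import os
from collections import defaultdict


def cpu_topology(cpuinfo_text):
    """Return {physical_id: set of core ids} parsed from /proc/cpuinfo text."""
    sockets = defaultdict(set)
    # /proc/cpuinfo separates logical processors with blank lines.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        # Assumption: if the kernel omits these fields, treat every logical
        # processor as its own single-core socket.
        phys = fields.get("physical id", fields["processor"])
        core = fields.get("core id", "0")
        sockets[phys].add(core)
    return dict(sockets)


def describe(topology):
    """Summarize the layout as sockets and cores per socket."""
    n_sockets = len(topology)
    max_cores = max((len(cores) for cores in topology.values()), default=0)
    kind = "multi-core" if max_cores > 1 else "single-core"
    return f"{n_sockets} socket(s), up to {max_cores} core(s) per socket ({kind})"


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as fh:
        print(describe(cpu_topology(fh.read())))
```

Under these assumptions, a dual-core machine would show one socket with two cores, while a two-processor Opteron 242 machine would show two sockets with one core each.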

Here,

http://limes.iqm.unicamp.br/~lmartinez/namd-memoryparanoid.tar.gz

you can get the -memory paranoid binaries we are using,
along with the other files of our test simulation.
To reproduce the crash, just unpack the
file, go into the namd-memoryparanoid directory and run

./charmrun ++local +p2 ++verbose ./namd2 teste-gentoo.namd

On the dual-core machines it doesn't get to the first time step.
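
For anyone who wants to repeat the test with different process counts, here is a small sketch that builds the charmrun command lines used in this thread and reports whether a run printed the "segmentation violation" message. The helper names charmrun_cmd and run_case are hypothetical; the only charmrun options used are the ones already shown in these messages (++local, +pN, ++nodelist, ++verbose).

```python
import shlex
import subprocess


def charmrun_cmd(nprocs, config, nodelist=None, verbose=True):
    """Build a charmrun command line; run locally unless a nodelist is given."""
    cmd = ["./charmrun"]
    if nodelist is None:
        cmd.append("++local")
    cmd.append(f"+p{nprocs}")
    if nodelist is not None:
        cmd += ["++nodelist", nodelist]
    if verbose:
        cmd.append("++verbose")
    cmd += ["./namd2", config]
    return cmd


def run_case(cmd, cwd="namd-memoryparanoid"):
    """Run one case and return (exit code, True if the segfault text was seen)."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    return proc.returncode, "segmentation violation" in output


# Print the two local cases worth comparing: one process and two processes.
for n in (1, 2):
    print(shlex.join(charmrun_cmd(n, "teste-gentoo.namd")))
```

Calling run_case(charmrun_cmd(2, "teste-gentoo.namd")) from the directory that contains the unpacked namd-memoryparanoid folder would run the failing case.
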
Thanks,
Leandro.

On 5/18/07, Leandro Martínez <leandromartinez98_at_gmail.com> wrote:
> Hi Brian and others,
> Well, in the Athlon 64 cluster the simulation with the old
> 32 bit binary has also crashed, with the following message:
>
> ------------- Processor 2 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> req_handle_abort called
> Fatal error on PE 2> segmentation violation
>
> We have then compiled namd using the -memory paranoid
> option, with the amd64 and g++ options (NAMD 2.6 for
> Linux-amd64--memory). The good news is that
> the simulations now crash faster and reproducibly, with
> the message:
>
> Info: Entering startup phase 8 with 60060 kB of memory in use.
> Info: Finished startup with 60060 kB of memory in use.
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> Stack Traceback:
> [0] /lib/libc.so.6 [0x2ae55a6745c0]
> [1] _ZN10Controller9threadRunEPS_+0 [0x5d8980]
> Fatal error on PE 0> segmentation violation
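
When comparing tracebacks like this one between runs, it can help to pull each frame apart into its symbol, offset and address. The sketch below is only one way to do that, assuming each frame sits on a single line in the form "[n] symbol+offset [0xaddress]". The names parse_frames and demangle are made up here. Demangling is delegated to the c++filt tool from binutils, and only if it is installed.

```python
import re
import shutil
import subprocess

# One frame looks like "[1] _ZN10Controller9threadRunEPS_+0 [0x5d8980]";
# the "+offset" part is absent for frames that only name a library or binary.
FRAME_RE = re.compile(
    r"\[(\d+)\]\s+(\S+?)(?:\+(0x[0-9a-fA-F]+|\d+))?\s+\[(0x[0-9a-fA-F]+)\]"
)


def parse_frames(text):
    """Extract frames from a Charm++ style traceback (quote prefixes are ignored)."""
    frames = []
    for match in FRAME_RE.finditer(text):
        index, symbol, offset, address = match.groups()
        frames.append({
            "index": int(index),
            "symbol": symbol,
            "offset": int(offset, 0) if offset else None,
            "address": int(address, 16),
        })
    return frames


def demangle(symbol):
    """Demangle a C++ symbol with c++filt if available; otherwise return it as is."""
    tool = shutil.which("c++filt")
    if tool is None or not symbol.startswith("_Z"):
        return symbol
    result = subprocess.run([tool, symbol], capture_output=True, text=True)
    return result.stdout.strip() or symbol
```

Frames that were wrapped across several quoted lines by the mail client would need to be rejoined before parsing.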
>
> This crash also occurs if we run with two processes locally
> on the master (charmrun ++local +p2) and if we
> run with two processes on two nodes (charmrun +p2 ++nodelist
> ./nodelist), with the master plus one node in the nodelist.
>
> However, it does not seem
> to crash if we run with only one process locally
> (charmrun ++local +p1) or if we don't use charmrun (./namd2 ...).
>
> Recall that we are dealing with
> dual-core processors.
>
> Thanks for any help,
> Leandro.
>
>
> On 5/18/07, Leandro Martínez <leandromartinez98_at_gmail.com> wrote:
> > Sorry, wrong again. The namd2 32 bit binary of our Fedora cluster
> > uses 32 bit libraries, as it should be.
> > Leandro.
> >
> > On 5/18/07, Leandro Martínez <leandromartinez98_at_gmail.com> wrote:
> > > Hi Brian,
> > > Actually, I have now checked, and the /lib directory
> > > is only a symbolic link to /lib64, so /lib/libc.so.6
> > > is /lib64/libc.so.6. The use of the wrong library
> > > is therefore not the problem.
> > > Sorry about this one.
> > >
> > > Curiously, however, the 32 bit binary of namd 2.5 we are
> > > trying now uses 64 bit libraries on our fedora 6.0 Opteron cluster
> > > (running for 36 hours now...), but uses 32 bit libraries on
> > > the Gentoo Athlon 64 cluster (running for four hours now),
> > > as shown by "ldd ./namd2".
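
Both checks above (where /lib/libc.so.6 really points, and whether a binary or library is 32 or 64 bit) can be scripted. This is only a sketch: the function names elf_class and describe_library are made up for illustration, and it relies on the ELF header layout, where the byte after the "\x7fELF" magic is 1 for 32-bit and 2 for 64-bit objects.

```python
import os

# EI_CLASS byte of the ELF identification: 1 = ELFCLASS32, 2 = ELFCLASS64.
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}


def elf_class(path):
    """Return '32-bit' or '64-bit' from the ELF header, or None if not an ELF file."""
    with open(path, "rb") as fh:
        header = fh.read(5)
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return None
    return ELF_CLASSES.get(header[4])


def describe_library(path):
    """Resolve symbolic links in `path` and report the ELF class of the real file."""
    real = os.path.realpath(path)
    return {
        "path": path,
        "resolves_to": real,
        "is_symlinked": real != os.path.abspath(path),
        "elf_class": elf_class(real),
    }
```

For example, describe_library("/lib/libc.so.6") on a machine where /lib is a link to /lib64 would report a path under /lib64 in "resolves_to", and elf_class("./namd2") would tell the 32 bit binary from the amd64 one.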
> > >
> > > If these runs with the namd 2.5 32 bit binary do not crash,
> > > then the problem must be related to the new namd version
> > > and some library, I guess, in some very particular way that
> > > most people do not see.
> > >
> > > I am going to write again once we are finally confident that these
> > > simulations with "old" binaries are stable.
> > >
> > > Thanks,
> > > Leandro.
> > >
> > > On 5/18/07, Brian Bennion <bennion1_at_llnl.gov> wrote:
> > > > > Hi Leandro,
> > > > My guess is that the amd64 binary was built on a
> > > > machine where lib64/libc was the default
> > > > path. Now when running it on your Athlon64 which
> > > > appears to be capable of both 32 and 64bit modes
> > > > it is finding /lib/libc as the default
> > > > path. Just conjecture on my part though.
> > > > > Looking at the link line in your compile output I
> > > > can't find where it calls the system libraries
> > > > like libc.so.6. Can you send the makefile?
> > > >
> > > > Confirm that /lib64/libc.so.6 exists on the new Athlon64 cluster.
> > > >
> > > > > My next guess would be to try unloading
> > > > > /lib/libc from the loader config file
> > > > > "ld_config.in". It should be in your /etc dir in
> > > > > Red Hat on the Athlon64 machines.
> > > >
> > > > Brian
> > > >
> > > > At 05:25 AM 5/18/2007, Leandro Martínez wrote:
> > > > >Hi Brian,
> > > > >
> > > > >>/lib/libc.so.6 is being called and not /lib64/libc.so.6
> > > > >
> > > > >That's a good point. I have no idea why this library is being called
> > > > >instead of the 64 bit one, this happens with all namd 2.6 binaries
> > > > >we tested.
> > > > >
> > > > >- The output of the namd build is at
> > > > >http://limes.iqm.unicamp.br/~lmartinez/compilation.log
> > > > >
> > > > > >- When I say that we are using a single node I mean the
> > > > > >master and one node, both Athlon 64 dual-core machines connected
> > > > > >through a gigabit network. Thus the simulations are running
> > > > > >with 4 cpus. When the simulation runs only on the
> > > > > >master (locally) it doesn't crash. The node is diskless.
> > > > >
> > > > >- The full set of files of our test is available at
> > > > >http://limes.iqm.unicamp.br/~lmartinez/namd-test.tar.gz
> > > > >Uncompressing it you will get a directory named namd-test.
> > > > >Inside it there is a "test.namd" file, which is the input file for
> > > > >namd. The binaries are not there, I can send you some of
> > > > >them if you like, but actually most tests were done with
> > > > >the binaries available at the NAMD site. We are using, for
> > > > >example,
> > > > > >charmrun +p4 ++nodelist ./nodelist ++remote-shell ssh ./namd2 test.namd
> > > > >to run the simulation.
> > > > >
> > > > >I will try debugging some run which crashes.
> > > > >
> > > > > >One relevant piece of information is that I am now running the same test
> > > > > >with a 32 bit binary of NAMD 2.5 (actually NAMD 2.5 for Linux-i686-TCP)
> > > > > >on our Opteron cluster with Fedora 6.0 (15 nodes, 30 cpus),
> > > > > >and it has been running for 30 hours. A few more days and we will be happy about
> > > > > >that at least. But note that we have tried running the same simulation
> > > > > >with the NAMD 2.6 for Linux-amd64 binary on the same
> > > > > >cluster, and we got the segmentation violation error. I will try
> > > > > >to run the old binary on our new Athlon 64 cluster to see if we get
> > > > > >a more stable run.
> > > > >
> > > > >Thanks again,
> > > > >Leandro.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >>
> > > > >>When you say one node do you mean that the namd
> > > > >>job is running on a single computer with more
> > > > >>than 1 cpu? If this is true then the networking
> > > > >>problem is probably a red herring.
> > > > >>Do you mind sending appropriate files for me to test on our opteron clusters?
> > > > >>
> > > > >>The debugger could be gdb if running on one node
> > > > >>We typically use totalview
> > > > >>There are others that might be more friendly.
> > > > >>
> > > > >>I would also suggest a NAMD build that has a -g
> > > > >>switch turned on to collect more debug.
> > > > >>
> > > > >>At 01:39 PM 5/17/2007, Leandro Martínez wrote:
> > > > >> >Hi Brian, thank you very much for the answer. The stack traces I sent
> > > > >> >before were not obtained in the same hardware. The ones that I'm
> > > > >> >sending below are both from the same mini-cluster (one node), but
> > > > >> >different namd binaries, which I specify. As you will see, they are also
> > > > >> >the same kind of errors, but each one with a different stack trace.
> > > > >> >
> > > > > >> >I also think there is some corruption of the data in transit,
> > > > > >> >but I have no idea how to track or solve it. In some
> > > > > >> >of our previous attempts we obtained errors reporting that "an atom moving too
> > > > > >> >fast was detected", but the velocity was absurd in only one of the three
> > > > > >> >components, suggesting that it was corrupted data rather than
> > > > > >> >an actual simulation issue. In our present configuration we have not
> > > > > >> >seen this problem anymore, only these "libc.so.6" issues.
> > > > >> >
> > > > >> >The network cards we have already changed once, and this problem
> > > > >> >appeared in two different clusters, so I wouldn't really bet that it is
> > > > >> >a hardware problem.
> > > > >> >
> > > > >> >Can you suggest some debugger? I'm not familiar with those.
> > > > > >> >I have run the simulation using ++debug but I couldn't get any
> > > > >> >meaningful information.
> > > > >> >
> > > > >> >Thanks,
> > > > >> >Leandro.
> > > > >> >
> > > > >> >These are the stack traces of three runs in the same cluster, same
> > > > >> >nodes, different binaries:
> > > > >> >
> > > > >> >1) Using home-compiled namd binary with no fftw:
> > > > >> >
> > > > >> >ENERGY: 10600 19987.4183 13118.5784 1321.5175
> > > > >> >124.6337 -252623.9701 24015.9942 0.0000
> > > > >> >0.0000 52674.4657 -141381.3625 296.7927
> > > > >> >-140902.1798 -140903.3821 298.0487 -1714.5778
> > > > >> >-1687.4130 636056.0000 -1638.2391 -1637.8758
> > > > >> >
> > > > >> >------------- Processor 0 Exiting: Caught Signal ------------
> > > > >> >Signal: segmentation violation
> > > > >> >Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> > > > >> >Stack Traceback:
> > > > >> > [0] /lib/libc.so.6 [0x2b770d18a5c0]
> > > > > >> > [1] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE10loadTuplesEv+0x597 [0x4fc3b7]
> > > > > >> > [2] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE6doWorkEv+0x12ab [0x50a78b]
> > > > > >> > [3] _ZN19CkIndex_WorkDistrib31_call_enqueueBonds_LocalWorkMsgEPvP11WorkDistrib+0xd [0x6986bd]
> > > > >> > [4] CkDeliverMessageFree+0x30 [0x6eb1c0]
> > > > >> > [5] /home/lmartinez/namd-nofftw/./namd2 [0x6eb21f]
> > > > >> > [6] /home/lmartinez/namd-nofftw/./namd2 [0x6ec501]
> > > > >> > [7] /home/lmartinez/namd-nofftw/./namd2 [0x6eea9a]
> > > > >> > [8] /home/lmartinez/namd-nofftw/./namd2 [0x6eed48]
> > > > >> > [9] _Z15_processHandlerPvP11CkCoreState+0x130 [0x6efd68]
> > > > >> > [10] CmiHandleMessage+0xa5 [0x756e11]
> > > > >> > [11] CsdScheduleForever+0x75 [0x7571d2]
> > > > >> > [12] CsdScheduler+0x16 [0x757135]
> > > > >> > [13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x153 [0x676ec3]
> > > > >> > [14] TclInvokeStringCommand+0x64 [0x2b770cb626a4]
> > > > >> > [15] TclEvalObjvInternal+0x1aa [0x2b770cb63d9a]
> > > > >> > [16] Tcl_EvalEx+0x397 [0x2b770cb64367]
> > > > >> > [17] Tcl_FSEvalFile+0x1ed [0x2b770cba5f2d]
> > > > >> > [18] Tcl_EvalFile+0x2e [0x2b770cba5fee]
> > > > >> > [19] _ZN9ScriptTcl3runEPc+0x24 [0x677104]
> > > > >> > [20] main+0x201 [0x4c2671]
> > > > >> > [21] __libc_start_main+0xf4 [0x2b770d178134]
> > > > >> > [22] __gxx_personality_v0+0x109 [0x4bf739]
> > > > >> >Fatal error on PE 0> segmentation violation
> > > > >> >
> > > > >> >2) Using provided binary amd64-TCP:
> > > > >> >
> > > > >> >ENERGY: 9800 20072.8949 13201.1162 1343.0570
> > > > >> >131.8154 -253024.4316 24209.4947 0.0000
> > > > >> >0.0000 53016.0637 -141049.9898 298.7174
> > > > >> >-140572.1009 -140570.0247 299.2188 -1871.3003
> > > > >> >-1682.6774 636056.0000 -1712.7694 -1711.5329
> > > > >> >
> > > > >> >------------- Processor 0 Exiting: Caught Signal ------------
> > > > >> >Signal: segmentation violation
> > > > >> >Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> > > > >> >Stack Traceback:
> > > > >> > [0] /lib/libc.so.6 [0x2ae403d345c0]
> > > > >> > [1] _int_malloc+0xb6 [0x778758]
> > > > >> > [2] mm_malloc+0x53 [0x7785f7]
> > > > >> > [3] malloc+0x16 [0x77c5fc]
> > > > >> > [4] _Znwm+0x1d [0x2ae403bc0b4d]
> > > > >> > [5] _ZN11ResizeArrayI6VectorEC1Ev+0x28 [0x6767e0]
> > > > >> > [6] __cxa_vec_ctor+0x46 [0x2ae403bc22a6]
> > > > >> > [7] _ZN14ProxyResultMsg6unpackEPv+0x62 [0x6e83ca]
> > > > >> > [8] _Z15CkUnpackMessagePP8envelope+0x28 [0x787036]
> > > > >> > [9] _Z15_processHandlerPvP11CkCoreState+0x412 [0x785db2]
> > > > >> > [10] CsdScheduleForever+0xa2 [0x7f2492]
> > > > >> > [11] CsdScheduler+0x1c [0x7f2090]
> > > > >> > [12] _ZN7BackEnd7suspendEv+0xb [0x4ba881]
> > > > >> > [13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fbfe0]
> > > > >> > [14] TclInvokeStringCommand+0x91 [0x80d9b8]
> > > > >> > [15] /home/lmartinez/teste-namd/./namd2 [0x843808]
> > > > >> > [16] Tcl_EvalEx+0x176 [0x843e4b]
> > > > >> > [17] Tcl_EvalFile+0x134 [0x83b854]
> > > > >> > [18] _ZN9ScriptTcl3runEPc+0x14 [0x6fb71e]
> > > > >> > [19] main+0x21b [0x4b6743]
> > > > >> > [20] __libc_start_main+0xf4 [0x2ae403d22134]
> > > > >> > [21] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b28da]
> > > > >> >Fatal error on PE 0> segmentation violation
> > > > >> >
> > > > >> >3) Using amd64 (no-TCP)
> > > > >> >
> > > > >> >------------- Processor 0 Exiting: Caught Signal ------------
> > > > >> >Signal: segmentation violation
> > > > >> >Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> > > > >> >Stack Traceback:
> > > > >> > [0] /lib/libc.so.6 [0x2af45d3b35c0]
> > > > >> > [1] _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded+0x29f0 [0x5377b0]
> > > > >> > [2]
> > > > >> > _ZN20ComputeNonbondedPair7doForceEPP8CompAtomPP7Results+0x5da [0x52fc9e]
> > > > >> > [3] _ZN16ComputePatchPair6doWorkEv+0x85 [0x615dd1]
> > > > >> > [4] _ZN11WorkDistrib12enqueueWorkBEP12LocalWorkMsg+0x16 [0x728d16]
> > > > >> > [5]
> > > > >> >
> > > > >> _ZN19CkIndex_WorkDistrib31_call_enqueueWorkB_LocalWorkMsgEPvP11WorkDistrib+0xf
> > > > >> >[0x728cfd]
> > > > >> > [6] CkDeliverMessageFree+0x21 [0x786a6b]
> > > > >> > [7] _Z15_processHandlerPvP11CkCoreState+0x455 [0x786075]
> > > > >> > [8] CsdScheduleForever+0xa2 [0x7f18a2]
> > > > >> > [9] CsdScheduler+0x1c [0x7f14a0]
> > > > >> > [10] _ZN7BackEnd7suspendEv+0xb [0x4bab01]
> > > > >> > [11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fc260]
> > > > >> > [12] TclInvokeStringCommand+0x91 [0x80cc78]
> > > > >> > [13] /home/lmartinez/teste-namd/./NAMD_2.6_Linux-amd64/namd2 [0x842ac8]
> > > > >> > [14] Tcl_EvalEx+0x176 [0x84310b]
> > > > >> > [15] Tcl_EvalFile+0x134 [0x83ab14]
> > > > >> > [16] _ZN9ScriptTcl3runEPc+0x14 [0x6fb99e]
> > > > >> > [17] main+0x21b [0x4b69c3]
> > > > >> > [18] __libc_start_main+0xf4 [0x2af45d3a1134]
> > > > >> > [19] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b2b5a]
> > > > >> >Fatal error on PE 0> segmentation violation
> > > >
> > >
> >
>
