Re: namd crash: Signal: segmentation violation

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Fri May 18 2007 - 16:28:54 CDT

Leandro,

Also try linking with both
-thread context -memory os
together, and run with +netpoll.

If all of these fail, try an MPI (mpi-linux-amd64) build and see if it works.
Also, please run the apoa1 benchmark, which we know should run if
Charm++ and the low-level communication layer are in good shape. The
error might be a non-Charm++ issue (such as a physics-related bug
either in NAMD or in your test case; I am not an expert on that).
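
The runtime side of these suggestions can be sketched as a small shell
script. This is only a sketch under assumptions: NAMD_DIR, CONFIG, NPROCS
and RUN are hypothetical variable names, the apoa1 input path is a guess at
where the benchmark was unpacked, and placing +netpoll right after namd2 is
assumed. The charmrun options ++local and +p2 are the ones used elsewhere
in this thread.

```shell
#!/bin/sh
# Sketch: compose the charmrun command line for a 2-process local test run.
# NAMD_DIR, CONFIG, NPROCS and RUN are hypothetical names; adjust to your setup.
NAMD_DIR="${NAMD_DIR:-.}"              # directory holding charmrun and namd2
CONFIG="${CONFIG:-apoa1/apoa1.namd}"   # assumed location of the apoa1 benchmark input
NPROCS="${NPROCS:-2}"                  # number of processes to start

namd_cmd() {
    # ++local and +pN appear earlier in this thread; +netpoll is the
    # runtime option suggested above (its position here is an assumption).
    echo "$NAMD_DIR/charmrun ++local +p$NPROCS $NAMD_DIR/namd2 +netpoll $CONFIG"
}

cmd=$(namd_cmd)
echo "$cmd"

# By default the command is only printed; set RUN=1 to execute it.
if [ "${RUN:-0}" = "1" ]; then
    $cmd
fi
```

Running it with CONFIG pointing first at the apoa1 input and then at your
own test case gives a like-for-like comparison between the benchmark and
the failing simulation.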

Gengbin

Leandro Martínez wrote:

> Hi Gengbin,
>
> Using the "-thread context" and "-memory paranoid" together
> didn't help, the simulation crashed as before (before the
> first time-step). Did you mean to remove the "paranoid"
> option for this test?
>
> Using the "-memory os" option, it seems that the "-memory paranoid"
> option was overridden, and now the simulation has started running.
> We don't know yet whether it will be stable for long. We are using the
> +netpoll option.
>
> Thanks,
> Leandro.
>
>
>
>
> On 5/18/07, Gengbin Zheng <gzheng_at_ks.uiuc.edu> wrote:
>
>>
>> Hi,
>>
>> Not sure what the problem is, but here are some quick things to test:
>>
>> 1. link NAMD with
>> -thread context
>> see if that helps
>>
>> 2. link NAMD with
>> -memory os
>> and then run NAMD with +netpoll runtime option.
>>
>> Gengbin
>>
>> Leandro Martínez wrote:
>>
>> > Adding more information:
>> >
>> > The binaries compiled with -memory paranoid crash on all
>> > dual-core CPUs we have tried when using more than
>> > one process, even on the machines that otherwise run
>> > simulations fine. However, on our Opteron 242 machines,
>> > which are not dual-core (but have two independent
>> > processors), the simulation does not crash. It seems
>> > that the memory paranoid option is sensitive to the
>> > dual-core architecture.
>> >
>> > Here,
>> >
>> > http://limes.iqm.unicamp.br/~lmartinez/namd-memoryparanoid.tar.gz
>> >
>> > you can get the binaries compiled with memory paranoid
>> > we are using and the other files of our test simulation.
>> > For testing this crash just unpack the
>> > file, go into the namd-memoryparanoid directory and run
>> >
>> > ./charmrun ++local +p2 ++verbose ./namd2 teste-gentoo.namd
>> >
>> > On the dual-core machines it doesn't get to the first time-step.
>> > Thanks,
>> > Leandro.
>> >
>> >
>> >
>> > On 5/18/07, Leandro Martínez <leandromartinez98_at_gmail.com> wrote:
>> >
>> >> Hi Brian and others,
>> >> Well, in the Athlon 64 cluster the simulation with the old
>> >> 32 bit binary has also crashed, with the following message:
>> >>
>> >> ------------- Processor 2 Exiting: Caught Signal ------------
>> >> Signal: segmentation violation
>> >> Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>> >> req_handle_abort called
>> >> Fatal error on PE 2> segmentation violation
>> >>
>> >> We then compiled namd using the -memory paranoid
>> >> option, with the amd64 and g++ options (NAMD 2.6 for
>> >> Linux-amd64--memory). The good news is that
>> >> the simulations crash faster and reproducibly, with
>> >> the message:
>> >>
>> >> Info: Entering startup phase 8 with 60060 kB of memory in use.
>> >> Info: Finished startup with 60060 kB of memory in use.
>> >> ------------- Processor 0 Exiting: Caught Signal ------------
>> >> Signal: segmentation violation
>> >> Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>> >> Stack Traceback:
>> >> [0] /lib/libc.so.6 [0x2ae55a6745c0]
>> >> [1] _ZN10Controller9threadRunEPS_+0 [0x5d8980]
>> >> Fatal error on PE 0> segmentation violation
>> >>
>> >> This crash also occurs if we run with two processes locally
>> >> on the master (charmrun ++local +p2) and if we
>> >> run with two processes on two nodes (charmrun +p2 ++nodelist
>> >> ./nodelist), with the master plus one node in the nodelist.
>> >>
>> >> However, it seems
>> >> not to crash if we run with only one processor locally
>> >> (charmrun ++local +p1) or if we don't use charmrun (./namd2 ...).
>> >>
>> >> Recall that we are dealing with
>> >> dual-core processors.
>> >>
>> >> Thanks for any help,
>> >> Leandro.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On 5/18/07, Leandro Martínez <leandromartinez98_at_gmail.com> wrote:
>> >> > Sorry, wrong again. The namd2 32 bit binary of our Fedora cluster
>> >> > uses 32 bit libraries, as it should be.
>> >> > Leandro.
>> >> >
>> >> > On 5/18/07, Leandro Martínez <leandromartinez98_at_gmail.com> wrote:
>> >> > > Hi Brian,
>> >> > > Actually I have checked now and the /lib directory
>> >> > > is only a symbolic link to /lib64. Therefore /lib/libc.so.6
>> >> > > is /lib64/libc.so.6. Therefore the use of the wrong library
>> >> > > is not the problem.
>> >> > > Sorry about this one.
>> >> > >
>> >> > > Curiously, however, the 32 bit binary of namd 2.5 we are
>> >> > > trying now uses 64 bit libraries on our Fedora 6.0 Opteron
>> >> > > cluster (running for 36 hours now...), but uses 32 bit libraries
>> >> > > on the Gentoo Athlon 64 cluster (running for four hours now),
>> >> > > as shown by "ldd ./namd2".
>> >> > >
>> >> > > If these runs with the namd 2.5 32 bit binary do not crash,
>> >> > > then the problem must be related to the new namd version
>> >> > > and some library, I guess, in some very particular way that
>> >> > > most people do not see.
>> >> > >
>> >> > > I am going to write again once we are finally confident that these
>> >> > > simulations with "old" binaries are stable.
>> >> > >
>> >> > > Thanks,
>> >> > > Leandro.
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > > On 5/18/07, Brian Bennion <bennion1_at_llnl.gov> wrote:
>> >> > > > Hi Leandro,
>> >> > > > My guess is that the amd64 binary was built on a
>> >> > > > machine where lib64/libc was the default
>> >> > > > path. Now when running it on your Athlon64 which
>> >> > > > appears to be capable of both 32 and 64bit modes
>> >> > > > it is finding /lib/libc as the default
>> >> > > > path. Just conjecture on my part though.
>> >> > > > Looking at the link line in your compile output, I
>> >> > > > can't find where it calls the system libraries
>> >> > > > like libc.so.6. Can you send the makefile?
>> >> > > >
>> >> > > > Confirm that /lib64/libc.so.6 exists on the new Athlon64 cluster.
>> >> > > >
>> >> > > > My next guess would be to try removing the
>> >> > > > /lib entry from the loader config file
>> >> > > > "ld.so.conf". It should be in your /etc dir in
>> >> > > > Red Hat on the Athlon64 machines.
>> >> > > >
>> >> > > > Brian
>> >> > > >
>> >> > > > At 05:25 AM 5/18/2007, Leandro Martínez wrote:
>> >> > > > >Hi Brian,
>> >> > > > >
>> >> > > > >>/lib/libc.so.6 is being called and not /lib64/libc.so.6
>> >> > > > >
>> >> > > > >That's a good point. I have no idea why this library is being
>> >> > > > >called instead of the 64 bit one; this happens with all namd 2.6
>> >> > > > >binaries we tested.
>> >> > > > >
>> >> > > > >- The output of the namd build is at
>> >> > > > >http://limes.iqm.unicamp.br/~lmartinez/compilation.log
>> >> > > > >
>> >> > > > >- When I say that we are using a single node I mean the
>> >> > > > >master and one node. Both are Athlon 64 dual core machines,
>> >> > > > >connected through a gigabit network. Thus the simulations are
>> >> > > > >running with 4 cpus. When the simulation runs only on the
>> >> > > > >master (locally) it doesn't crash. The node is diskless.
>> >> > > > >
>> >> > > > >- The full set of files of our test is available at
>> >> > > > >http://limes.iqm.unicamp.br/~lmartinez/namd-test.tar.gz
>> >> > > > >Uncompressing it you will get a directory named namd-test.
>> >> > > > >Inside it there is a "test.namd" file, which is the input file
>> >> > > > >for namd. The binaries are not there. I can send you some of
>> >> > > > >them if you like, but actually most tests were done with
>> >> > > > >the binaries available at the NAMD site. We are using, for
>> >> > > > >example,
>> >> > > > >charmrun +p4 ++nodelist ./nodelist ++remote-shell ssh ./namd2 test.namd
>> >> > > > >to run the simulation.
>> >> > > > >
>> >> > > > >I will try debugging some run which crashes.
>> >> > > > >
>> >> > > > >One relevant piece of information is that I am now running the
>> >> > > > >same test with a 32 bit binary of NAMD 2.5 (actually NAMD 2.5 for
>> >> > > > >Linux-i686-TCP) on our Opteron cluster with Fedora 6.0 (15 nodes,
>> >> > > > >30 cpus), and it has been running for 30 hours. A few more days
>> >> > > > >and we will be happy on that at least. But note that we have tried
>> >> > > > >running the same simulation with the NAMD 2.6 for Linux-amd64
>> >> > > > >binary, on the same cluster, and we got the segmentation violation
>> >> > > > >error. I will be trying to run the old binary on our new
>> >> > > > >Athlon 64 cluster to see if we get a more stable run.
>> >> > > > >
>> >> > > > >Thanks again,
>> >> > > > >Leandro.
>> >> > > > >
>> >> > > > >
>> >> > > > >
>> >> > > > >
>> >> > > > >
>> >> > > > >
>> >> > > > >>
>> >> > > > >>When you say one node do you mean that the namd
>> >> > > > >>job is running on a single computer with more
>> >> > > > >>than 1 cpu? If this is true then the networking
>> >> > > > >>problem is probably a red herring.
>> >> > > > >>Do you mind sending appropriate files for me to test on our
>> >> > > > >>Opteron clusters?
>> >> > > > >>
>> >> > > > >>The debugger could be gdb if running on one node.
>> >> > > > >>We typically use TotalView.
>> >> > > > >>There are others that might be more friendly.
>> >> > > > >>
>> >> > > > >>I would also suggest a NAMD build that has the -g
>> >> > > > >>switch turned on to collect more debug information.
>> >> > > > >>
>> >> > > > >>At 01:39 PM 5/17/2007, Leandro Martínez wrote:
>> >> > > > >> >Hi Brian, thank you very much for the answer. The stack
>> >> > > > >> >traces I sent before were not obtained on the same hardware.
>> >> > > > >> >The ones that I'm sending below are both from the same
>> >> > > > >> >mini-cluster (one node), but from different namd binaries,
>> >> > > > >> >which I specify. As you will see, they are also the same kind
>> >> > > > >> >of error, but each one with a different stack trace.
>> >> > > > >> >
>> >> > > > >> >I also think there is some corruption of the
>> >> > > > >> >data in transit, but I have no idea how to
>> >> > > > >> >track or solve it. In some of our previous attempts we
>> >> > > > >> >obtained errors in which "an atom moving too fast was
>> >> > > > >> >detected", but the velocity was absurd in only one of the
>> >> > > > >> >three components, thus suggesting that it was corrupted data
>> >> > > > >> >rather than an actual simulation issue. In our present
>> >> > > > >> >configuration we have not seen this problem anymore, only
>> >> > > > >> >these "libc.so.6" issues.
>> >> > > > >> >
>> >> > > > >> >We have already changed the network cards once, and this
>> >> > > > >> >problem appeared in two different clusters, so I wouldn't
>> >> > > > >> >really bet that it is a hardware problem.
>> >> > > > >> >
>> >> > > > >> >Can you suggest some debugger? I'm not familiar with those.
>> >> > > > >> >I have run the simulation using ++debug but I couldn't get any
>> >> > > > >> >meaningful information.
>> >> > > > >> >
>> >> > > > >> >Thanks,
>> >> > > > >> >Leandro.
>> >> > > > >> >
>> >> > > > >> >These are the stack traces of three runs on the same
>> >> > > > >> >cluster, same nodes, different binaries:
>> >> > > > >> >
>> >> > > > >> >1) Using home-compiled namd binary with no fftw:
>> >> > > > >> >
>> >> > > > >> >ENERGY: 10600 19987.4183 13118.5784 1321.5175
>> >> > > > >> >124.6337 -252623.9701 24015.9942 0.0000
>> >> > > > >> >0.0000 52674.4657 -141381.3625 296.7927
>> >> > > > >> >-140902.1798 -140903.3821 298.0487 -1714.5778
>> >> > > > >> >-1687.4130 636056.0000 -1638.2391 -1637.8758
>> >> > > > >> >
>> >> > > > >> >------------- Processor 0 Exiting: Caught Signal ------------
>> >> > > > >> >Signal: segmentation violation
>> >> > > > >> >Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>> >> > > > >> >Stack Traceback:
>> >> > > > >> > [0] /lib/libc.so.6 [0x2b770d18a5c0]
>> >> > > > >> > [1] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE10loadTuplesEv+0x597 [0x4fc3b7]
>> >> > > > >> > [2] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE6doWorkEv+0x12ab [0x50a78b]
>> >> > > > >> > [3] _ZN19CkIndex_WorkDistrib31_call_enqueueBonds_LocalWorkMsgEPvP11WorkDistrib+0xd [0x6986bd]
>> >> > > > >> > [4] CkDeliverMessageFree+0x30 [0x6eb1c0]
>> >> > > > >> > [5] /home/lmartinez/namd-nofftw/./namd2 [0x6eb21f]
>> >> > > > >> > [6] /home/lmartinez/namd-nofftw/./namd2 [0x6ec501]
>> >> > > > >> > [7] /home/lmartinez/namd-nofftw/./namd2 [0x6eea9a]
>> >> > > > >> > [8] /home/lmartinez/namd-nofftw/./namd2 [0x6eed48]
>> >> > > > >> > [9] _Z15_processHandlerPvP11CkCoreState+0x130 [0x6efd68]
>> >> > > > >> > [10] CmiHandleMessage+0xa5 [0x756e11]
>> >> > > > >> > [11] CsdScheduleForever+0x75 [0x7571d2]
>> >> > > > >> > [12] CsdScheduler+0x16 [0x757135]
>> >> > > > >> > [13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x153 [0x676ec3]
>> >> > > > >> > [14] TclInvokeStringCommand+0x64 [0x2b770cb626a4]
>> >> > > > >> > [15] TclEvalObjvInternal+0x1aa [0x2b770cb63d9a]
>> >> > > > >> > [16] Tcl_EvalEx+0x397 [0x2b770cb64367]
>> >> > > > >> > [17] Tcl_FSEvalFile+0x1ed [0x2b770cba5f2d]
>> >> > > > >> > [18] Tcl_EvalFile+0x2e [0x2b770cba5fee]
>> >> > > > >> > [19] _ZN9ScriptTcl3runEPc+0x24 [0x677104]
>> >> > > > >> > [20] main+0x201 [0x4c2671]
>> >> > > > >> > [21] __libc_start_main+0xf4 [0x2b770d178134]
>> >> > > > >> > [22] __gxx_personality_v0+0x109 [0x4bf739]
>> >> > > > >> >Fatal error on PE 0> segmentation violation
>> >> > > > >> >
>> >> > > > >> >2) Using provided binary amd64-TCP:
>> >> > > > >> >
>> >> > > > >> >ENERGY: 9800 20072.8949 13201.1162 1343.0570
>> >> > > > >> >131.8154 -253024.4316 24209.4947 0.0000
>> >> > > > >> >0.0000 53016.0637 -141049.9898 298.7174
>> >> > > > >> >-140572.1009 -140570.0247 299.2188 -1871.3003
>> >> > > > >> >-1682.6774 636056.0000 -1712.7694 -1711.5329
>> >> > > > >> >
>> >> > > > >> >------------- Processor 0 Exiting: Caught Signal ------------
>> >> > > > >> >Signal: segmentation violation
>> >> > > > >> >Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>> >> > > > >> >Stack Traceback:
>> >> > > > >> > [0] /lib/libc.so.6 [0x2ae403d345c0]
>> >> > > > >> > [1] _int_malloc+0xb6 [0x778758]
>> >> > > > >> > [2] mm_malloc+0x53 [0x7785f7]
>> >> > > > >> > [3] malloc+0x16 [0x77c5fc]
>> >> > > > >> > [4] _Znwm+0x1d [0x2ae403bc0b4d]
>> >> > > > >> > [5] _ZN11ResizeArrayI6VectorEC1Ev+0x28 [0x6767e0]
>> >> > > > >> > [6] __cxa_vec_ctor+0x46 [0x2ae403bc22a6]
>> >> > > > >> > [7] _ZN14ProxyResultMsg6unpackEPv+0x62 [0x6e83ca]
>> >> > > > >> > [8] _Z15CkUnpackMessagePP8envelope+0x28 [0x787036]
>> >> > > > >> > [9] _Z15_processHandlerPvP11CkCoreState+0x412 [0x785db2]
>> >> > > > >> > [10] CsdScheduleForever+0xa2 [0x7f2492]
>> >> > > > >> > [11] CsdScheduler+0x1c [0x7f2090]
>> >> > > > >> > [12] _ZN7BackEnd7suspendEv+0xb [0x4ba881]
>> >> > > > >> > [13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fbfe0]
>> >> > > > >> > [14] TclInvokeStringCommand+0x91 [0x80d9b8]
>> >> > > > >> > [15] /home/lmartinez/teste-namd/./namd2 [0x843808]
>> >> > > > >> > [16] Tcl_EvalEx+0x176 [0x843e4b]
>> >> > > > >> > [17] Tcl_EvalFile+0x134 [0x83b854]
>> >> > > > >> > [18] _ZN9ScriptTcl3runEPc+0x14 [0x6fb71e]
>> >> > > > >> > [19] main+0x21b [0x4b6743]
>> >> > > > >> > [20] __libc_start_main+0xf4 [0x2ae403d22134]
>> >> > > > >> > [21] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b28da]
>> >> > > > >> >Fatal error on PE 0> segmentation violation
>> >> > > > >> >
>> >> > > > >> >3) Using amd64 (no-TCP)
>> >> > > > >> >
>> >> > > > >> >------------- Processor 0 Exiting: Caught Signal ------------
>> >> > > > >> >Signal: segmentation violation
>> >> > > > >> >Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>> >> > > > >> >Stack Traceback:
>> >> > > > >> > [0] /lib/libc.so.6 [0x2af45d3b35c0]
>> >> > > > >> > [1] _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded+0x29f0 [0x5377b0]
>> >> > > > >> > [2] _ZN20ComputeNonbondedPair7doForceEPP8CompAtomPP7Results+0x5da [0x52fc9e]
>> >> > > > >> > [3] _ZN16ComputePatchPair6doWorkEv+0x85 [0x615dd1]
>> >> > > > >> > [4] _ZN11WorkDistrib12enqueueWorkBEP12LocalWorkMsg+0x16 [0x728d16]
>> >> > > > >> > [5] _ZN19CkIndex_WorkDistrib31_call_enqueueWorkB_LocalWorkMsgEPvP11WorkDistrib+0xf [0x728cfd]
>> >> > > > >> > [6] CkDeliverMessageFree+0x21 [0x786a6b]
>> >> > > > >> > [7] _Z15_processHandlerPvP11CkCoreState+0x455 [0x786075]
>> >> > > > >> > [8] CsdScheduleForever+0xa2 [0x7f18a2]
>> >> > > > >> > [9] CsdScheduler+0x1c [0x7f14a0]
>> >> > > > >> > [10] _ZN7BackEnd7suspendEv+0xb [0x4bab01]
>> >> > > > >> > [11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fc260]
>> >> > > > >> > [12] TclInvokeStringCommand+0x91 [0x80cc78]
>> >> > > > >> > [13] /home/lmartinez/teste-namd/./NAMD_2.6_Linux-amd64/namd2 [0x842ac8]
>> >> > > > >> > [14] Tcl_EvalEx+0x176 [0x84310b]
>> >> > > > >> > [15] Tcl_EvalFile+0x134 [0x83ab14]
>> >> > > > >> > [16] _ZN9ScriptTcl3runEPc+0x14 [0x6fb99e]
>> >> > > > >> > [17] main+0x21b [0x4b69c3]
>> >> > > > >> > [18] __libc_start_main+0xf4 [0x2af45d3a1134]
>> >> > > > >> > [19] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b2b5a]
>> >> > > > >> >Fatal error on PE 0> segmentation violation

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:44:42 CST