From: Brian Bennion (bennion1_at_llnl.gov)
Date: Thu May 17 2007 - 17:06:15 CDT
Hi Leandro,
I am still a bit confused about why /lib/libc.so.6 is being called
and not /lib64/libc.so.6.
Maybe it doesn't matter.  Jim Phillips might know.
Can you post the output from your build of NAMD?
I just want to see which libraries are being chosen.
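One way to see which libc the loader actually resolves is to run ldd on the namd2 binary. The following is only a minimal Python sketch, assuming ldd is available on the node. The function names (parse_ldd, libraries_for) and the sample output are made up for illustration and are not taken from Leandro's machine.

```python
import re
import subprocess

# Matches ldd lines of the form "libc.so.6 => /lib/libc.so.6 (0x...)".
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")


def parse_ldd(text):
    """Map each shared-library name to the path the loader resolved it to."""
    resolved = {}
    for line in text.splitlines():
        match = LDD_LINE.match(line)
        if match:
            resolved[match.group(1)] = match.group(2)
    return resolved


def libraries_for(binary):
    """Run ldd on a binary (hypothetical helper) and return its resolved libraries."""
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    )
    return parse_ldd(result.stdout)


# Illustrative ldd output only; the addresses and library list are invented.
SAMPLE = (
    "\tlibm.so.6 => /lib/libm.so.6 (0x00002b0000200000)\n"
    "\tlibc.so.6 => /lib/libc.so.6 (0x00002b0000400000)\n"
    "\t/lib64/ld-linux-x86-64.so.2 (0x00002b0000000000)\n"
)

if __name__ == "__main__":
    for name, path in sorted(parse_ldd(SAMPLE).items()):
        print(f"{name} -> {path}")
```

Calling libraries_for("./namd2") on the cluster node would show whether libc.so.6 resolves to /lib or /lib64 for that particular binary.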
When you say one node, do you mean that the NAMD
job is running on a single computer with more
than one CPU?  If so, then the networking
problem is probably a red herring.
Would you mind sending the appropriate files so I can test on our Opteron clusters?
The debugger could be gdb if you are running on one node.
We typically use TotalView.
There are others that might be friendlier.
I would also suggest a NAMD build that has the -g
switch turned on to collect more debug information.
At 01:39 PM 5/17/2007, Leandro Martínez wrote:
>Hi Brian, thank you very much for the answer. The stack traces I sent
>before were not obtained on the same hardware. The ones that I'm
>sending below are both from the same mini-cluster (one node), but
>different namd binaries, which I specify. As you will see, they are also
>the same kind of errors, but each one with a different stack trace.
>
>I also think there is some corruption of the
>data in transit, but I have no idea how to track or solve it. We have in some
>of our previous attempts obtained errors in which "an atom moving too
>fast was detected", but the velocity was absurd in only one of the three
>components, thus suggesting that it was corrupted data rather than
>an actual simulation issue. In our present configuration we have not
>seen this problem anymore, but only these "libc.so.6" issues.
>
>The network cards we have already changed once, and this problem
>appeared in two different clusters, so I wouldn't really bet that it is
>a hardware problem.
>
>Can you suggest some debugger? I'm not familiar with those.
>I have run the simulation using ++debug but I couldn't get any
>meaningful information.
>
>Thanks,
>Leandro.
>
>These are the stack traces of three runs in the same cluster, same
>nodes, different binaries:
>
>1) Using home-compiled namd binary with no fftw:
>
>ENERGY:   10600     19987.4183     13118.5784      1321.5175
>124.6337        -252623.9701     24015.9942         0.0000
>0.0000     52674.4657        -141381.3625       296.7927
>-140902.1798   -140903.3821       298.0487          -1714.5778
>-1687.4130    636056.0000     -1638.2391     -1637.8758
>
>------------- Processor 0 Exiting: Caught Signal ------------
>Signal: segmentation violation
>Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>Stack Traceback:
>  [0] /lib/libc.so.6 [0x2b770d18a5c0]
>  [1] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE10loadTuplesEv+0x597  [0x4fc3b7]
>  [2] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE6doWorkEv+0x12ab  [0x50a78b]
>  [3] _ZN19CkIndex_WorkDistrib31_call_enqueueBonds_LocalWorkMsgEPvP11WorkDistrib+0xd  [0x6986bd]
>  [4] CkDeliverMessageFree+0x30  [0x6eb1c0]
>  [5] /home/lmartinez/namd-nofftw/./namd2 [0x6eb21f]
>  [6] /home/lmartinez/namd-nofftw/./namd2 [0x6ec501]
>  [7] /home/lmartinez/namd-nofftw/./namd2 [0x6eea9a]
>  [8] /home/lmartinez/namd-nofftw/./namd2 [0x6eed48]
>  [9] _Z15_processHandlerPvP11CkCoreState+0x130  [0x6efd68]
>  [10] CmiHandleMessage+0xa5  [0x756e11]
>  [11] CsdScheduleForever+0x75  [0x7571d2]
>  [12] CsdScheduler+0x16  [0x757135]
>  [13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x153  [0x676ec3]
>  [14] TclInvokeStringCommand+0x64  [0x2b770cb626a4]
>  [15] TclEvalObjvInternal+0x1aa  [0x2b770cb63d9a]
>  [16] Tcl_EvalEx+0x397  [0x2b770cb64367]
>  [17] Tcl_FSEvalFile+0x1ed  [0x2b770cba5f2d]
>  [18] Tcl_EvalFile+0x2e  [0x2b770cba5fee]
>  [19] _ZN9ScriptTcl3runEPc+0x24  [0x677104]
>  [20] main+0x201  [0x4c2671]
>  [21] __libc_start_main+0xf4  [0x2b770d178134]
>  [22] __gxx_personality_v0+0x109  [0x4bf739]
>Fatal error on PE 0> segmentation violation
>
>2) Using provided binary amd64-TCP:
>
>ENERGY:    9800     20072.8949     13201.1162      1343.0570
>131.8154        -253024.4316     24209.4947         0.0000
>0.0000     53016.0637        -141049.9898       298.7174
>-140572.1009   -140570.0247       299.2188          -1871.3003
>-1682.6774    636056.0000     -1712.7694     -1711.5329
>
>------------- Processor 0 Exiting: Caught Signal ------------
>Signal: segmentation violation
>Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>Stack Traceback:
>  [0] /lib/libc.so.6 [0x2ae403d345c0]
>  [1] _int_malloc+0xb6  [0x778758]
>  [2] mm_malloc+0x53  [0x7785f7]
>  [3] malloc+0x16  [0x77c5fc]
>  [4] _Znwm+0x1d  [0x2ae403bc0b4d]
>  [5] _ZN11ResizeArrayI6VectorEC1Ev+0x28  [0x6767e0]
>  [6] __cxa_vec_ctor+0x46  [0x2ae403bc22a6]
>  [7] _ZN14ProxyResultMsg6unpackEPv+0x62  [0x6e83ca]
>  [8] _Z15CkUnpackMessagePP8envelope+0x28  [0x787036]
>  [9] _Z15_processHandlerPvP11CkCoreState+0x412  [0x785db2]
>  [10] CsdScheduleForever+0xa2  [0x7f2492]
>  [11] CsdScheduler+0x1c  [0x7f2090]
>  [12] _ZN7BackEnd7suspendEv+0xb  [0x4ba881]
>  [13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122  [0x6fbfe0]
>  [14] TclInvokeStringCommand+0x91  [0x80d9b8]
>  [15] /home/lmartinez/teste-namd/./namd2 [0x843808]
>  [16] Tcl_EvalEx+0x176  [0x843e4b]
>  [17] Tcl_EvalFile+0x134  [0x83b854]
>  [18] _ZN9ScriptTcl3runEPc+0x14  [0x6fb71e]
>  [19] main+0x21b  [0x4b6743]
>  [20] __libc_start_main+0xf4  [0x2ae403d22134]
>  [21] _ZNSt8ios_base4InitD1Ev+0x3a  [0x4b28da]
>Fatal error on PE 0> segmentation violation
>
>3) Using amd64 (no-TCP):
>
>------------- Processor 0 Exiting: Caught Signal ------------
>Signal: segmentation violation
>Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>Stack Traceback:
>  [0] /lib/libc.so.6 [0x2af45d3b35c0]
>  [1] _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded+0x29f0  [0x5377b0]
>  [2] _ZN20ComputeNonbondedPair7doForceEPP8CompAtomPP7Results+0x5da  [0x52fc9e]
>  [3] _ZN16ComputePatchPair6doWorkEv+0x85  [0x615dd1]
>  [4] _ZN11WorkDistrib12enqueueWorkBEP12LocalWorkMsg+0x16  [0x728d16]
>  [5] _ZN19CkIndex_WorkDistrib31_call_enqueueWorkB_LocalWorkMsgEPvP11WorkDistrib+0xf  [0x728cfd]
>  [6] CkDeliverMessageFree+0x21  [0x786a6b]
>  [7] _Z15_processHandlerPvP11CkCoreState+0x455  [0x786075]
>  [8] CsdScheduleForever+0xa2  [0x7f18a2]
>  [9] CsdScheduler+0x1c  [0x7f14a0]
>  [10] _ZN7BackEnd7suspendEv+0xb  [0x4bab01]
>  [11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122  [0x6fc260]
>  [12] TclInvokeStringCommand+0x91  [0x80cc78]
>  [13] /home/lmartinez/teste-namd/./NAMD_2.6_Linux-amd64/namd2 [0x842ac8]
>  [14] Tcl_EvalEx+0x176  [0x84310b]
>  [15] Tcl_EvalFile+0x134  [0x83ab14]
>  [16] _ZN9ScriptTcl3runEPc+0x14  [0x6fb99e]
>  [17] main+0x21b  [0x4b69c3]
>  [18] __libc_start_main+0xf4  [0x2af45d3a1134]
>  [19] _ZNSt8ios_base4InitD1Ev+0x3a  [0x4b2b5a]
>Fatal error on PE 0> segmentation violation
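
The three traces above can be compared directly. Each one starts with a frame [0] inside /lib/libc.so.6 and ends near __libc_start_main, so the position of frame [0] relative to that symbol can be computed for each run. The Python sketch below does this using only the addresses quoted in the traces. The names (TRACES, frame0_offset, OFFSETS) are chosen here for illustration.

```python
import re

# Frame [0] and the __libc_start_main frame, copied from the three traces.
TRACES = {
    "1) home-compiled, no fftw": (
        "[0] /lib/libc.so.6 [0x2b770d18a5c0]\n"
        "[21] __libc_start_main+0xf4  [0x2b770d178134]\n"
    ),
    "2) amd64-TCP": (
        "[0] /lib/libc.so.6 [0x2ae403d345c0]\n"
        "[20] __libc_start_main+0xf4  [0x2ae403d22134]\n"
    ),
    "3) amd64 (no-TCP)": (
        "[0] /lib/libc.so.6 [0x2af45d3b35c0]\n"
        "[18] __libc_start_main+0xf4  [0x2af45d3a1134]\n"
    ),
}

FRAME0 = re.compile(r"\[0\] /lib/libc\.so\.6 \[0x([0-9a-f]+)\]")
START_MAIN = re.compile(r"__libc_start_main\+0x([0-9a-f]+)\s+\[0x([0-9a-f]+)\]")


def frame0_offset(trace):
    """Distance from the start of __libc_start_main to the frame [0] address."""
    frame0 = int(FRAME0.search(trace).group(1), 16)
    offset, address = (int(g, 16) for g in START_MAIN.search(trace).groups())
    symbol_start = address - offset  # where __libc_start_main begins in this run
    return frame0 - symbol_start


OFFSETS = {label: frame0_offset(trace) for label, trace in TRACES.items()}

if __name__ == "__main__":
    for label, value in OFFSETS.items():
        print(f"{label}: frame [0] is {value:#x} bytes past __libc_start_main")
```

All three runs give the same offset (0x12580), so frame [0] points at the same place in libc every time, and the traces only diverge from frame [1] onward (loadTuples, _int_malloc, calc_pair). One plausible reading, which is an assumption and not something confirmed in this thread, is that frame [0] belongs to signal delivery rather than to the code that actually faulted.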