Re: namd crash: Signal: segmentation violation

From: Brian Bennion (bennion1_at_llnl.gov)
Date: Thu May 17 2007 - 15:25:48 CDT

Hello,

I have seen similar behavior on one of our old AMD32 clusters. In my
case the cause of the trouble was narrowed down to the ethernet
connections. The messages would be corrupted somewhere in transit.

The errors are slightly different according to the stack traces you
gave. The middle one is stopping in /lib/libc trying to allocate
memory for messages while the other two are stopping in /lib64/libc
during nonbonded calculations.
Are the system and the number of nodes the same for all three dumps that
you provided?

Do you have a debugger that you can use to analyze core files or attach
to running jobs that will fail?

The network cards would be the next hardware to check.
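
As a first pass on the network cards, the per-interface error counters that Linux exposes in /proc/net/dev can be compared across nodes. The Python sketch below is a minimal example under that assumption; the function names parse_net_dev and suspicious_interfaces are invented for illustration. Note that clean counters do not rule out corruption that slips past the card's own checks.

```python
def parse_net_dev(text):
    """Parse the contents of /proc/net/dev into per-interface error counters."""
    counters = {}
    # The first two lines are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(x) for x in data.split()]
        # Receive columns: bytes packets errs drop fifo frame compressed multicast
        # Transmit columns: bytes packets errs drop fifo colls carrier compressed
        counters[name.strip()] = {
            "rx_errs": fields[2],
            "rx_drop": fields[3],
            "rx_frame": fields[5],
            "tx_errs": fields[10],
            "tx_drop": fields[11],
            "tx_carrier": fields[14],
        }
    return counters


def suspicious_interfaces(counters):
    """Return the non-loopback interfaces with any nonzero error counter."""
    return {
        name: c for name, c in counters.items()
        if name != "lo" and any(c.values())
    }
```

On each node, the text can be read with open("/proc/net/dev").read() before and after a test run; counters that grow on one node but not on the others would point at that card or its cable.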

Brian

At 11:34 AM 5/17/2007, you wrote:
>Hi all,
>We are having systematic problems trying to run NAMD on our Linux
>cluster. We are experienced users: we have configured and run
>simulations with NAMD on a dual Opteron cluster running Linux
>(not dual-core) for some years.
>Currently we are working with a test cluster composed
>of two dual-core Athlon X2 processors. We are trying to run a
>test simulation that we have already run with no problems on the older
>cluster, but now the simulation always crashes with a message like:
>
>------------- Processor 0 Exiting: Caught Signal ------------
>Signal: segmentation violation
>Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>Stack Traceback:
> [0] /lib/libc.so.6 [0x2b770d18a5c0]
>... (see all below)
>
>The crashes are random, in the sense that they occur after one,
>two, or twelve hours of simulation, without any correlation with
>simulation time or timestep.
>
>We are having this problem despite having changed almost everything
>in our cluster architecture and configuration. We have tried, for
>example:
>
>1. Running the simulation on the old Opteron 242 machines, which
>worked a few months ago under Fedora 3.0; after we upgraded to
>Fedora 6.0 we started having the same error. This cluster runs
>with Etherboot, the other with PXE.
>
>2. We have tried two different cpu architectures for the Athlon X2 processors.
>
>3. We have tried Fedora 6.0, Ubuntu 7.04 and Gentoo.
>
>4. We have tried different NAMD executables: amd64-TCP,
>amd64 (no TCP), and a build compiled on our cluster under Gentoo.
>
>5. We have run memtest86, and found no problems.
>
>6. We even replaced switches.
>
>Basically we have no idea what else to test. We have seen that
>Cesar Avila had a similar problem once, but apparently he found that it
>was a memory problem which was detected with memtest.
>
>Any information regarding the clusters everyone is using here
>would be very useful, for example:
>
>1. Who is using 64 bit cluster machines? Particularly Amd64 machines.
>2. Which operating systems are being used? (Linux distributions
>and versions, for example).
>3. Which version of NAMD, kernels, compilers....
>4. Whether TCP is being used, and the amount of memory on the master and nodes.
>5. Were the executables of charmm and namd compiled for the cluster,
>or were the distributed binaries used?
>
>Examples of the log file of crashed simulations are below.
>
>We will be very grateful for any information regarding cluster
>experiences with namd, even if your cluster runs with no problems, for
>us to compare architectures.
>
>Thank you very much,
>Leandro.
>
>Examples of errors obtained:
>
>ENERGY: 339900 20337.5812 13287.7936 1320.7207
>123.9507 -257827.1413 25328.9455 0.0000
>0.0000 52952.3084 -144475.8412 298.3581
>-143994.1155 -144005.1537 298.0670 -686.5786
>-510.5889 636056.0000 -473.2155 -474.2152
>
>ERROR: Margin is too small for 1 atoms during timestep 339988.
>ERROR: Incorrect nonbonded forces and energies may be calculated!
>RESCALING VELOCITIES AT STEP 340000 FROM AVERAGE TEMPERATURE OF nan TO
>298.15 KELVIN.
>------------- Processor 0 Exiting: Caught Signal ------------
>Signal: segmentation violation
>Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>Stack Traceback:
> [0] /lib64/libc.so.6 [0x3a4da30210]
> [1] _ZN20ComputeNonbondedUtil16calc_self_energyEP9nonbonded+0x2132 [0x554568]
> [2] _ZN20ComputeNonbondedSelf7doForceEP8CompAtomP7Results+0x434 [0x52e82e]
> [3] _ZN12ComputePatch6doWorkEv+0x77 [0x615481]
> [4] _ZN11WorkDistrib12enqueueSelfAEP12LocalWorkMsg+0x16 [0x728c56]
> [5] _ZN19CkIndex_WorkDistrib31_call_enqueueSelfA_LocalWorkMsgEPvP11WorkDistrib+0xf [0x728c3d]
> [6] CkDeliverMessageFree+0x21 [0x786a6b]
> [7] _Z15_processHandlerPvP11CkCoreState+0x455 [0x786075]
> [8] CsdScheduleForever+0xa2 [0x7f18a2]
> [9] CsdScheduler+0x1c [0x7f14a0]
> [10] _ZN7BackEnd7suspendEv+0xb [0x4bab01]
> [11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fc260]
> [12] TclInvokeStringCommand+0x91 [0x80cc78]
> [13] /home/lmartinez/teste-novaswitch/./namd2 [0x842ac8]
> [14] Tcl_EvalEx+0x176 [0x84310b]
> [15] Tcl_EvalFile+0x134 [0x83ab14]
> [16] _ZN9ScriptTcl3runEPc+0x14 [0x6fb99e]
> [17] main+0x21b [0x4b69c3]
> [18] __libc_start_main+0xf4 [0x3a4da1da44]
> [19] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b2b5a]
>Fatal error on PE 0> segmentation violation
>
>
>Other example:
>
>TIMING: 9800 CPU: 4002.45, 0.407105/step Wall: 4413.52,
>0.450273/step, 12506.4 hours remaining, 110348 kB of memory in use.
>ENERGY: 9800 20072.8949 13201.1162 1343.0570
>131.8154 -253024.4316 24209.4947 0.0000
>0.0000 53016.0637 -141049.9898 298.7174
>-140572.1009 -140570.0247 299.2188 -1871.3003
>-1682.6774 636056.0000 -1712.7694 -1711.5329
>
>------------- Processor 0 Exiting: Caught Signal ------------
>Signal: segmentation violation
>Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>Stack Traceback:
> [0] /lib/libc.so.6 [0x2ae403d345c0]
> [1] _int_malloc+0xb6 [0x778758]
> [2] mm_malloc+0x53 [0x7785f7]
> [3] malloc+0x16 [0x77c5fc]
> [4] _Znwm+0x1d [0x2ae403bc0b4d]
> [5] _ZN11ResizeArrayI6VectorEC1Ev+0x28 [0x6767e0]
> [6] __cxa_vec_ctor+0x46 [0x2ae403bc22a6]
> [7] _ZN14ProxyResultMsg6unpackEPv+0x62 [0x6e83ca]
> [8] _Z15CkUnpackMessagePP8envelope+0x28 [0x787036]
> [9] _Z15_processHandlerPvP11CkCoreState+0x412 [0x785db2]
> [10] CsdScheduleForever+0xa2 [0x7f2492]
> [11] CsdScheduler+0x1c [0x7f2090]
> [12] _ZN7BackEnd7suspendEv+0xb [0x4ba881]
> [13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fbfe0]
> [14] TclInvokeStringCommand+0x91 [0x80d9b8]
> [15] /home/lmartinez/teste-namd/./namd2 [0x843808]
> [16] Tcl_EvalEx+0x176 [0x843e4b]
> [17] Tcl_EvalFile+0x134 [0x83b854]
> [18] _ZN9ScriptTcl3runEPc+0x14 [0x6fb71e]
> [19] main+0x21b [0x4b6743]
> [20] __libc_start_main+0xf4 [0x2ae403d22134]
> [21] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b28da]
>Fatal error on PE 0> segmentation violation
>
>
>
>
>LDB: LOAD: AVG 16.0673 MAX 19.091 MSGS: TOTAL 228 MAXC 14 MAXP 6 None
>LDB: LOAD: AVG 16.0673 MAX 16.3845 MSGS: TOTAL 228 MAXC 14 MAXP 6 Refine
>Info: Benchmark time: 18 CPUs 0.271986 s/step 3.14799 days/ns 45676 kB memory
>Info: Benchmark time: 18 CPUs 0.227697 s/step 2.63539 days/ns 45676 kB memory
>Info: Benchmark time: 18 CPUs 0.251674 s/step 2.91289 days/ns 45676 kB memory
>------------- Processor 0 Exiting: Caught Signal ------------
>Signal: segmentation violation
>Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>Stack Traceback:
> [0] /lib64/libc.so.6 [0x3b1e230210]
> [1] _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded+0x219a [0x536f5a]
> [2] _ZN20ComputeNonbondedPair7doForceEPP8CompAtomPP7Results+0x5da [0x52fc9e]
> [3] _ZN16ComputePatchPair6doWorkEv+0x85 [0x615dd1]
> [4] _ZN11WorkDistrib12enqueueWorkAEP12LocalWorkMsg+0x16 [0x728cd6]
> [5] _ZN19CkIndex_WorkDistrib31_call_enqueueWorkA_LocalWorkMsgEPvP11WorkDistrib+0xf [0x728cbd]
> [6] CkDeliverMessageFree+0x21 [0x786a6b]
> [7] _Z15_processHandlerPvP11CkCoreState+0x455 [0x786075]
> [8] CsdScheduleForever+0xa2 [0x7f18a2]
> [9] CsdScheduler+0x1c [0x7f14a0]
> [10] _ZN7BackEnd7suspendEv+0xb [0x4bab01]
> [11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fc260]
> [12] TclInvokeStringCommand+0x91 [0x80cc78]
> [13] /exports/home/cluster/teste-namd/./namd2 [0x842ac8]
> [14] Tcl_EvalEx+0x176 [0x84310b]
> [15] Tcl_EvalFile+0x134 [0x83ab14]
> [16] _ZN9ScriptTcl3runEPc+0x14 [0x6fb99e]
> [17] main+0x21b [0x4b69c3]
> [18] __libc_start_main+0xf4 [0x3b1e21da44]
> [19] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b2b5a]
>Fatal error on PE 0> segmentation violation
