Re: namd crash: Signal: segmentation violation

From: Brian Bennion (bennion1_at_llnl.gov)
Date: Tue May 22 2007 - 10:53:59 CDT

Hello Leandro,

I sent several messages last week asking about
the charm++ compilation. Did you get the charm++ tests to work?
The fact that "-memory paranoid" caught a bad memory
access leads me to believe the charm++ layer underneath is not compiled correctly.
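
If you haven't run them yet, the test programs shipped in the charm++
distribution tree are a quick sanity check. A sketch of what I mean
(the exact path and the +p count depend on your charm build and core
count):

  cd charm/tests/charm++/megatest
  make pgm
  ./charmrun ./pgm +p4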

Brian

At 06:17 AM 5/22/2007, Leandro Martínez wrote:

>Hi Gengbin, Brian and others,
>I have compiled NAMD for MPI, and the simulation also crashed, with
>the message given at the end of this email. The
>same simulation has been running on our Opteron
>cluster for more than four days now (more than two million steps).
>
>The apoa1 benchmark also crashed using MPI; the only message,
>observed after step 41460, was the following
>(command line: mpirun n0-1 -np 4 ./namd2 apoa1.namd):
>
>-----------------------------------------------------------------------------
>One of the processes started by mpirun has exited with a nonzero exit
>code. This typically indicates that the process finished in error.
>If your process did not finish in error, be sure to include a "return
>0" or "exit(0)" in your C code before exiting the application.
>
>PID 8711 failed on node n0 (10.0.0.100) due to signal 11.
>-----------------------------------------------------------------------------
>
>Using a binary linked with "-memory os" and "-thread context"
>(and running with +netpoll), the simulation (the apoa1
>benchmark) crashes before the first timestep
>with the output below (our own simulation behaves the same):
>
>Info: Finished startup with 22184 kB of memory in use.
>------------- Processor 0 Exiting: Caught Signal ------------
>Signal: segmentation violation
>Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
>Stack Traceback:
> [0] /lib/libc.so.6 [0x2b24659505c0]
> [1] _ZN10Controller9threadRunEPS_+0 [0x5d4520]
>Fatal error on PE 0> segmentation violation
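>
>As the message suggests, we could attach a debugger at startup; a
>sketch of what we would run (charmrun's ++debug starts each process
>under gdb in an xterm, so it needs a working X display):
>
>  ./charmrun +p4 ./namd2 ++debug apoa1.namd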
>
>The best insight we have, I think, is that the "-memory
>paranoid" executable running our simulation
>does not crash on dual-processor Opteron
>machines, but crashes on dual-core machines when
>running with more than one process per node, before
>the first timestep of the simulation. The apoa1 simulation
>does not crash before the first timestep, but we haven't
>run it for long.
>My guess is that there is some problem with memory
>sharing on dual-core machines.
>Does anyone else have clusters running with this
>kind of architecture? If so, what is the amount
>of memory per node?
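>
>For reference, this is roughly how we check the memory and core
>count on our nodes (plain Linux commands; the output format varies
>by distribution):
>
>  free -m
>  grep -c ^processor /proc/cpuinfo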
>
>Clearly we cannot rule out some low-level communication problem.
>However, as I said before, we have already changed every
>piece of hardware and software (except, I think, the power
>supplies; could that matter for some odd reason?).
>
>Any clue?
>Leandro.
>
>------------------
>Crash of our benchmark using the mpi compiled namd with:
>mpirun n0-1 -np 4 ./namd2 test.namd
>
>ENERGY:    6100    20313.1442    13204.7818    1344.4563    138.9099    -253294.1741    24260.9686    0.0000    0.0000    53056.0154    -140975.8980    298.9425    -140488.7395    -140496.4680    299.1465    -1706.3395    -1586.1351    636056.0000    -1691.7028    -1691.3721
>
>FATAL ERROR: pairlist i_upper mismatch!
>FATAL ERROR: See
>http://www.ks.uiuc.edu/Research/namd/bugreport.html
>------------- Processor 0 Exiting: Called CmiAbort ------------
>Reason: FATAL ERROR: pairlist i_upper mismatch!
>FATAL ERROR: See
>http://www.ks.uiuc.edu/Research/namd/bugreport.html
>
>Stack Traceback:
> [0] CmiAbort+0x2f [0x734220]
> [1] _Z8NAMD_bugPKc+0x4f [0x4b7aaf]
> [2] _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded+0x52c [0x54755c]
> [3] _ZN20ComputeNonbondedPair7doForceEPP8CompAtomPP7Results+0x580 [0x50a8f0]
> [4] _ZN16ComputePatchPair6doWorkEv+0xca [0x5bc5da]
> [5] _ZN19CkIndex_WorkDistrib31_call_enqueueWorkB_LocalWorkMsgEPvP11WorkDistrib+0xd [0x683a5d]
> [6] CkDeliverMessageFree+0x2e [0x6d2aa8]
> [7] ./namd2 [0x6d2b05]
> [8] ./namd2 [0x6d2b70]
> [9] ./namd2 [0x6d5af3]
> [10] ./namd2 [0x6d5e5a]
> [11] _Z15_processHandlerPvP11CkCoreState+0x118 [0x6d6c84]
> [12] CmiHandleMessage+0x7a [0x73545b]
> [13] CsdScheduleForever+0x5f [0x7356b6]
> [14] CsdScheduler+0x16 [0x73562f]
> [15] _ZN9ScriptTcl3runEPc+0x11a [0x664c4a]
> [16] main+0x125 [0x4b9ef5]
> [17] __libc_start_main+0xf4 [0x2ab5f4b1f134]
> [18] __gxx_personality_v0+0xf1 [0x4b72e9]
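>
>(The mangled frames above can be decoded with c++filt from binutils;
>for example:
>
>  echo _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded | c++filt
>
>prints ComputeNonbondedUtil::calc_pair(nonbonded*).)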
