Re: namd crash: Signal: segmentation violation

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Tue May 22 2007 - 10:49:08 CDT

If it crashes when running on a single node, I believe it has nothing
to do with your communication hardware. It may be due to bad RAM or
some other hardware fault. There may be a hardware sanity-check program
you can run to test the hardware.
If you have more than one dual-core machine, try running NAMD on
different machines (each time on only a single node) to see whether one
machine works better than the others.
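
A minimal sketch of that per-machine test, assuming Python 3.7 or later is available on the node you launch from. The names classify_exit, run_per_host and demo_command are hypothetical helpers, not part of NAMD or Charm++, and demo_command only starts a harmless stand-in process; replace it with whatever starts a single-node NAMD run on the given machine in your setup.

```python
import signal
import subprocess
import sys


def classify_exit(returncode):
    """Turn a subprocess return code into a short human-readable label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exit code {returncode}"


def run_per_host(hosts, make_command):
    """Run one command per host, one after another, and label each result."""
    results = {}
    for host in hosts:
        proc = subprocess.run(make_command(host), capture_output=True, text=True)
        results[host] = classify_exit(proc.returncode)
    return results


def demo_command(host):
    # Hypothetical stand-in: a no-op Python process. A real test would
    # return the command that launches NAMD on `host` only.
    return [sys.executable, "-c", "pass"]
```

Calling run_per_host(["n0", "n1"], demo_command) returns a dictionary with one label per machine, so a machine that consistently reports "killed by SIGSEGV" while the others report "ok" would stand out.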

Gengbin

Leandro Martínez wrote:

>
> Hi Gengbin, Brian and others,
> I have compiled NAMD for MPI, and the simulation also crashed, with
> the message given at the end of this email. The same simulation has
> been running on our Opteron cluster for more than four days now (more
> than two million steps).
>
> The apoa benchmark also crashed using MPI. The message below was
> observed after step 41460 and was the only output
> (command line: mpirun n0-1 -np 4 ./namd2 apoa1.namd):
>
> -----------------------------------------------------------------------------
>
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 8711 failed on node n0 (10.0.0.100) due to signal 11.
> -----------------------------------------------------------------------------
>
>
> Using a binary compiled with "-memory os" and "-thread context"
> (running with +netpoll), the apoa benchmark crashes before the first
> timestep with the output below (our own simulation fails the same way):
>
> Info: Finished startup with 22184 kB of memory in use.
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> Stack Traceback:
> [0] /lib/libc.so.6 [0x2b24659505c0]
> [1] _ZN10Controller9threadRunEPS_+0 [0x5d4520]
> Fatal error on PE 0> segmentation violation
>
> The best insight we have had, I think, is that the "memory paranoid"
> executable running our simulation does not crash on dual-processor
> Opteron machines, but crashes on dual-core machines, before the first
> time step, when running with more than one process per node. The apoa
> simulation does not crash before the first time step, but we haven't
> run it for long.
> My guess is that there is some problem with memory sharing on
> dual-core machines. Does anybody else have clusters running with this
> kind of architecture? If so, what is the amount of memory per node?
>
> Clearly we cannot rule out some low-level communication problem.
> However, as I said before, we have already changed every piece of
> hardware and software (except, I think, the power supplies of the
> CPUs; could that be the cause, for some odd reason?).
>
> Any clue?
> Leandro.
>
> ------------------
> Crash of our benchmark using the mpi compiled namd with:
> mpirun n0-1 -np 4 ./namd2 test.namd
>
> ENERGY: 6100 20313.1442 13204.7818 1344.4563
> 138.9099 -253294.1741 24260.9686 0.0000
> 0.0000 53056.0154 -140975.8980 298.9425
> -140488.7395 -140496.4680 299.1465 -1706.3395
> -1586.1351 636056.0000 -1691.7028 -1691.3721
>
> FATAL ERROR: pairlist i_upper mismatch!
> FATAL ERROR: See http://www.ks.uiuc.edu/Research/namd/bugreport.html
> ------------- Processor 0 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: pairlist i_upper mismatch!
> FATAL ERROR: See http://www.ks.uiuc.edu/Research/namd/bugreport.html
>
> Stack Traceback:
> [0] CmiAbort+0x2f [0x734220]
> [1] _Z8NAMD_bugPKc+0x4f [0x4b7aaf]
> [2] _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded+0x52c [0x54755c]
> [3] _ZN20ComputeNonbondedPair7doForceEPP8CompAtomPP7Results+0x580 [0x50a8f0]
> [4] _ZN16ComputePatchPair6doWorkEv+0xca [0x5bc5da]
> [5] _ZN19CkIndex_WorkDistrib31_call_enqueueWorkB_LocalWorkMsgEPvP11WorkDistrib+0xd [0x683a5d]
> [6] CkDeliverMessageFree+0x2e [0x6d2aa8]
> [7] ./namd2 [0x6d2b05]
> [8] ./namd2 [0x6d2b70]
> [9] ./namd2 [0x6d5af3]
> [10] ./namd2 [0x6d5e5a]
> [11] _Z15_processHandlerPvP11CkCoreState+0x118 [0x6d6c84]
> [12] CmiHandleMessage+0x7a [0x73545b]
> [13] CsdScheduleForever+0x5f [0x7356b6]
> [14] CsdScheduler+0x16 [0x73562f]
> [15] _ZN9ScriptTcl3runEPc+0x11a [0x664c4a]
> [16] main+0x125 [0x4b9ef5]
> [17] __libc_start_main+0xf4 [0x2ab5f4b1f134]
> [18] __gxx_personality_v0+0xf1 [0x4b72e9]
>
