Re: namd crash: Signal: segmentation violation

From: Brian Bennion (bennion1_at_llnl.gov)
Date: Tue May 22 2007 - 14:39:43 CDT

Thank you for the information. This is indeed an interesting problem.

Where did you get the charm++ source code? Was it
from CVS or part of the original NAMD download?

The checkpoint code appears to be a newer addition
to charm++, but NAMD doesn't use it yet.

To test for weird memory handling, it might be worthwhile
to use an external memory checker like Purify or Electric Fence.

efence is the easiest to use. On RHEL
machines you just need to add -lefence to the
link line, and maybe a path to the library.

Add this to both the charm++ and NAMD link
steps, and compile with -g, to find the exact place in
the code that violates reserved memory.
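
As a rough illustration of what efence catches (this is a made-up
toy function, not code from charm++ or NAMD, and the file name and
library path in the comments are placeholders), consider a
one-byte heap overrun. Built normally, a write like this usually
goes unnoticed. Linked against efence, the allocation is placed next
to an inaccessible page, so the bad write should fault at the
offending line, which a debugger can report when -g is on.

```c
/* Hypothetical build and run steps (file name and path are placeholders):
 *   gcc -g overrun.c -o overrun -lefence      (add -L/path/to/efence if needed)
 *   gdb ./overrun                             (run, then "bt" at the fault)
 */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Fill a heap buffer of n bytes (n > 0) and return its last byte.
 * If overrun is nonzero, deliberately write one byte past the end,
 * which is the kind of access Electric Fence is designed to trap. */
int fill_buffer(size_t n, int overrun)
{
    char *buf = malloc(n);
    if (buf == NULL)
        return -1;

    size_t count = overrun ? n + 1 : n;   /* n + 1 is the intentional bug */
    memset(buf, 0x2a, count);

    int last = buf[n - 1];
    free(buf);
    return last;
}
```

Calling fill_buffer(16, 0) is safe; calling fill_buffer(16, 1) is
the intentionally broken case to try under efence.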

Just my $0.02 at this point.
Brian

At 10:55 AM 5/22/2007, Leandro Martínez wrote:

>Gengbin,
>It is true that the memory paranoid binary fails to run on two processors
>on the same machine, however we can run the simulations stably in a
>single node. Now I think that this may be a problem with the memory
>paranoid binary particularly.
>
>Brian, sorry for not mentioning this before. I have now run all the charm++
>tests. There is one test (with both charmrun and mpirun)
>that clearly has problems. It is the "checkpoint" test in
>directory tests/charm++/chkpt
>
>The log file of one of the compilations is at
>http://limes.iqm.unicamp.br/~lmartinez/charm++_build.log
>There are some errors related to fortran 90 files, but
>it says that charm++ was built successfully.
>
>
>The error does not occur if we run locally with
>two processors (charmrun ++local +p2).
>
>The error persists if we used binaries that run fine on other machines.
>
>We have tested this with two different charm++ compilations, one for
>net-linux-amd64 and another for net-linux-amd64-smp-tcp, and we have
>also tried "mpirun". The errors are the following, and could be related
>to our problem:
>
>Using net-linux-amd64 (from the tests/charm++/chkpt directory):
>The command is: ./charmrun +p4 ++nodelist
>./nodelist ++remote-shell ssh ./hello
>
>Charm++: scheduler running in netpoll mode.
>Running Hello on 4 processors for 8 elements
>myClient. a=123(0x77a95c), b[0]=456(0x77a960), b[1]=789.
>myClient. a=123(0x77a95c), b[0]=456(0x77a960), b[1]=789.
>myClient. a=123(0x77a95c), b[0]=456(0x77a960), b[1]=789.
>[0] Checkpoint starting in log
>Main's PUPer. a=123(0x77a95c), b[0]=456(0x77a960), b[1]=789.
>------------- Processor 1 Exiting: Called CmiAbort ------------
>Reason: Failed to create checkpoint file for group table!
>Stack Traceback:
> [0] CmiAbort+0x55 [0x4cc66b]
> [1] _ZN15CkCheckpointMgr10CheckpointEPKcR10CkCallback+0xc0 [0x4a7674]
> [2]
> _ZN23CkIndex_CkCheckpointMgr26_call_Checkpoint_marshall2EPvP15CkCheckpointMgr+0x9b
> [0x4a7a49]
> [3] CkDeliverMessageFree+0x30 [0x46e818]
> [4]
> /home/lmartinez/namd-compile/NAMD_2.6_Source/charm/net-linux-amd64/tests/charm++/chkpt/./hello
> [0x46e877]
> [5]
> /home/lmartinez/namd-compile/NAMD_2.6_Source/charm/net-linux-amd64/tests/charm++/chkpt/./hello
> [0x46fb59]
> [6]
> /home/lmartinez/namd-compile/NAMD_2.6_Source/charm/net-linux-amd64/tests/charm++/chkpt/./hello
> [0x4720f2]
> [7]
> /home/lmartinez/namd-compile/NAMD_2.6_Source/charm/net-linux-amd64/tests/charm++/chkpt/./hello
> [0x4723a0]
> [8] _Z15_processHandlerPvP11CkCoreState+0x130 [0x4733c0]
> [9] CmiHandleMessage+0xa5 [0x4d36c1]
> [10] CsdScheduleForever+0x75 [0x4d3a82]
> [11] CsdScheduler+0x16 [0x4d39e5]
> [12]
> /home/lmartinez/namd-compile/NAMD_2.6_Source/charm/net-linux-amd64/tests/charm++/chkpt/./hello
> [0x4d1826]
> [13] ConverseInit+0x2f6 [0x4d1cba]
> [14] main+0x2d [0x476b61]
> [15] __libc_start_main+0xf4 [0x2b94b4c03134]
> [16] __gxx_personality_v0+0x91 [0x45d5a9]
>Fatal error on PE 1> Failed to create checkpoint file for group table!
>
>Using mpirun: ( mpirun n0-1 -np 4 ./hello ) from the corresponding
>mpi compiled charm++ test directory.
>
>Running Hello on 4 processors for 8 elements
>myClient. a=123(0x70cb7c), b[0]=456(0x70cb80), b[1]=789.
>myClient. a=123(0x70cb7c), b[0]=456(0x70cb80), b[1]=789.
>myClient. a=123(0x70cb7c), b[0]=456(0x70cb80), b[1]=789.
>[0] Checkpoint starting in log
>------------- Processor 3 Exiting: Called CmiAbort ------------
>Reason: Failed to create checkpoint file for group table!
>------------- Processor 1 Exiting: Called CmiAbort ------------
>Reason: Failed to create checkpoint file for group table!
>Stack Traceback:
> [0] CmiAbort+0x2f [0x4c1710]
> [1] _ZN15CkCheckpointMgr10CheckpointEPKcR10CkCallback+0xba [0x49d8f2]
> [2]
> _ZN23CkIndex_CkCheckpointMgr26_call_Checkpoint_marshall2EPvP15CkCheckpointMgr+0x9b
> [0x49dca3]
> [3] CkDeliverMessageFree+0x2e [0x467a3c]
> [4] ./hello [0x467a99]
> [5] ./hello [0x467b04]
> [6] ./hello [0x46aa87]
> [7] ./hello [0x46adee]
> [8] _Z15_processHandlerPvP11CkCoreState+0x118 [0x46bc18]
> [9] CmiHandleMessage+0x7a [0x4c294b]
> [10] CsdScheduleForever+0x5f [0x4c2ba6]
> [11] CsdScheduler+0x16 [0x4c2b1f]
> [12] ./hello [0x4c13f6]
> [13] ConverseInit+0x2dd [0x4c16df]
> [14] main+0x2b [0x46f1d3]
> [15] __libc_start_main+0xf4 [0x2ba4999ac134]
> [16] __gxx_personality_v0+0x79 [0x457c99]
>Stack Traceback:
> [0] CmiAbort+0x2f [0x4c1710]
> [1] _ZN15CkCheckpointMgr10CheckpointEPKcR10CkCallback+0xba [0x49d8f2]
> [2]
> _ZN23CkIndex_CkCheckpointMgr26_call_Checkpoint_marshall2EPvP15CkCheckpointMgr+0x9b
> [0x49dca3]
> [3] CkDeliverMessageFree+0x2e [0x467a3c]
> [4] ./hello [0x467a99]
> [5] ./hello [0x467b04]
> [6] ./hello [0x46aa87]
> [7] ./hello [0x46adee]
> [8] _Z15_processHandlerPvP11CkCoreState+0x118 [0x46bc18]
> [9] CmiHandleMessage+0x7a [0x4c294b]
> [10] CsdScheduleForever+0x5f [0x4c2ba6]
> [11] CsdScheduler+0x16 [0x4c2b1f]
> [12] ./hello [0x4c13f6]
> [13] ConverseInit+0x2dd [0x4c16df]
> [14] main+0x2b [0x46f1d3]
> [15] __libc_start_main+0xf4 [0x2b147a85d134]
> [16] __gxx_personality_v0+0x79 [0x457c99]
>-----------------------------------------------------------------------------
>One of the processes started by mpirun has exited with a nonzero exit
>code. This typically indicates that the process finished in error.
>If your process did not finish in error, be sure to include a "return
>0" or "exit(0)" in your C code before exiting the application.
>
>PID 30636 failed on node n0 (10.0.0.100) with exit status 1.
>-----------------------------------------------------------------------------
>
>
>One more piece of information:
>
>Just to illustrate another kind of problem we have observed, this one occurred
>just now in an MPI run of the apoa benchmark. The simulation stops because
>some atom is moving too fast. However, check the velocities:
>
>TIMING: 15840 CPU: 9285.78,
>0.509619/step Wall: 9285.78, 0.509619/step,
>705.562 hours remaining, 126605 kB of memory in use.
>ERROR: Atom 48403 velocity is -11.8617 -2.45979e+87 8.48634 (limit is 10000)
>ERROR: Atoms moving too fast; simulation has become unstable.
>ERROR: Exiting prematurely.
>==========================================
>WallClock: 9291.767578 CPUTime: 9291.767578 Memory: 126585 kB
>End of program
>
>Clearly this is data corruption, not a "physical" problem.
>
>Leandro.
>
>On 5/22/07, Brian Bennion <bennion1_at_llnl.gov> wrote:
> >
> > Hello Leandro,
> >
> > I sent several messages last week asking about the charm compilation. Did
> > you get the charm++ test to work?
> > The fact that memory paranoid caught a bad memory access leads me to
> > believe the charm++ underlayer is not compiled correctly.
> >
> > Brian
> >
> >
> >
> > At 06:17 AM 5/22/2007, Leandro Martínez wrote:
> >
> >
> > Hi Gengbin, Brian and others,
> > I have compiled namd for mpi, and the simulation also crashed, with
> > the message given at the end of the email. The same simulation is running
> > in our opteron cluster for more than four days now (more than two million
> > steps).
> >
> > The apoa benchmark also crashed using mpi; only the following message
> > was observed, after step 41460
> > (command line: mpirun n0-1 -np 4 ./namd2 apoa1.namd):
> >
> >
> > -----------------------------------------------------------------------------
> > One of the processes started by mpirun has exited with a nonzero exit
> > code. This typically indicates that the process finished in error.
> > If your process did not finish in error, be sure to include a "return
> > 0" or "exit(0)" in your C code before exiting the application.
> >
> > PID 8711 failed on node n0 (10.0.0.100) due to signal 11.
> >
> > -----------------------------------------------------------------------------
> >
> > Using a binary compiled with "-memory os" and "-thread context"
> > (running with +netpoll) the simulation (the apoa benchmark) crashes before
> > the first timestep, with (same thing with our simulation):
> >
> > Info: Finished startup with 22184 kB of memory in use.
> > ------------- Processor 0 Exiting: Caught Signal ------------
> > Signal: segmentation violation
> > Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> > Stack Traceback:
> > [0] /lib/libc.so.6 [0x2b24659505c0]
> > [1] _ZN10Controller9threadRunEPS_+0 [0x5d4520]
> > Fatal error on PE 0> segmentation violation
> >
> > The best insight we have had, I think, is that the "memory
> > paranoid" executable running our simulation
> > does not crash on dual-processor opteron
> > machines, but crashes on dual-core machines when
> > running with more than one process per node, before
> > the first time step of the simulation. The apoa simulation
> > does not crash before the first time step, but we haven't
> > run it for long.
> > I suspect that there
> > is some problem with memory sharing on dual-core machines.
> > Does anybody else have clusters running with this
> > kind of architecture? If so, what is the amount
> > of memory per node?
> >
> > Clearly we cannot rule out some low-level communication problem.
> > However, as I said before, we have already changed every
> > piece of hardware and software (except the power supplies of the
> > CPUs, I think; could that be the cause, for some odd reason?).
> >
> > Any clue?
> > Leandro.
> >
> > ------------------
> > Crash of our benchmark using the mpi compiled namd with:
> > mpirun n0-1 -np 4 ./namd2 test.namd
> >
> > ENERGY: 6100 20313.1442 13204.7818 1344.4563 138.9099
> > -253294.1741 24260.9686 0.0000 0.0000
> > 53056.0154 -140975.8980 298.9425 -140488.7395 -140496.4680
> > 299.1465 -1706.3395 -1586.1351 636056.0000
> > -1691.7028 -1691.3721
> >
> > FATAL ERROR: pairlist i_upper mismatch!
> > FATAL ERROR: See http://www.ks.uiuc.edu/Research/namd/bugreport.html
> > ------------- Processor 0 Exiting: Called CmiAbort ------------
> > Reason: FATAL ERROR: pairlist i_upper mismatch!
> > FATAL ERROR: See http://www.ks.uiuc.edu/Research/namd/bugreport.html
> >
> > Stack Traceback:
> > [0] CmiAbort+0x2f [0x734220]
> > [1] _Z8NAMD_bugPKc+0x4f [0x4b7aaf]
> > [2]
> > _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded+0x52c
> > [0x54755c]
> > [3]
> > _ZN20ComputeNonbondedPair7doForceEPP8CompAtomPP7Results+0x580
> > [0x50a8f0]
> > [4] _ZN16ComputePatchPair6doWorkEv+0xca [0x5bc5da]
> > [5]
> > _ZN19CkIndex_WorkDistrib31_call_enqueueWorkB_LocalWorkMsgEPvP11WorkDistrib+0xd
> > [0x683a5d]
> > [6] CkDeliverMessageFree+0x2e [0x6d2aa8]
> > [7] ./namd2 [0x6d2b05]
> > [8] ./namd2 [0x6d2b70]
> > [9] ./namd2 [0x6d5af3]
> > [10] ./namd2 [0x6d5e5a]
> > [11] _Z15_processHandlerPvP11CkCoreState+0x118
> > [0x6d6c84]
> > [12] CmiHandleMessage+0x7a [0x73545b]
> > [13] CsdScheduleForever+0x5f [0x7356b6]
> > [14] CsdScheduler+0x16 [0x73562f]
> > [15] _ZN9ScriptTcl3runEPc+0x11a [0x664c4a]
> > [16] main+0x125 [0x4b9ef5]
> > [17] __libc_start_main+0xf4 [0x2ab5f4b1f134]
> > [18] __gxx_personality_v0+0xf1 [0x4b72e9]
> >

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:46:20 CST