Re: namd crash: Signal: segmentation violation

From: Leandro Martínez (leandromartinez98_at_gmail.com)
Date: Tue May 22 2007 - 12:55:22 CDT

Gengbin,
It is true that the memory-paranoid binary fails to run on two processors
on the same machine; however, we can run the simulations stably on a
single node. I now think this may be a problem specific to the
memory-paranoid binary.

Brian, sorry for not mentioning this before. I have now run all of the
charm++ tests. There is one test (with both charmrun and mpirun)
that clearly has problems: the "checkpoint" test in the
directory tests/charm++/chkpt.

The log file of one of the compilations is at
http://limes.iqm.unicamp.br/~lmartinez/charm++_build.log
There are some errors related to Fortran 90 files, but
the log says that charm++ was built successfully.
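
For reference, these trees are built with the standard charm++ build
script; the invocations for the two targets we tested (listed below) would
be roughly as follows (extra compiler flags omitted, so treat this as
approximate):

  ./build charm++ net-linux-amd64             # plain net-linux-amd64 tree
  ./build charm++ net-linux-amd64 smp tcp     # net-linux-amd64-smp-tcp tree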

The error does not occur if we run locally with two processors (charmrun
++local +p2).

The error persists even when we use binaries that run fine on other machines.

We have tested this with two different charm++ compilations, one for
net-linux-amd64 and the other for net-linux-amd64-smp-tcp, and we have
also tried mpirun. The errors are the following and could be related
to our problem:

Using net-linux-amd64 (from the tests/charm++/chkpt directory):
The command is: ./charmrun +p4 ++nodelist ./nodelist ++remote-shell ssh ./hello

Charm++: scheduler running in netpoll mode.
Running Hello on 4 processors for 8 elements
myClient. a=123(0x77a95c), b[0]=456(0x77a960), b[1]=789.
myClient. a=123(0x77a95c), b[0]=456(0x77a960), b[1]=789.
myClient. a=123(0x77a95c), b[0]=456(0x77a960), b[1]=789.
[0] Checkpoint starting in log
Main's PUPer. a=123(0x77a95c), b[0]=456(0x77a960), b[1]=789.
------------- Processor 1 Exiting: Called CmiAbort ------------
Reason: Failed to create checkpoint file for group table!
Stack Traceback:
  [0] CmiAbort+0x55 [0x4cc66b]
  [1] _ZN15CkCheckpointMgr10CheckpointEPKcR10CkCallback+0xc0 [0x4a7674]
  [2] _ZN23CkIndex_CkCheckpointMgr26_call_Checkpoint_marshall2EPvP15CkCheckpointMgr+0x9b [0x4a7a49]
  [3] CkDeliverMessageFree+0x30 [0x46e818]
  [4] /home/lmartinez/namd-compile/NAMD_2.6_Source/charm/net-linux-amd64/tests/charm++/chkpt/./hello [0x46e877]
  [5] /home/lmartinez/namd-compile/NAMD_2.6_Source/charm/net-linux-amd64/tests/charm++/chkpt/./hello [0x46fb59]
  [6] /home/lmartinez/namd-compile/NAMD_2.6_Source/charm/net-linux-amd64/tests/charm++/chkpt/./hello [0x4720f2]
  [7] /home/lmartinez/namd-compile/NAMD_2.6_Source/charm/net-linux-amd64/tests/charm++/chkpt/./hello [0x4723a0]
  [8] _Z15_processHandlerPvP11CkCoreState+0x130 [0x4733c0]
  [9] CmiHandleMessage+0xa5 [0x4d36c1]
  [10] CsdScheduleForever+0x75 [0x4d3a82]
  [11] CsdScheduler+0x16 [0x4d39e5]
  [12] /home/lmartinez/namd-compile/NAMD_2.6_Source/charm/net-linux-amd64/tests/charm++/chkpt/./hello [0x4d1826]
  [13] ConverseInit+0x2f6 [0x4d1cba]
  [14] main+0x2d [0x476b61]
  [15] __libc_start_main+0xf4 [0x2b94b4c03134]
  [16] __gxx_personality_v0+0x91 [0x45d5a9]
Fatal error on PE 1> Failed to create checkpoint file for group table!
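
Since the abort is in creating the checkpoint file, and the test only fails
when remote processes are involved, one simple check is whether the "log"
checkpoint directory (the one named in "[0] Checkpoint starting in log"
above) can be created and written from every node. A rough sketch, where
node1 and node2 are just placeholders for the hosts in our ./nodelist:

  # node1 node2: replace with the actual hosts listed in ./nodelist
  for h in node1 node2 ; do
      ssh $h "cd /home/lmartinez/namd-compile/NAMD_2.6_Source/charm/net-linux-amd64/tests/charm++/chkpt && mkdir -p log && touch log/write_test && echo $h writes ok"
  done

If some node cannot see or write that directory (for example, if /home is
not mounted there), I would expect exactly this kind of failure.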

Using mpirun (mpirun n0-1 -np 4 ./hello) from the corresponding
MPI-compiled charm++ test directory:

Running Hello on 4 processors for 8 elements
myClient. a=123(0x70cb7c), b[0]=456(0x70cb80), b[1]=789.
myClient. a=123(0x70cb7c), b[0]=456(0x70cb80), b[1]=789.
myClient. a=123(0x70cb7c), b[0]=456(0x70cb80), b[1]=789.
[0] Checkpoint starting in log
------------- Processor 3 Exiting: Called CmiAbort ------------
Reason: Failed to create checkpoint file for group table!
------------- Processor 1 Exiting: Called CmiAbort ------------
Reason: Failed to create checkpoint file for group table!
Stack Traceback:
  [0] CmiAbort+0x2f [0x4c1710]
  [1] _ZN15CkCheckpointMgr10CheckpointEPKcR10CkCallback+0xba [0x49d8f2]
  [2] _ZN23CkIndex_CkCheckpointMgr26_call_Checkpoint_marshall2EPvP15CkCheckpointMgr+0x9b [0x49dca3]
  [3] CkDeliverMessageFree+0x2e [0x467a3c]
  [4] ./hello [0x467a99]
  [5] ./hello [0x467b04]
  [6] ./hello [0x46aa87]
  [7] ./hello [0x46adee]
  [8] _Z15_processHandlerPvP11CkCoreState+0x118 [0x46bc18]
  [9] CmiHandleMessage+0x7a [0x4c294b]
  [10] CsdScheduleForever+0x5f [0x4c2ba6]
  [11] CsdScheduler+0x16 [0x4c2b1f]
  [12] ./hello [0x4c13f6]
  [13] ConverseInit+0x2dd [0x4c16df]
  [14] main+0x2b [0x46f1d3]
  [15] __libc_start_main+0xf4 [0x2ba4999ac134]
  [16] __gxx_personality_v0+0x79 [0x457c99]
Stack Traceback:
  [0] CmiAbort+0x2f [0x4c1710]
  [1] _ZN15CkCheckpointMgr10CheckpointEPKcR10CkCallback+0xba [0x49d8f2]
  [2] _ZN23CkIndex_CkCheckpointMgr26_call_Checkpoint_marshall2EPvP15CkCheckpointMgr+0x9b [0x49dca3]
  [3] CkDeliverMessageFree+0x2e [0x467a3c]
  [4] ./hello [0x467a99]
  [5] ./hello [0x467b04]
  [6] ./hello [0x46aa87]
  [7] ./hello [0x46adee]
  [8] _Z15_processHandlerPvP11CkCoreState+0x118 [0x46bc18]
  [9] CmiHandleMessage+0x7a [0x4c294b]
  [10] CsdScheduleForever+0x5f [0x4c2ba6]
  [11] CsdScheduler+0x16 [0x4c2b1f]
  [12] ./hello [0x4c13f6]
  [13] ConverseInit+0x2dd [0x4c16df]
  [14] main+0x2b [0x46f1d3]
  [15] __libc_start_main+0xf4 [0x2b147a85d134]
  [16] __gxx_personality_v0+0x79 [0x457c99]
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 30636 failed on node n0 (10.0.0.100) with exit status 1.
-----------------------------------------------------------------------------

One more piece of information:

Just to illustrate another kind of problem we have observed, this time in
an MPI run of the apoa benchmark: the simulation stops because some atom
is moving too fast. However, look at the velocity:

TIMING: 15840 CPU: 9285.78, 0.509619/step Wall: 9285.78, 0.509619/step, 705.562 hours remaining, 126605 kB of memory in use.
ERROR: Atom 48403 velocity is -11.8617 -2.45979e+87 8.48634 (limit is 10000)
ERROR: Atoms moving too fast; simulation has become unstable.
ERROR: Exiting prematurely.
==========================================
WallClock: 9291.767578 CPUTime: 9291.767578 Memory: 126585 kB
End of program

Clearly this is data corruption, not a "physical" problem.

Leandro.

On 5/22/07, Brian Bennion <bennion1_at_llnl.gov> wrote:
>
> Hello Leandro,
>
> I sent several messages last week asking about the charm compilation. Did
> you get the charm++ test to work?
> The fact that memory paranoid caught a bad memory access leads me to
> believe the charm++ underlayer is not compiled correctly.
>
> Brian
>
>
>
> At 06:17 AM 5/22/2007, Leandro Martínez wrote:
>
>
> Hi Gengbin, Brian and others,
> I have compiled namd for MPI, and the simulation also crashed, with
> the message given at the end of this email. The same simulation has been
> running on our Opteron cluster for more than four days now (more than
> two million steps).
>
> The apoa benchmark also crashed using MPI; this message was observed
> after step 41460 and was the only output
> (command line: mpirun n0-1 -np 4 ./namd2 apoa1.namd):
>
>
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 8711 failed on node n0 (10.0.0.100) due to signal 11.
>
> -----------------------------------------------------------------------------
>
> Using a binary compiled with "-memory os" and "-thread context"
> (running with +netpoll), the simulation (the apoa benchmark) crashes
> before the first timestep with the following (the same thing happens
> with our simulation):
>
> Info: Finished startup with 22184 kB of memory in use.
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> Stack Traceback:
> [0] /lib/libc.so.6 [0x2b24659505c0]
> [1] _ZN10Controller9threadRunEPS_+0 [0x5d4520]
> Fatal error on PE 0> segmentation violation
>
> The best insight we have had, I think, is that the "memory paranoid"
> executable running our simulation does not crash on dual-processor
> Opteron machines, but it does crash on dual-core machines when running
> with more than one process per node, before the first time step of the
> simulation. The apoa simulation does not crash before the first time
> step, but we haven't run it for long.
> My guess is that there is some problem with memory sharing on dual-core
> machines. Does anybody else have clusters running with this kind of
> architecture? If so, what is the amount of memory per node?
>
> Clearly we cannot rule out some low-level communication problem.
> However, as I said before, we have already changed every piece of
> hardware and software (except, I think, the power supplies of the
> CPUs; could that matter for any odd reason?).
>
> Any clue?
> Leandro.
>
> ------------------
> Crash of our benchmark using the MPI-compiled namd with:
> mpirun n0-1 -np 4 ./namd2 test.namd
>
> ENERGY: 6100 20313.1442 13204.7818 1344.4563 138.9099 -253294.1741 24260.9686 0.0000 0.0000 53056.0154 -140975.8980 298.9425 -140488.7395 -140496.4680 299.1465 -1706.3395 -1586.1351 636056.0000 -1691.7028 -1691.3721
>
> FATAL ERROR: pairlist i_upper mismatch!
> FATAL ERROR: See
> http://www.ks.uiuc.edu/Research/namd/bugreport.html
> ------------- Processor 0 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: pairlist i_upper mismatch!
> FATAL ERROR: See
> http://www.ks.uiuc.edu/Research/namd/bugreport.html
>
> Stack Traceback:
> [0] CmiAbort+0x2f [0x734220]
> [1] _Z8NAMD_bugPKc+0x4f [0x4b7aaf]
> [2] _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded+0x52c [0x54755c]
> [3] _ZN20ComputeNonbondedPair7doForceEPP8CompAtomPP7Results+0x580 [0x50a8f0]
> [4] _ZN16ComputePatchPair6doWorkEv+0xca [0x5bc5da]
> [5] _ZN19CkIndex_WorkDistrib31_call_enqueueWorkB_LocalWorkMsgEPvP11WorkDistrib+0xd [0x683a5d]
> [6] CkDeliverMessageFree+0x2e [0x6d2aa8]
> [7] ./namd2 [0x6d2b05]
> [8] ./namd2 [0x6d2b70]
> [9] ./namd2 [0x6d5af3]
> [10] ./namd2 [0x6d5e5a]
> [11] _Z15_processHandlerPvP11CkCoreState+0x118 [0x6d6c84]
> [12] CmiHandleMessage+0x7a [0x73545b]
> [13] CsdScheduleForever+0x5f [0x7356b6]
> [14] CsdScheduler+0x16 [0x73562f]
> [15] _ZN9ScriptTcl3runEPc+0x11a [0x664c4a]
> [16] main+0x125 [0x4b9ef5]
> [17] __libc_start_main+0xf4 [0x2ab5f4b1f134]
> [18] __gxx_personality_v0+0xf1 [0x4b72e9]
>
