From: Leandro Martínez (leandromartinez98_at_gmail.com)
Date: Tue May 22 2007 - 08:17:33 CDT
Hi Gengbin, Brian and others,
I have compiled NAMD for MPI, and the simulation also crashed, with
the message given at the end of this email. The same simulation has been
running on our Opteron cluster for more than four days now (more than two
million steps).
The apoA1 benchmark also crashed using MPI; the only message, observed
after step 41460, was
(command line: mpirun n0-1 -np 4 ./namd2 apoa1.namd):
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 8711 failed on node n0 (10.0.0.100) due to signal 11.
-----------------------------------------------------------------------------
Using a binary compiled with "-memory os" and "-thread context"
(and running with +netpoll), the simulation (the apoA1 benchmark) crashes before
the first timestep with the following (the same happens with our own simulation):
Info: Finished startup with 22184 kB of memory in use.
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
Stack Traceback:
[0] /lib/libc.so.6 [0x2b24659505c0]
[1] _ZN10Controller9threadRunEPS_+0 [0x5d4520]
Fatal error on PE 0> segmentation violation
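(For reference, the net build is launched with something along the lines of

  ./charmrun ./namd2 +p4 +netpoll apoa1.namd

and, following the suggestion in the message, a debugger can be attached by
adding ++debug to the charmrun line:

  ./charmrun ++debug ./namd2 +p4 +netpoll apoa1.namd

The charmrun path and the +p4 count above are only illustrative; just the
+netpoll and ++debug options come from the runs and messages described here.)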
The best insight we have, I think, is that the "-memory
paranoid" executable running our simulation
does not crash on dual-processor Opteron
machines, but does crash on dual-core machines when
running with more than one process per node, before
the first time step of the simulation. The apoA1 simulation
does not crash before the first time step, but we haven't
run it for long.
My guess is that there
is some problem with memory sharing on dual-core machines.
Does anybody else have clusters running with this
kind of architecture? If so, how much memory
do you have per node?
Clearly we cannot rule out some low-level communication problem.
However, as I said before, we have already changed every
piece of hardware and software (except, I think, the power supplies
of the CPUs; could that be the cause for any odd reason?).
Any clue?
Leandro.
------------------
Crash of our benchmark using the MPI-compiled NAMD with:
mpirun n0-1 -np 4 ./namd2 test.namd
ENERGY:    6100     20313.1442     13204.7818      1344.4563       138.9099   -253294.1741     24260.9686         0.0000         0.0000     53056.0154   -140975.8980       298.9425   -140488.7395   -140496.4680       299.1465     -1706.3395     -1586.1351    636056.0000     -1691.7028     -1691.3721
FATAL ERROR: pairlist i_upper mismatch!
FATAL ERROR: See http://www.ks.uiuc.edu/Research/namd/bugreport.html
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: pairlist i_upper mismatch!
FATAL ERROR: See http://www.ks.uiuc.edu/Research/namd/bugreport.html
Stack Traceback:
[0] CmiAbort+0x2f [0x734220]
[1] _Z8NAMD_bugPKc+0x4f [0x4b7aaf]
[2] _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded+0x52c [0x54755c]
[3] _ZN20ComputeNonbondedPair7doForceEPP8CompAtomPP7Results+0x580 [0x50a8f0]
[4] _ZN16ComputePatchPair6doWorkEv+0xca [0x5bc5da]
[5] _ZN19CkIndex_WorkDistrib31_call_enqueueWorkB_LocalWorkMsgEPvP11WorkDistrib+0xd [0x683a5d]
[6] CkDeliverMessageFree+0x2e [0x6d2aa8]
[7] ./namd2 [0x6d2b05]
[8] ./namd2 [0x6d2b70]
[9] ./namd2 [0x6d5af3]
[10] ./namd2 [0x6d5e5a]
[11] _Z15_processHandlerPvP11CkCoreState+0x118 [0x6d6c84]
[12] CmiHandleMessage+0x7a [0x73545b]
[13] CsdScheduleForever+0x5f [0x7356b6]
[14] CsdScheduler+0x16 [0x73562f]
[15] _ZN9ScriptTcl3runEPc+0x11a [0x664c4a]
[16] main+0x125 [0x4b9ef5]
[17] __libc_start_main+0xf4 [0x2ab5f4b1f134]
[18] __gxx_personality_v0+0xf1 [0x4b72e9]