From: Leandro Martínez (leandromartinez98_at_gmail.com)
Date: Thu May 17 2007 - 13:34:43 CDT
Hi all,
We are having systematic problems trying to run NAMD on our Linux
cluster. We are experienced users: we have configured, used,
and run simulations with NAMD on a dual Opteron cluster running Linux
(not dual-core) for some years.
Currently we are working with a test cluster composed
of two dual-core Athlon X2 processors. We are trying to run a
test simulation that we have already run with no problems on the older
cluster, but now the simulation always crashes with a message like:
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
Stack Traceback:
[0] /lib/libc.so.6 [0x2b770d18a5c0]
... (see the full traces below)
The crashes are random, in the sense that they occur after one,
two, or twelve hours of simulation, without any correlation with
simulation time or timestep.
We have this problem despite having changed almost everything
in our cluster architecture and configuration. We have tried, for
example:
1. Running the simulation on the old Opteron 242 machines, which
worked a few months ago when they ran Fedora 3.0; after we upgraded to
Fedora 6.0 we started getting the same error. This cluster boots
with Etherboot, the other with PXE.
2. We have tried two different CPU architectures for the Athlon X2 processors.
3. We have tried Fedora 6.0, Ubuntu 7.04 and Gentoo.
4. We have tried different NAMD executables, namely amd64-TCP and
amd64 (no TCP), and we have also compiled it on our cluster under Gentoo.
5. We have run memtest86, and found no problems.
6. We even replaced switches.
Basically we have no idea what else to test. We have seen that
Cesar Avila had a similar problem once, but apparently he found that it
was a memory problem, which was detected with memtest.
Any information regarding the clusters everyone is using here
would be very useful, for example:
1. Who is using 64-bit cluster machines? Particularly AMD64 machines.
2. Which operating systems are being used? (Linux distributions
and versions, for example).
3. Which versions of NAMD, kernels, compilers...
4. Whether TCP is being used, and the amount of memory of the master and nodes.
5. Were the executables of charmm and namd compiled for the cluster,
or were the distributed binaries used?
Examples of the log file of crashed simulations are below.
We will be very grateful for any information regarding cluster
experiences with NAMD, even if your cluster runs with no problems, so
that we can compare architectures.
Thank you very much,
Leandro.
Examples of errors obtained:
ENERGY: 339900 20337.5812 13287.7936 1320.7207
123.9507 -257827.1413 25328.9455 0.0000
0.0000 52952.3084 -144475.8412 298.3581
-143994.1155 -144005.1537 298.0670 -686.5786
-510.5889 636056.0000 -473.2155 -474.2152
ERROR: Margin is too small for 1 atoms during timestep 339988.
ERROR: Incorrect nonbonded forces and energies may be calculated!
RESCALING VELOCITIES AT STEP 340000 FROM AVERAGE TEMPERATURE OF nan TO
298.15 KELVIN.
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
Stack Traceback:
[0] /lib64/libc.so.6 [0x3a4da30210]
[1] _ZN20ComputeNonbondedUtil16calc_self_energyEP9nonbonded+0x2132 [0x554568]
[2] _ZN20ComputeNonbondedSelf7doForceEP8CompAtomP7Results+0x434 [0x52e82e]
[3] _ZN12ComputePatch6doWorkEv+0x77 [0x615481]
[4] _ZN11WorkDistrib12enqueueSelfAEP12LocalWorkMsg+0x16 [0x728c56]
[5] _ZN19CkIndex_WorkDistrib31_call_enqueueSelfA_LocalWorkMsgEPvP11WorkDistrib+0xf
[0x728c3d]
[6] CkDeliverMessageFree+0x21 [0x786a6b]
[7] _Z15_processHandlerPvP11CkCoreState+0x455 [0x786075]
[8] CsdScheduleForever+0xa2 [0x7f18a2]
[9] CsdScheduler+0x1c [0x7f14a0]
[10] _ZN7BackEnd7suspendEv+0xb [0x4bab01]
[11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fc260]
[12] TclInvokeStringCommand+0x91 [0x80cc78]
[13] /home/lmartinez/teste-novaswitch/./namd2 [0x842ac8]
[14] Tcl_EvalEx+0x176 [0x84310b]
[15] Tcl_EvalFile+0x134 [0x83ab14]
[16] _ZN9ScriptTcl3runEPc+0x14 [0x6fb99e]
[17] main+0x21b [0x4b69c3]
[18] __libc_start_main+0xf4 [0x3a4da1da44]
[19] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b2b5a]
Fatal error on PE 0> segmentation violation
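In the log above, the "Margin is too small" errors and the "nan" average temperature appear before the segmentation violation. A minimal sketch, assuming the log is plain text in the format shown, of how one might list such warning lines that precede the first crash marker. The function name, the pattern list and the return structure are illustrative assumptions, not part of NAMD.

```python
import re

# Patterns taken from the log excerpt above. The names below are
# hypothetical helpers for illustration, not anything shipped with NAMD.
WARNING_PATTERNS = [
    re.compile(r"^ERROR:"),                  # e.g. "ERROR: Margin is too small ..."
    re.compile(r"\bnan\b", re.IGNORECASE),   # e.g. "AVERAGE TEMPERATURE OF nan"
]
CRASH_MARKER = "Caught Signal"


def find_precrash_warnings(log_lines):
    """Return (warnings, crash_line_number) for a log given as a list of lines.

    warnings is a list of (line_number, text) for lines that match one of
    WARNING_PATTERNS and occur before the first crash marker.
    crash_line_number is None if no crash marker is present.
    """
    warnings = []
    for number, line in enumerate(log_lines, start=1):
        if CRASH_MARKER in line:
            return warnings, number
        if any(pattern.search(line) for pattern in WARNING_PATTERNS):
            warnings.append((number, line.rstrip()))
    return warnings, None


if __name__ == "__main__":
    import sys

    # Usage: python scan_log.py simulation.log
    with open(sys.argv[1]) as handle:
        found, crash_at = find_precrash_warnings(handle.readlines())
    for number, text in found:
        print(f"line {number}: {text}")
    print("crash marker at line:", crash_at)
```

Run against each crashed log, this shows whether a crash was preceded by numerical warnings (as in the example above) or came with no warning at all (as in the example that follows).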
Another example:
TIMING: 9800 CPU: 4002.45, 0.407105/step Wall: 4413.52,
0.450273/step, 12506.4 hours remaining, 110348 kB of memory in use.
ENERGY: 9800 20072.8949 13201.1162 1343.0570
131.8154 -253024.4316 24209.4947 0.0000
0.0000 53016.0637 -141049.9898 298.7174
-140572.1009 -140570.0247 299.2188 -1871.3003
-1682.6774 636056.0000 -1712.7694 -1711.5329
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
Stack Traceback:
[0] /lib/libc.so.6 [0x2ae403d345c0]
[1] _int_malloc+0xb6 [0x778758]
[2] mm_malloc+0x53 [0x7785f7]
[3] malloc+0x16 [0x77c5fc]
[4] _Znwm+0x1d [0x2ae403bc0b4d]
[5] _ZN11ResizeArrayI6VectorEC1Ev+0x28 [0x6767e0]
[6] __cxa_vec_ctor+0x46 [0x2ae403bc22a6]
[7] _ZN14ProxyResultMsg6unpackEPv+0x62 [0x6e83ca]
[8] _Z15CkUnpackMessagePP8envelope+0x28 [0x787036]
[9] _Z15_processHandlerPvP11CkCoreState+0x412 [0x785db2]
[10] CsdScheduleForever+0xa2 [0x7f2492]
[11] CsdScheduler+0x1c [0x7f2090]
[12] _ZN7BackEnd7suspendEv+0xb [0x4ba881]
[13] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fbfe0]
[14] TclInvokeStringCommand+0x91 [0x80d9b8]
[15] /home/lmartinez/teste-namd/./namd2 [0x843808]
[16] Tcl_EvalEx+0x176 [0x843e4b]
[17] Tcl_EvalFile+0x134 [0x83b854]
[18] _ZN9ScriptTcl3runEPc+0x14 [0x6fb71e]
[19] main+0x21b [0x4b6743]
[20] __libc_start_main+0xf4 [0x2ae403d22134]
[21] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b28da]
Fatal error on PE 0> segmentation violation
Another example:
LDB: LOAD: AVG 16.0673 MAX 19.091 MSGS: TOTAL 228 MAXC 14 MAXP 6 None
LDB: LOAD: AVG 16.0673 MAX 16.3845 MSGS: TOTAL 228 MAXC 14 MAXP 6 Refine
Info: Benchmark time: 18 CPUs 0.271986 s/step 3.14799 days/ns 45676 kB memory
Info: Benchmark time: 18 CPUs 0.227697 s/step 2.63539 days/ns 45676 kB memory
Info: Benchmark time: 18 CPUs 0.251674 s/step 2.91289 days/ns 45676 kB memory
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
Stack Traceback:
[0] /lib64/libc.so.6 [0x3b1e230210]
[1] _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded+0x219a [0x536f5a]
[2] _ZN20ComputeNonbondedPair7doForceEPP8CompAtomPP7Results+0x5da [0x52fc9e]
[3] _ZN16ComputePatchPair6doWorkEv+0x85 [0x615dd1]
[4] _ZN11WorkDistrib12enqueueWorkAEP12LocalWorkMsg+0x16 [0x728cd6]
[5] _ZN19CkIndex_WorkDistrib31_call_enqueueWorkA_LocalWorkMsgEPvP11WorkDistrib+0xf
[0x728cbd]
[6] CkDeliverMessageFree+0x21 [0x786a6b]
[7] _Z15_processHandlerPvP11CkCoreState+0x455 [0x786075]
[8] CsdScheduleForever+0xa2 [0x7f18a2]
[9] CsdScheduler+0x1c [0x7f14a0]
[10] _ZN7BackEnd7suspendEv+0xb [0x4bab01]
[11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x122 [0x6fc260]
[12] TclInvokeStringCommand+0x91 [0x80cc78]
[13] /exports/home/cluster/teste-namd/./namd2 [0x842ac8]
[14] Tcl_EvalEx+0x176 [0x84310b]
[15] Tcl_EvalFile+0x134 [0x83ab14]
[16] _ZN9ScriptTcl3runEPc+0x14 [0x6fb99e]
[17] main+0x21b [0x4b69c3]
[18] __libc_start_main+0xf4 [0x3b1e21da44]
[19] _ZNSt8ios_base4InitD1Ev+0x3a [0x4b2b5a]
Fatal error on PE 0> segmentation violation
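The three tracebacks above stop in different functions (calc_self_energy, _int_malloc and calc_pair). A minimal sketch, assuming frames are formatted as "[n] symbol+0xoffset [0xaddress]" as in these logs, of how one might pull out the first frame that is not a shared-library path so that many crashes can be compared quickly. The function name and the regular expression are illustrative assumptions.

```python
import re

# Matches frames such as "[1] _int_malloc+0xb6 [0x778758]" and
# "[0] /lib/libc.so.6 [0x2ae403d345c0]". Frames that were wrapped onto
# two lines by the mailer do not match and are simply skipped.
FRAME_RE = re.compile(
    r"^\s*\[(\d+)\]\s+(\S+?)(?:\+0x[0-9a-fA-F]+)?\s+\[0x[0-9a-fA-F]+\]\s*$"
)


def first_program_frame(trace_lines):
    """Return the symbol of the first frame that is not a library path.

    Frames whose symbol starts with "/" (for example /lib64/libc.so.6)
    are skipped. Returns None if no suitable frame is found.
    """
    for line in trace_lines:
        match = FRAME_RE.match(line)
        if match is None:
            continue
        symbol = match.group(2)
        if symbol.startswith("/"):
            continue
        return symbol
    return None


if __name__ == "__main__":
    trace = [
        "Stack Traceback:",
        "[0] /lib/libc.so.6 [0x2ae403d345c0]",
        "[1] _int_malloc+0xb6 [0x778758]",
        "[2] mm_malloc+0x53 [0x7785f7]",
    ]
    print(first_program_frame(trace))
```

The returned names are still mangled C++ symbols; a demangler such as c++filt can make them readable if it is installed.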