FEP OpenMPI Error

From: Andrew Miglino (miglino_at_udel.edu)
Date: Fri Mar 22 2013 - 10:43:43 CDT

Hi everyone,

I'm running the FEP calculation for methane hydration from the ABF tutorial
to check whether our school's cluster is ready for more intensive
computations. The simulation runs without problems on host computers
(single core). On the cluster, however, it fails in the Tcl script: at the
change of lambda, most system parameters go to nan and a "Low global
exclusion count" error is raised. I've pasted the output below. We're
running the tutorial files from the example-output folder unchanged. Is
this a problem with the way we're using NAMD, or is it a cluster problem?

WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 52400
WRITING COORDINATES TO RESTART FILE AT STEP 52500
FINISHED WRITING RESTART COORDINATES
WRITING VELOCITIES TO RESTART FILE AT STEP 52500
FINISHED WRITING RESTART VELOCITIES
PRESSURE: 52500 191.239 -34.4462 -25.4595 -34.4462 -292.728 -42.4961 -25.4595 -42.4961 -1189.32
GPRESSURE: 52500 375.037 -83.8694 -124.928 285.339 -404.07 -53.1919 85.467 -531.245 -1146.56
PRESSAVG: 52500 111.711 37.6996 123.276 37.6996 289.827 147.165 123.276 147.165 -457.775
GPRESSAVG: 52500 98.4584 55.6138 98.2165 52.925 291.222 157.961 122.309 128.576 -464.359
TIMING: 52500 CPU: 212.939, 0.00408394/step Wall: 212.939, 0.00408393/step, 0 hours remaining, 374.691406 MB of memory in use.
ENERGY: 52500 0.0000 3.0802 0.0000 0.0000 -8845.4311 855.7251 0.0000 0.0000 1469.4566 -6517.1692 299.9246 -7986.6258 -6516.6278 295.9736 -430.2703 -391.8641 24514.0934 -18.7455 -24.8930

WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 52500
TCL: Running FEP window 2: Lambda1 0.1 Lambda2 0.2 [dLambda 0.1 ]
TCL: Setting parameter firsttimestep to 0
TCL: Setting parameter alchLambda to 0.1
Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
Info: NONBONDED TABLE SIZE: 705 POINTS
Info: ABSOLUTE IMPRECISION IN VDWB TABLE ENERGY: 3.30872e-24 AT 10.1458
Info: RELATIVE IMPRECISION IN VDWB TABLE ENERGY: 9.64837e-17 AT 10.1458
TCL: Setting parameter alchLambda2 to 0.2
Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
Info: NONBONDED TABLE SIZE: 705 POINTS
Info: ABSOLUTE IMPRECISION IN VDWB TABLE ENERGY: 3.30872e-24 AT 10.1458
Info: RELATIVE IMPRECISION IN VDWB TABLE ENERGY: 9.64837e-17 AT 10.1458
TCL: Original numsteps 52500 will be ignored.
TCL: Running for 52500 steps
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: Low global exclusion count! (2230 vs 2470) System
unstable or pairlistdist or cutoff too small.

FATAL ERROR: See http://www.ks.uiuc.edu/Research/namd/bugreport.html

FEP: RESETTING FOR NEW FEP WINDOW LAMBDA SET TO 0.1 LAMBDA2 0.2
FEP: WINDOW TO HAVE 2500 STEPS OF EQUILIBRATION PRIOR TO FEP DATA
COLLECTION.
FEP: USING CONSTANT TEMPERATURE OF 300 K FOR FEP CALCULATION
PRESSURE: 0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
GPRESSURE: 0 -nan -nan -nan -nan -nan -nan -nan -nan -nan
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG

ENERGY: 0 0.0000 3.0802 0.0000 0.0000 91756.1386 856.0801 0.0000 0.0000 -nan -nan -nan 92615.2988 -nan -nan -nan -nan 24514.0934 -nan -nan

FATAL ERROR: Low global exclusion count! (2230 vs 2470) System unstable
or pairlistdist or cutoff too small.

FATAL ERROR: See http://www.ks.uiuc.edu/Research/namd/bugreport.html
[0] Stack Traceback:
  [0:0] LrtsAbort+0x50 [0xc029d0]
  [0:1] CmiAbort+0x6 [0xc02a86]
  [0:2] _Z8NAMD_bugPKc+0x77 [0x569667]
  [0:3] _ZN10Controller16compareChecksumsEii+0x77f [0x88ee1f]
  [0:4] _ZN10Controller9integrateEv+0x1220 [0x88ae40]
  [0:5] _ZN10Controller9algorithmEv+0x6b4 [0x886164]
  [0:6] _ZN10Controller9threadRunEPS_+0x7 [0x899617]
  [0:7] CthStartThread+0x1c [0xb1acec]
  [0:8] +0x43b80 [0x2b1653383b80]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 13 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 9 with PID 10309 on
node n137 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[n137:10299] 23 more processes have sent help message help-mpi-api.txt /
mpi-abort
[n137:10299] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
help / error messages
-- end OPENMPI run --
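
To pin down where a run like this first goes wrong, the log can be scanned for nan values and FATAL ERROR lines. The following is a minimal sketch in Python, assuming the output above has been saved to a text file. The function name find_problems is illustrative and not part of NAMD, and the check is a plain text match that knows nothing about NAMD's column layout.

```python
import re

# Matches "nan" or "-nan" only when it stands alone as a
# whitespace-delimited token, as in the PRESSURE and ENERGY lines.
NAN_TOKEN = re.compile(r"(?<!\S)-?nan(?!\S)", re.IGNORECASE)


def find_problems(lines):
    """Return (line_number, kind, text) for each suspicious log line.

    kind is "fatal" for lines containing "FATAL ERROR" and "nan" for
    lines that contain at least one nan value.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if "FATAL ERROR" in text:
            problems.append((number, "fatal", text))
        elif NAN_TOKEN.search(text):
            problems.append((number, "nan", text))
    return problems
```

Calling find_problems on an open file handle for the saved log (the file name is whatever the job wrote its output to) returns the matching lines in order, so the first entry shows the earliest line at which the output became invalid.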
