Simulation stalled at startup

From: Sean Law (magicmen_at_hotmail.com)
Date: Mon Jun 16 2008 - 11:01:06 CDT

Hi NAMD List,

Over the past two months, some of my jobs have been stalling at startup. I launch NAMD with:

charmrun ++nodelist nodelist +p14 namd2 +netpoll +giga

I've omitted the full paths for brevity. In case you were wondering, I'm using "+netpoll" because I previously ran into "Stray PME Grid" errors. This option has eliminated them, though it is still unclear why, and it causes some slowdown in performance.

The output/log file is as follows:

Charm++: scheduler running in netpoll mode.
Info: NAMD 2.6 for Linux-amd64
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: and send feedback or bug reports to namd_at_ks.uiuc.edu
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 50900 for net-linux-amd64-iccstatic
Info: Built Wed Aug 30 12:54:51 CDT 2006 by jim on belfast.ks.uiuc.edu
Info: 1 NAMD 2.6 Linux-amd64 14 scw-014 slaw
Info: Running on 14 processors.
Info: 7608 kB of memory in use.
Info: Memory usage based on mallinfo
Info: Configuration file is cmdfile
TCL: Suspending until startup complete.
Info: EXTENDED SYSTEM FILE atpar.45.restart.xsc
Info: SIMULATION PARAMETERS:
Info: TIMESTEP 2
Info: NUMBER OF STEPS 350000
Info: STEPS PER CYCLE 10
Info: PERIODIC CELL BASIS 1 152.105 0 0
Info: PERIODIC CELL BASIS 2 0 114.847 0
Info: PERIODIC CELL BASIS 3 0 0 92.0551
Info: PERIODIC CELL CENTER 0 0 0
Info: WRAPPING WATERS AROUND PERIODIC BOUNDARIES ON OUTPUT.
Info: WRAPPING ALL CLUSTERS AROUND PERIODIC BOUNDARIES ON OUTPUT.
Info: LOAD BALANCE STRATEGY Other
Info: LDB PERIOD 2000 steps
Info: FIRST LDB TIMESTEP 50
Info: LDB BACKGROUND SCALING 1
Info: HOM BACKGROUND SCALING 1
Info: PME BACKGROUND SCALING 1
Info: MAX SELF PARTITIONS 50
Info: MAX PAIR PARTITIONS 20
Info: SELF PARTITION ATOMS 125
Info: PAIR PARTITION ATOMS 200
Info: PAIR2 PARTITION ATOMS 400
Info: MIN ATOMS PER PATCH 100
Info: VELOCITY FILE atpar.45.restart.vel
Info: CENTER OF MASS MOVING INITIALLY? NO
Info: DIELECTRIC 1
Info: EXCLUDE SCALED ONE-FOUR
Info: 1-4 SCALE FACTOR 1
Info: DCD FILENAME atpar.46.dcd
Info: DCD FREQUENCY 500
Info: DCD FIRST STEP 500
Info: DCD FILE WILL CONTAIN UNIT CELL DATA
Info: NO EXTENDED SYSTEM TRAJECTORY OUTPUT
Info: NO VELOCITY DCD OUTPUT
Info: OUTPUT FILENAME atpar.46.restart
Info: BINARY OUTPUT FILES WILL BE USED
Info: NO RESTART FILE
Info: SWITCHING ACTIVE
Info: SWITCHING ON 8.5
Info: SWITCHING OFF 10
Info: PAIRLIST DISTANCE 12.5
Info: PAIRLIST SHRINK RATE 0.01
Info: PAIRLIST GROW RATE 0.01
Info: PAIRLIST TRIGGER 0.3
Info: PAIRLISTS PER CYCLE 2
Info: PAIRLISTS ENABLED
Info: MARGIN 0.45
Info: HYDROGEN GROUP CUTOFF 2.5
Info: PATCH DIMENSION 15.45
Info: ENERGY OUTPUT STEPS 500
Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
Info: TIMING OUTPUT STEPS 1000
Info: LANGEVIN DYNAMICS ACTIVE
Info: LANGEVIN TEMPERATURE 298
Info: LANGEVIN DAMPING COEFFICIENT IS 5 INVERSE PS
Info: LANGEVIN DYNAMICS APPLIED TO HYDROGENS
Info: LANGEVIN PISTON PRESSURE CONTROL ACTIVE
Info: TARGET PRESSURE IS 1 BAR
Info: OSCILLATION PERIOD IS 200 FS
Info: DECAY TIME IS 100 FS
Info: PISTON TEMPERATURE IS 298 K
Info: PRESSURE CONTROL IS GROUP-BASED
Info: INITIAL STRAIN RATE IS 8.14713e-07 8.14713e-07 8.14713e-07
Info: CELL FLUCTUATION IS ISOTROPIC
Info: PARTICLE MESH EWALD (PME) ACTIVE
Info: PME TOLERANCE 1e-06
Info: PME EWALD COEFFICIENT 0.312341
Info: PME INTERPOLATION ORDER 4
Info: PME GRID DIMENSIONS 160 120 96
Info: PME MAXIMUM GRID SPACING 1
Info: Attempting to read FFTW data from FFTW_NAMD_2.6_Linux-amd64.txt
Info: Optimizing 6 FFT steps. 1... 2... 3... 4... 5... 6... Done.
Info: Writing FFTW data to FFTW_NAMD_2.6_Linux-amd64.txt
Info: FULL ELECTROSTATIC EVALUATION FREQUENCY 1
Info: USING VERLET I (r-RESPA) MTS SCHEME.
Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
Info: RIGID BONDS TO HYDROGEN : ALL
Info: ERROR TOLERANCE : 1e-08
Info: MAX ITERATIONS : 100
Info: RIGID WATER USING SETTLE ALGORITHM
Info: RANDOM NUMBER SEED 2187543
Info: USE HYDROGEN BONDS? NO
Info: COORDINATE PDB atpar.45.pdb
Info: STRUCTURE FILE ATPAR.NAMD.psf
Info: PARAMETER file: CHARMM format!
Info: PARAMETERS /mnt/home/slaw/mmtsb2/data/charmm/par_all27_prot_na_cmap.prm
Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS
Info: BINARY COORDINATES atpar.45.restart.coor
Info: SUMMARY OF PARAMETERS:
Info: 257 BONDS
Info: 656 ANGLES
Info: 1127 DIHEDRAL
Info: 70 IMPROPER
Info: 6 CROSSTERM
Info: 164 VDW
Info: 0 VDW_PAIRS
Warning: Ignored 46371 bonds with zero force constants.
Warning: Will get H-H distance in rigid H2O from H-O-H angle.
Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 165540 ATOMS
Info: 119404 BONDS
Info: 94616 ANGLES
Info: 70436 DIHEDRALS
Info: 4327 IMPROPERS
Info: 1594 CROSSTERMS
Info: 0 EXCLUSIONS
Info: 152168 RIGID BONDS
Info: 344452 DEGREES OF FREEDOM
Info: 59743 HYDROGEN GROUPS
Info: TOTAL MASS = 1.02683e+06 amu
Info: TOTAL CHARGE = 2.17427e-05 e
Info: *****************************
Info: Entering startup phase 0 with 70768 kB of memory in use.
Info: Entering startup phase 1 with 70768 kB of memory in use.
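
The point at which the log stops can be checked mechanically rather than by eye. The following is only a minimal sketch, assuming the NAMD output has been redirected to a file. The function names, the `idle_seconds` threshold, and the idea of using the file's modification time as a sign of progress are all assumptions made for illustration, not part of NAMD or charmrun.

```python
import os
import re
import time

# Matches lines such as
# "Info: Entering startup phase 1 with 70768 kB of memory in use."
PHASE_RE = re.compile(r"^Info: Entering startup phase (\d+) with (\d+) kB")


def last_startup_phase(log_text):
    """Return the number of the last startup phase reported, or None."""
    last = None
    for line in log_text.splitlines():
        match = PHASE_RE.match(line.strip())
        if match:
            last = int(match.group(1))
    return last


def looks_stalled(log_path, idle_seconds=600, now=None):
    """Heuristic: True if the log has not been modified for idle_seconds
    (a hypothetical threshold) and its last non-empty line is still a
    startup-phase message."""
    now = time.time() if now is None else now
    with open(log_path) as handle:
        lines = [ln for ln in handle.read().splitlines() if ln.strip()]
    if not lines:
        return False
    idle = now - os.path.getmtime(log_path)
    return idle >= idle_seconds and PHASE_RE.match(lines[-1].strip()) is not None
```

Applied to the log above, `last_startup_phase` would report phase 1. A job script could call `looks_stalled` periodically and cancel the job early instead of waiting for the wall-time limit.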

I have searched the mailing list archives and spoken with the team that manages our cluster, but we have been unable to identify the issue. However, I have noticed that simply resubmitting the job (and getting different nodes) solves the problem. Thus, I'm not sure whether this is a problem with NAMD or a case of bad communication between the nodes. Note that NAMD does not exit at this stage; it appears to sit idle until the wall-time allotted by the queuing system has been exceeded.
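
Since resubmitting on different nodes helps, one way to narrow this down might be to compare the nodelists of stalled and successful runs. The sketch below is a rough heuristic only. It assumes each nodelist file uses the usual charmrun layout of a `group main` line followed by `host <name>` lines, and the host names in the usage note are hypothetical. A node with an intermittent fault could also appear in a successful run, so the result is a hint rather than proof.

```python
from collections import Counter


def hosts_in_nodelist(text):
    """Extract host names from 'host <name> ...' lines of a nodelist."""
    hosts = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "host":
            hosts.append(parts[1])
    return hosts


def suspect_hosts(stalled_nodelists, good_nodelists):
    """Count how often each host appears in stalled runs, skipping hosts
    that also took part in at least one run that started normally."""
    good = set()
    for text in good_nodelists:
        good.update(hosts_in_nodelist(text))
    counts = Counter()
    for text in stalled_nodelists:
        counts.update(set(hosts_in_nodelist(text)) - good)
    return counts.most_common()
```

For example, given two stalled runs on hypothetical hosts (node-a, node-b) and (node-a, node-c), and one good run on (node-b, node-c), `suspect_hosts` returns node-a with a count of 2.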

Any guidance would be greatly appreciated!

Sean Law
Michigan State University
Ph.D. Candidate
