NAMD hangs on Bproc cluster

From: Rene Salmon (rsalmon_at_tulane.edu)
Date: Wed Jun 08 2005 - 11:55:00 CDT

Hi List,

We are having a strange problem with NAMD: it starts running on
our cluster and then, all of a sudden, it stops and goes to sleep or
waits for something.

This is on an AMD64 Bproc cluster. Here is what the running/sleeping
NAMD job looks like:

0 S parag 10950 10937 0 75 0 - 1969 - 11:13 ? 00:00:00 charmrun ++skipmaster ++verbose ++debug ++nodelist nodelist.txt ++p 8 namd2 run_0025.conf
0 S parag 10951 10950 0 75 0 - 8943 ghost_ 11:13 ? 00:00:04 [namd2]
0 S parag 10952 10950 0 75 0 - 7733 ghost_ 11:13 ? 00:00:05 [namd2]
0 S parag 10953 10950 0 75 0 - 7299 ghost_ 11:13 ? 00:00:04 [namd2]
0 S parag 10954 10950 0 75 0 - 7357 ghost_ 11:13 ? 00:00:04 [namd2]
0 S parag 10955 10950 0 75 0 - 9449 ghost_ 11:13 ? 00:00:06 [namd2]
0 S parag 10956 10950 0 75 0 - 8986 ghost_ 11:13 ? 00:00:06 [namd2]
0 S parag 10957 10950 0 75 0 - 8001 ghost_ 11:13 ? 00:00:05 [namd2]
0 S parag 10958 10950 0 75 0 - 7921 ghost_ 11:13 ? 00:00:04 [namd2]
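
For anyone who wants to check a listing like this programmatically, here is a minimal Python sketch that pulls the state, wait channel and CPU time out of each line. It assumes the listing has the column layout of `ps -elf` (F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD); the command that produced the listing is not stated, so that layout is a guess, and the function names are only illustrative.

```python
# Sketch: summarize state and wait channel per process from ps-style lines.
# Assumption: columns are F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD.
# Three sample lines taken from the listing, with wrapped lines rejoined.
SAMPLE = """\
0 S parag 10950 10937 0 75 0 - 1969 - 11:13 ? 00:00:00 charmrun ++skipmaster ++verbose ++debug ++nodelist nodelist.txt ++p 8 namd2 run_0025.conf
0 S parag 10951 10950 0 75 0 - 8943 ghost_ 11:13 ? 00:00:04 [namd2]
0 S parag 10952 10950 0 75 0 - 7733 ghost_ 11:13 ? 00:00:05 [namd2]
"""


def parse_ps_line(line):
    """Split one line into its 15 columns; the last column keeps the full command."""
    fields = line.split(None, 14)
    return {
        "state": fields[1],
        "pid": int(fields[3]),
        "ppid": int(fields[4]),
        "wchan": fields[10],
        "cpu_time": fields[13],
        "cmd": fields[14],
    }


def summarize(text):
    """Parse every non-empty line of a ps listing."""
    return [parse_ps_line(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    for proc in summarize(SAMPLE):
        print(proc["pid"], proc["state"], proc["wchan"], proc["cpu_time"], proc["cmd"])
```

In the listing above, every namd2 process is in state S with wait channel ghost_ and only a few seconds of accumulated CPU time.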

It just hangs here forever. The full stdout/stderr log is attached,
but here are its last few lines:

ENERGY: 0 497.4372 2945.9299 697.7718 26.8610 -65990.2474 457.9247 0.0000 0.0000 8213.8267 -53150.4962 319.2297 -53065.4816 -53065.4816 319.2297 -4443.1718 -156.8539 365061.0449 -4443.1718 -156.8539

ENERGY: 0 497.4372 2945.9299 697.7718 26.8610 -65990.2474 457.9247 0.0000 0.0000 8213.8267 -53150.4962 319.2297 -53065.4816 -53065.4816 319.2297 -4443.1718 -156.8539 365061.0449 -4443.1718 -156.8539

OPENING EXTENDED SYSTEM TRAJECTORY FILE
OPENING EXTENDED SYSTEM TRAJECTORY FILE
Info: Initial time: 8 CPUs 0.156147 s/step 0.903626 days/ns 40608 kB memory
LDB: LOAD: AVG 1.68147 MAX 2.42618 MSGS: TOTAL 152 MAXC 28 MAXP 7 None
LDB: LOAD: AVG 1.68147 MAX 2.42618 MSGS: TOTAL 152 MAXC 28 MAXP 7 None
LDB: LOAD: AVG 1.68147 MAX 1.80888 MSGS: TOTAL 152 MAXC 28 MAXP 7 Alg7
LDB: LOAD: AVG 1.68147 MAX 1.80888 MSGS: TOTAL 152 MAXC 28 MAXP 7 Alg7
LDB: LOAD: AVG 1.68147 MAX 1.71406 MSGS: TOTAL 152 MAXC 28 MAXP 7 Alg7
LDB: LOAD: AVG 1.68147 MAX 1.71406 MSGS: TOTAL 152 MAXC 28 MAXP 7 Alg7
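
As a sanity check on the "Initial time" line, the days/ns figure can be recomputed from the s/step figure. The short Python sketch below does that arithmetic; it takes the 2 fs timestep from the "Info: TIMESTEP 2" line in the attached log, and the function name is only illustrative.

```python
# Sketch: recompute days/ns from seconds per step and the timestep.
SECONDS_PER_STEP = 0.156147  # from the "Initial time" line
TIMESTEP_FS = 2.0            # from "Info: TIMESTEP 2" (NAMD timesteps are in fs)
FS_PER_NS = 1.0e6
SECONDS_PER_DAY = 86400.0


def days_per_ns(seconds_per_step, timestep_fs):
    """Wall-clock days needed to simulate one nanosecond."""
    steps_per_ns = FS_PER_NS / timestep_fs
    return seconds_per_step * steps_per_ns / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{days_per_ns(SECONDS_PER_STEP, TIMESTEP_FS):.4f} days/ns")
```

The result agrees with the 0.903626 days/ns in the log to within the rounding of the printed s/step value, so the timing reported before the hang is at least self-consistent.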

----------------

Any ideas? Thank you in advance for any help with this.

Rene

Charmrun> charmrun started...
Charmrun> using nodelist.txt as nodesfile
Charmrun> node programs all started
Charmrun> node programs all connected
Charmrun> adding client 0: "1", IP:10.0.0.101
Charmrun> adding client 1: "2", IP:10.0.0.102
Charmrun> adding client 2: "3", IP:10.0.0.103
Charmrun> adding client 3: "4", IP:10.0.0.104
Charmrun> adding client 4: "1", IP:10.0.0.101
Charmrun> adding client 5: "2", IP:10.0.0.102
Charmrun> adding client 6: "3", IP
Info: Based on Charm++/Converse 0143163 for net-linux-amd64-clustermatic
Info: Built Wed Mar 30 14:30:04 CST 2005 by root on ares
Info: Sending usage information to NAMD developers via UDP. Sent data is:
Info: 1 NAMD 2.5 Linux-amd64-Clustermatic 8 n1 25581
Info: Running on 8 processors.
Info: 8375 kB of memory in use.
Measuring processor speeds... Done.
Info: Based on Charm++/Converse 0143163 for net-linux-amd64-clustermatic
Info: Built Wed Mar 30 14:30:04 CST 2005 by root on ares
Info: Configuration file is run_0025.conf
Info: Sending usage information to NAMD developers via UDP. Sent data is:
Info: 1 NAMD 2.5 Linux-amd64-Clustermatic 8 n1 25581
Info: Running on 8 processors.
omplete.
Info: Changed directory to /scratch-cluster/parag/ares/work5
Info: EXTENDED SYSTEM FILE restrt_0017.xsc
Info: SIMULATION PARAMETERS:
Info: TIMESTEP 2
Info: NUMBER OF STEPS 1000000
Info: STEPS PER CYCLE 4
Info: PERIODIC CELL BASIS 1 45.379 0 0
Info: PERIODIC CELL BASIS 2 0 40.5114 0
Info: PERIODIC CELL BASIS 3 0 0 198.579
Info: PERIODIC CELL CENTER 0 0 0
Info: WRAPPING ALL CLUSTERS AROUND PERIODIC BOUNDARIES ON OUTPUT.
Info: LOAD BALANCE STRATEGY Other
Info: LDB PERIOD 800 steps
Info: FIRST LDB TIMESTEP 20
Info: LDB BACKGROUND SCALING 1
Info: HOM BACKGROUND SCALING 1
Info: PME BACKGROUND SCALING 1
Info: MAX SELF PARTITIONS 50
Info: MAX PAIR PARTITIONS 20
Info: SELF PARTITION ATOMS 125
Info: PAIR PARTITION ATOMS 200
Info: PAIR2 PARTITION ATOMS 400
Info: INITIAL TEMPERATURE 323
Info: CENTER OF MASS MOVING? NO
Info: DIELECTRIC 1
Info: EXCLUDE SCALED ONE-FOUR
Info: 1-4 SCALE FACTOR 1
Info: DCD FILENAME dcd_0025
Info: DCD FREQUENCY 10000
Warning: INITIAL COORDINATES WILL NOT BE WRITTEN TO DCD FILE
Info: DCD FILE WILL CONTAIN UNIT CELL DATA
Info: XST FILENAME cell_0025
Info: XST FREQUENCY 100
Info: NO VELOCITY DCD OUTPUT
Info: OUTPUT FILENAME minim_0025
Info: RESTART FILENAME restrt_0025
Info: RESTART FREQUENCY 100000
Info: BINARY RESTART FILES WILL BE USED
Info: SWITCHING ACTIVE
Info: SWITCHING ON 9
Info: SWITCHING OFF 12
Info: PAIRLIST DISTANCE 16
Info: PAIRLIST SHRINK RATE 0.01
Info: PAIRLIST GROW RATE 0.01
Info: PAIRLIST TRIGGER 0.3
Info: PAIRLISTS PER CYCLE 2
Info: PAIRLISTS ENABLED
Info: MARGIN 1.11
Info: HYDROGEN GROUP CUTOFF 2.5
Info: PATCH DIMENSION 19.61
Info: ENERGY OUTPUT STEPS 100
Info: TIMING OUTPUT STEPS 500000
Info: PRESSURE OUTPUT STEPS 100
Info: FIXED ATOMS ACTIVE
Info: LANGEVIN DYNAMICS ACTIVE
Info: LANGEVIN TEMPERATURE 323
Info: LANGEVIN DAMPING COEFFICIENT IS 5 INVERSE PS
Info: LANGEVIN DYNAMICS APPLIED TO HYDROGENS
Info: EXCLUDE FROM PRESSURE ACTIVE
Info: LANGEVIN PISTON PRESSURE CONTROL ACTIVE
Info: TARGET PRESSURE IS 1.01325 BAR
Info: OSCILLATION PERIOD IS 500 FS
Info: DECAY TIME IS 300 FS
Info: PISTON TEMPERATURE IS 323 K
Info: PRESSURE CONTROL IS GROUP-BASED
Info: INITIAL STRAIN RATE IS 5.30331e-06 -3.44143e-05 -8.25525e-06
Info: CELL FLUCTUATION IS ANISOTROPIC
Info: SURFACE TENSION CONTROL ACTIVE
Info: TARGET SURFACE TENSION IS 55 DYN/CM
Info: PARTICLE MESH EWALD (PME) ACTIVE
Info: PME TOLERANCE 1e-09
Info: PME EWALD COEFFICIENT 0.33586
Info: PME INTERPOLATION ORDER 6
Info: PME GRID DIMENSIONS 60 60 200
Info: Attempting to read FFTW data from FFTW_NAMD_2.5_Linux-amd64-Clustermatic.txt
Info: Optimizing 6 FFT steps. 1... 2... 3... 4... 5... 6... Done.
Info: Writing FFTW data to FFTW_NAMD_2.5_Linux-amd64-Clustermatic.txt
Info: FULL ELECTROSTATIC EVALUATION FREQUENCY 2
Info: USING VERLET I (r-RESPA) MTS SCHEME.
Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
Info: RIGID BONDS TO HYDROGEN : ALL
Info: ERROR TOLERANCE : 1e-08
Info: MAX ITERATIONS : 100
Info: RIGID WATER USING SETTLE ALGORITHM
Info: 8375 kB of memory in use.
Measuring processor speeds...
Info: NONBONDED FORCES EVALUATED EVERY 2 STEPS
Info: RANDOM NUMBER SEED 12345
Info: USE HYDROGEN BONDS? NO
Info: COORDINATE PDB run_0017.pdb
Info: STRUCTURE FILE dppcwat_monosysf2.psf
Info: PARAMETER file: CHARMM format!
Info: PARAMETERS par_all22_prot_lipmod.inp
Info: SUMMARY OF PARAMETERS:
Info: 165 BONDS
Info: 412 ANGLES
Info: 491 DIHEDRAL
Info: 43 IMPROPER
Info: 73 VDW
Info: 5 VDW_PAIRS
Warning: Ignored 2984 bonds with zero force constants.
 Done.
Info: Configuration file is run_0025.conf
TCL: Suspending until startup complete.
Info: Changed directory to /scratch-cluster/parag/ares/work5
Info: EXTENDED SYSTEM FILE restrt_0017.xsc
Info: SIMULATION PARAMETERS:
Info: TIMESTEP 2
Info: NUMBER OF STEPS 1000000
Info: STEPS PER CYCLE 4
Info: PERIODIC CELL BASIS 1 45.379 0 0
Info: PERIODIC CELL BASIS 2 0 40.5114 0
Info: PERIODIC CELL BASIS 3 0 0 198.579
Info: PERIODIC CELL CENTER 0 0 0
Info: WRAPPING ALL CLUSTERS AROUND PERIODIC BOUNDARIES ON OUTPUT.
Info: LOAD BALANCE STRATEGY Other
Info: LDB PERIOD 800 steps
Info: FIRST LDB TIMESTEP 20
Info: LDB BACKGROUND SCALING 1
Info: HOM BACKGROUND SCALING 1
Info: PME BACKGROUND SCALING 1
Info: MAX SELF PARTITIONS 50
Info: MAX PAIR PARTITIONS 20
Info: SELF PARTITION ATOMS 125
Info: PAIR PARTITION ATOMS 200
Info: PAIR2 PARTITION ATOMS 400
Info: INITIAL TEMPERATURE 323
Info: CENTER OF MASS MOVING? NO
Info: DIELECTRIC 1
Info: EXCLUDE SCALED ONE-FOUR
Info: 1-4 SCALE FACTOR 1
Info: DCD FILENAME dcd_0025
Info: DCD FREQUENCY 10000
Warning: INITIAL COORDINATES WILL NOT BE WRITTEN TO DCD FILE
Info: DCD FILE WILL CONTAIN UNIT CELL DATA
Info: XST FILENAME cell_0025
Info: XST FREQUENCY 100
Info: NO VELOCITY DCD OUTPUT
Info: OUTPUT FILENAME minim_0025
Info: RESTART FILENAME restrt_0025
Info: RESTART FREQUENCY 100000
Info: BINARY RESTART FILES WILL BE USED
Info: SWITCHING ACTIVE
Info: SWITCHING ON 9
Info: SWITCHING OFF 12
Info: PAIRLIST DISTANCE 16
Info: PAIRLIST SHRINK RATE 0.01
Info: PAIRLIST GROW RATE 0.01
Info: PAIRLIST TRIGGER 0.3
Info: PAIRLISTS PER CYCLE 2
Info: PAIRLISTS ENABLED
Info: MARGIN 1.11
Info: HYDROGEN GROUP CUTOFF 2.5
Info: PATCH DIMENSION 19.61
Info: ENERGY OUTPUT STEPS 100
Info: TIMING OUTPUT STEPS 500000
Info: PRESSURE OUTPUT STEPS 100
Info: FIXED ATOMS ACTIVE
Info: LANGEVIN DYNAMICS ACTIVE
Info: LANGEVIN TEMPERATURE 323
Info: LANGEVIN DAMPING COEFFICIENT IS 5 INVERSE PS
Info: LANGEVIN DYNAMICS APPLIED TO HYDROGENS
Info: EXCLUDE FROM PRESSURE ACTIVE
Info: LANGEVIN PISTON PRESSURE CONTROL ACTIVE
Info: TARGET PRESSURE IS 1.01325 BAR
Info: OSCILLATION PERIOD IS 500 FS
Info: DECAY TIME IS 300 FS
Info: PISTON TEMPERATURE IS 323 K
Info: PRESSURE CONTROL IS GROUP-BASED
Info: INITIAL STRAIN RATE IS 5.30331e-06 -3.44143e-05 -8.25525e-06
Info: CELL FLUCTUATION IS ANISOTROPIC
Info: SURFACE TENSION CONTROL ACTIVE
Info: TARGET SURFACE TENSION IS 55 DYN/CM
Info: PARTICLE MESH EWALD (PME) ACTIVE
Info: PME TOLERANCE 1e-09
Info: PME EWALD COEFFICIENT 0.33586
Info: PME INTERPOLATION ORDER 6
Info: PME GRID DIMENSIONS 60 60 200
Info: Attempting to read FFTW data from FFTW_NAMD_2.5_Linux-amd64-Clustermatic.txt
Info: Optimizing 6 FFT steps. 1... 2... 3... 4... 5... 6... Done.
Info: Writing FFTW data to FFTW_NAMD_2.5_Linux-amd64-Clustermatic.txt
Info: FULL ELECTROSTATIC EVALUATION FREQUENCY 2
Info: USING VERLET I (r-RESPA) MTS SCHEME.
Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
Info: RIGID BONDS TO HYDROGEN : ALL
Info: ERROR TOLERANCE : 1e-08
Info: MAX ITERATIONS : 100
Info: RIGID WATER USING SETTLE ALGORITHM
Info: NONBONDED FORCES EVALUATED EVERY 2 STEPS
Info: RANDOM NUMBER SEED 12345
Info: USE HYDROGEN BONDS? NO
Info: COORDINATE PDB run_0017.pdb
Info: STRUCTURE FILE dppcwat_monosysf2.psf
Info: PARAMETER file: CHARMM format!
Info: PARAMETERS par_all22_prot_lipmod.inp
Info: Got 1608 excluded pressure atoms.BONDS
Info: 412 ANGLES
Info: 491 DIHEDRAL
Info: 43 IMPROPER
Info: 73 VDW
Info: 5 VDW_PAIRS
Warning: Ignored 2984 bonds with zero force constants.
Warning: Will get H-H distance in rigid H2O from H-O-H angle.
Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 13648 ATOMS
Info: 10612 BONDS
Info: 11984 ANGLES
Info: 12564 DIHEDRALS
Info: 72 IMPROPERS
Info: 0 EXCLUSIONS
Info: 1608 FIXED ATOMS
Info: 11832 RIGID BONDS
Info: 1608 RIGID BONDS BETWEEN FIXED ATOMS
Info: 25896 DEGREES OF FREEDOM
Info: 4800 HYDROGEN GROUPS
Info: 536 HYDROGEN GROUPS WITH ALL ATOMS FIXED
Info: TOTAL MASS = 80651.5 amu
Info: TOTAL CHARGE = 4.02331e-07 e
Info: *****************************
Info: Got 1608 excluded pressure atoms.
Info: Entering startup phase 0 with 12190 kB of memory in use.
Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 13648 ATOMS
Info: 10612 BONDS
Info: 11984 ANGLES
Info: 12564 DIHEDRALS
Info: 72 IMPROPERS
Info: 0 EXCLUSIONS
Info: 1608 FIXED ATOMS
Info: 11832 RIGID BONDS
Info: 1608 RIGID BONDS BETWEEN FIXED ATOMS
Info: 25896 DEGREES OF FREEDOM
Info: 4800 HYDROGEN GROUPS
Info: 536 HYDROGEN GROUPS WITH ALL ATOMS FIXED
Info: TOTAL MASS = 80651.5 amu
Info: TOTAL CHARGE = 4.02331e-07 e
Info: *****************************
Info: Entering startup phase 0 with 12190 kB of memory in use.
Info: Entering startup phase 1 with 12181 kB of memory in use.
Info: Entering startup phase 1 with 12181 kB of memory in use.
Info: Entering startup phase 2 with 17417 kB of memory in use.
Info: Entering startup phase 2 with 17417 kB of memory in use.
Info: Entering startup phase 3 with 17524 kB of memory in use.
Info: Entering startup phase 3 with 17524 kB of memory in use.
Info: PATCH GRID IS 2 (PERIODIC) BY 2 (PERIODIC) BY 10 (PERIODIC)
Info: PATCH GRID IS 2 (PERIODIC) BY 2 (PERIODIC) BY 10 (PERIODIC)
Info: REMOVING COM VELOCITY 0.0588433 -0.0529249 -0.0782718
Info: REMOVING COM VELOCITY 0.0588433 -0.0529249 -0.0782718
Info: LARGEST PATCH (10) HAS 999 ATOMS
Info: LARGEST PATCH (10) HAS 999 ATOMS
Info: Entering startup phase 4 with 21492 kB of memory in use.
Info: Entering startup phase 4 with 21492 kB of memory in use.
Info: PME using 8 and 8 processors for FFT and reciprocal sum.
Info: PME using 8 and 8 processors for FFT and reciprocal sum.
Info: PME GRID LOCATIONS: 0 1 2 3 4 5 6 7
Info: PME TRANS LOCATIONS: 0 1 2 3 4 5 6 7
Info: PME GRID LOCATIONS: 0 1 2 3 4 5 6 7
Info: PME TRANS LOCATIONS: 0 1 2 3 4 5 6 7
Info: Optimizing 4 FFT steps. 1... 2... 3... 4... Done.
Info: Optimizing 4 FFT steps. 1... 2... 3... 4... Done.
Info: Entering startup phase 5 with 22425 kB of memory in use.
Info: Entering startup phase 5 with 22425 kB of memory in use.
Info: Entering startup phase 6 with 20978 kB of memory in use.
Info: Entering startup phase 6 with 20978 kB of memory in use.
Info: Entering startup phase 7 with 20999 kB of memory in use.
Info: Entering startup phase 7 with 20999 kB of memory in use.
Info: COULOMB TABLE R-SQUARED SPACING: 0.0625
Info: COULOMB TABLE R-SQUARED SPACING: 0.0625
Info: COULOMB TABLE SIZE: 769 POINTS
Info: COULOMB TABLE SIZE: 769 POINTS
Info: NONZERO IMPRECISION IN COULOMB TABLE: 3.9443e-31 (709) 1.57772e-29 (682)
Info: NONZERO IMPRECISION IN COULOMB TABLE: 3.9443e-31 (709) 1.57772e-29 (682)
Info: NONZERO IMPRECISION IN COULOMB TABLE: 2.11758e-22 (759) 2.91168e-22 (759)
Info: NONZERO IMPRECISION IN COULOMB TABLE: 2.11758e-22 (759) 2.91168e-22 (759)
Info: Entering startup phase 8 with 22099 kB of memory in use.
Info: Entering startup phase 8 with 22099 kB of memory in use.
Info: Finished startup with 22484 kB of memory in use.
Info: Finished startup with 22484 kB of memory in use.
PRESSURE: 0 -4483 25.3851 -1039.88 66.2416 -4581.91 738.936 -199.198 482.616 -4264.61
GPRESSURE: 0 -35.367 106.524 -1115.35 173.727 -280.875 652.35 -187.755 289.251 -154.32
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP TOTAL2 TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG

PRESSURE: 0 -4483 25.3851 -1039.88 66.2416 -4581.91 738.936 -199.198 482.616 -4264.61
GPRESSURE: 0 -35.367 106.524 -1115.35 173.727 -280.875 652.35 -187.755 289.251 -154.32
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP TOTAL2 TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG

ENERGY: 0 497.4372 2945.9299 697.7718 26.8610 -65990.2474 457.9247 0.0000 0.0000 8213.8267 -53150.4962 319.2297 -53065.4816 -53065.4816 319.2297 -4443.1718 -156.8539 365061.0449 -4443.1718 -156.8539

ENERGY: 0 497.4372 2945.9299 697.7718 26.8610 -65990.2474 457.9247 0.0000 0.0000 8213.8267 -53150.4962 319.2297 -53065.4816 -53065.4816 319.2297 -4443.1718 -156.8539 365061.0449 -4443.1718 -156.8539

OPENING EXTENDED SYSTEM TRAJECTORY FILE
OPENING EXTENDED SYSTEM TRAJECTORY FILE
Info: Initial time: 8 CPUs 0.156147 s/step 0.903626 days/ns 40608 kB memory
LDB: LOAD: AVG 1.68147 MAX 2.42618 MSGS: TOTAL 152 MAXC 28 MAXP 7 None
LDB: LOAD: AVG 1.68147 MAX 2.42618 MSGS: TOTAL 152 MAXC 28 MAXP 7 None
LDB: LOAD: AVG 1.68147 MAX 1.80888 MSGS: TOTAL 152 MAXC 28 MAXP 7 Alg7
LDB: LOAD: AVG 1.68147 MAX 1.80888 MSGS: TOTAL 152 MAXC 28 MAXP 7 Alg7
LDB: LOAD: AVG 1.68147 MAX 1.71406 MSGS: TOTAL 152 MAXC 28 MAXP 7 Alg7
LDB: LOAD: AVG 1.68147 MAX 1.71406 MSGS: TOTAL 152 MAXC 28 MAXP 7 Alg7
