The NAMD job does not go forward

From: Fabio Passetti (passetti_at_inca.gov.br)
Date: Fri Jan 05 2007 - 12:43:13 CST

Dear NAMD users,

I have just installed a new cluster of 9 servers, each with 2 dual-core
Intel Xeon 3.0 GHz processors and 4 GB of RAM. I am using openSUSE 10.2. I
also tried SuSE Linux Enterprise Server 10, but nothing changed.

I tried to submit a very simple NAMD job of a kind I have run before, and I
ran into a problem. When I run the job locally it runs fine, but when I run
it on 2 or more servers (regardless of the number of processes requested,
e.g. ++p 6 or ++p 10) it freezes when NAMD starts to allocate memory. I
tried swapping servers to see whether the problem was tied to a particular
machine, but it persists no matter which servers, or how many of them, I
use. I can log in to all the servers with rsh without a password.

Charmrun> charmrun started...
Charmrun> using /home/cluster/Protein/namd/nodelist as nodesfile
Charmrun> rsh (lbbc02:0) started
Charmrun> rsh (lbbc02:1) started
Charmrun> rsh (lbbc03:2) started
Charmrun> rsh (lbbc03:3) started
Charmrun> node programs all started
Charmrun rsh(lbbc03.2)> remote responding...
Charmrun rsh(lbbc02.1)> starting node-program...
Charmrun rsh(lbbc02.1)> rsh phase successful.
Charmrun rsh(lbbc03.3)> starting node-program...
Charmrun rsh(lbbc03.3)> rsh phase successful.
Charmrun rsh(lbbc03.2)> starting node-program...
Charmrun rsh(lbbc03.2)> rsh phase successful.
Charmrun> node programs all connected
Charmrun> adding client 0: "lbbc02", IP:10.46.11.53
Charmrun> adding client 1: "lbbc02", IP:10.46.11.53
Charmrun> adding client 2: "lbbc03", IP:127.0.0.1
Charmrun> adding client 3: "lbbc03", IP:127.0.0.1
Charmrun> Charmrun = lbbc03, port = 50639
Charmrun> Sending "0 lbbc03 50639 25145 0" to client 0.
Charmrun> find the node program "/home/cluster/Protein/namd/namd2" at
"/home/cluster/Protein/namd" for 0.
Charmrun> Starting rsh lbbc02 -l cluster /bin/sh -f
Charmrun> Sending "1 lbbc03 50639 25145 0" to client 1.
Charmrun> find the node program "/home/cluster/Protein/namd/namd2" at
"/home/cluster/Protein/namd" for 1.
Charmrun> Starting rsh lbbc02 -l cluster /bin/sh -f
Charmrun> Sending "2 lbbc03 50639 25145 0" to client 2.
Charmrun> find the node program "/home/cluster/Protein/namd/namd2" at
"/home/cluster/Protein/namd" for 2.
Charmrun> Starting rsh lbbc03 -l cluster /bin/sh -f
Charmrun> Sending "3 lbbc03 50639 25145 0" to client 3.
Charmrun> find the node program "/home/cluster/Protein/namd/namd2" at
"/home/cluster/Protein/namd" for 3.
Charmrun> Starting rsh lbbc03 -l cluster /bin/sh -f
Charmrun> waiting for rsh (lbbc02:0), pid 25146
Charmrun> waiting for rsh (lbbc02:1), pid 25147
Charmrun> waiting for rsh (lbbc03:2), pid 25148
Charmrun> waiting for rsh (lbbc03:3), pid 25150
Charmrun> Waiting for 0-th client to connect.
Charmrun> client 0 connected (IP=10.46.11.53 data_port=32785)
Charmrun> Waiting for 1-th client to connect.
Charmrun> client 1 connected (IP=10.46.11.53 data_port=32786)
Charmrun> Waiting for 2-th client to connect.
Charmrun> client 2 connected (IP=127.0.0.1 data_port=32787)
Charmrun> Waiting for 3-th client to connect.
Charmrun> client 3 connected (IP=127.0.0.1 data_port=32788)
Charmrun> All clients connected.
Charmrun> IP tables sent.
Info: NAMD 2.6 for Linux-amd64
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: and send feedback or bug reports to namd_at_ks.uiuc.edu
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 50900 for net-linux-amd64-iccstatic
Info: Built Wed Aug 30 12:54:51 CDT 2006 by jim on belfast.ks.uiuc.edu
Info: 1 NAMD 2.6 Linux-amd64 4 lbbc02 cluster
Info: Running on 4 processors.
Info: 7608 kB of memory in use.
Info: Memory usage based on mallinfo
Info: Changed directory to /home/cluster/Protein/namd/protocols
Info: Configuration file is MD-2ns.namd
TCL: Suspending until startup complete.
Info: SIMULATION PARAMETERS:
Info: TIMESTEP 2
Info: NUMBER OF STEPS 0
Info: STEPS PER CYCLE 10
Info: PERIODIC CELL BASIS 1 67.1 0 0
Info: PERIODIC CELL BASIS 2 0 72.6 0
Info: PERIODIC CELL BASIS 3 0 0 70.3
Info: PERIODIC CELL CENTER -8.4 -36.2 -28.8
Info: WRAPPING ALL CLUSTERS AROUND PERIODIC BOUNDARIES ON OUTPUT.
Info: LOAD BALANCE STRATEGY Other
Info: LDB PERIOD 2000 steps
Info: FIRST LDB TIMESTEP 50
Info: LDB BACKGROUND SCALING 1
Info: HOM BACKGROUND SCALING 1
Info: PME BACKGROUND SCALING 1
Info: MAX SELF PARTITIONS 50
Info: MAX PAIR PARTITIONS 20
Info: SELF PARTITION ATOMS 125
Info: PAIR PARTITION ATOMS 200
Info: PAIR2 PARTITION ATOMS 400
Info: MIN ATOMS PER PATCH 100
Info: INITIAL TEMPERATURE 310
Info: CENTER OF MASS MOVING INITIALLY? NO
Info: DIELECTRIC 1
Info: EXCLUDE SCALED ONE-FOUR
Info: 1-4 SCALE FACTOR 1
Info: DCD FILENAME ../output/protein_wb-MD-2ns.dcd
Info: DCD FREQUENCY 1000
Info: DCD FIRST STEP 1000
Info: DCD FILE WILL CONTAIN UNIT CELL DATA
Info: XST FILENAME ../output/protein_wb-MD-2ns.xst
Info: XST FREQUENCY 1000
Info: NO VELOCITY DCD OUTPUT
Info: OUTPUT FILENAME ../output/protein_wb-MD-2ns
Info: RESTART FILENAME ../output/protein_wb-MD-2ns.restart
Info: RESTART FREQUENCY 1000
Info: BINARY RESTART FILES WILL BE USED
Info: SWITCHING ACTIVE
Info: SWITCHING ON 10
Info: SWITCHING OFF 12
Info: PAIRLIST DISTANCE 13.5
Info: PAIRLIST SHRINK RATE 0.01
Info: PAIRLIST GROW RATE 0.01
Info: PAIRLIST TRIGGER 0.3
Info: PAIRLISTS PER CYCLE 2
Info: PAIRLISTS ENABLED
Info: MARGIN 2.5
Info: HYDROGEN GROUP CUTOFF 2.5
Info: PATCH DIMENSION 18.5
Info: ENERGY OUTPUT STEPS 100
Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
Info: TIMING OUTPUT STEPS 100
Info: PRESSURE OUTPUT STEPS 100
Info: LANGEVIN DYNAMICS ACTIVE
Info: LANGEVIN TEMPERATURE 310
Info: LANGEVIN DAMPING COEFFICIENT IS 5 INVERSE PS
Info: LANGEVIN DYNAMICS NOT APPLIED TO HYDROGENS
Info: LANGEVIN PISTON PRESSURE CONTROL ACTIVE
Info: TARGET PRESSURE IS 1.01325 BAR
Info: OSCILLATION PERIOD IS 100 FS
Info: DECAY TIME IS 50 FS
Info: PISTON TEMPERATURE IS 310 K
Info: PRESSURE CONTROL IS GROUP-BASED
Info: INITIAL STRAIN RATE IS 0 0 0
Info: CELL FLUCTUATION IS ISOTROPIC
Info: PARTICLE MESH EWALD (PME) ACTIVE
Info: PME TOLERANCE 1e-06
Info: PME EWALD COEFFICIENT 0.257952
Info: PME INTERPOLATION ORDER 4
Info: PME GRID DIMENSIONS 72 75 72
Info: PME MAXIMUM GRID SPACING 1.5
Info: Attempting to read FFTW data from FFTW_NAMD_2.6_Linux-amd64.txt
Info: Optimizing 6 FFT steps. 1... 2... 3... 4... 5... 6... Done.
Info: Writing FFTW data to FFTW_NAMD_2.6_Linux-amd64.txt
Info: FULL ELECTROSTATIC EVALUATION FREQUENCY 2
Info: USING VERLET I (r-RESPA) MTS SCHEME.
Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
Info: RIGID BONDS TO HYDROGEN : ALL
Info: ERROR TOLERANCE : 1e-08
Info: MAX ITERATIONS : 100
Info: RIGID WATER USING SETTLE ALGORITHM
Info: RANDOM NUMBER SEED 1168020387
Info: USE HYDROGEN BONDS? NO
Info: COORDINATE PDB ../output/protein_wb-MM.coor
Info: STRUCTURE FILE ../procdata/protein_wb.psf
Info: PARAMETER file: CHARMM format!
Info: PARAMETERS ../param/par_all22_prot.inp
Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS
Info: SUMMARY OF PARAMETERS:
Info: 138 BONDS
Info: 341 ANGLES
Info: 443 DIHEDRAL
Info: 43 IMPROPER
Info: 0 CROSSTERM
Info: 65 VDW
Info: 0 VDW_PAIRS
Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 31728 ATOMS
Info: 21797 BONDS
Info: 13406 ANGLES
Info: 5106 DIHEDRALS
Info: 297 IMPROPERS
Info: 0 CROSSTERMS
Info: 0 EXCLUSIONS
Info: 30800 RIGID BONDS
Info: 64384 DEGREES OF FREEDOM
Info: 10877 HYDROGEN GROUPS
Info: TOTAL MASS = 192513 amu
Info: TOTAL CHARGE = 8 e
Info: *****************************
Info: Entering startup phase 0 with 13268 kB of memory in use.

When NAMD gets here it stops making progress. The job stays there, and even
if I leave it for hours it does not die. All the processors are idle and
/var/log/messages does not show any error.
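
One detail in the output above is that Charmrun registers the two lbbc02
clients with IP 10.46.11.53 but the two lbbc03 clients with IP 127.0.0.1.
Whether this is related to the hang is only an assumption, not something
the log confirms. A minimal Python sketch for flagging such entries in a
saved Charmrun log follows; the function name loopback_clients and the
SAMPLE text are hypothetical and not part of NAMD or Charm++.

```python
import ipaddress
import re

# Matches Charmrun lines such as:
#   Charmrun> adding client 2: "lbbc03", IP:127.0.0.1
CLIENT_LINE = re.compile(r'adding client (\d+): "([^"]+)", IP:(\S+)')

# Hypothetical sample: the four "adding client" lines copied from the log.
SAMPLE = """\
Charmrun> adding client 0: "lbbc02", IP:10.46.11.53
Charmrun> adding client 1: "lbbc02", IP:10.46.11.53
Charmrun> adding client 2: "lbbc03", IP:127.0.0.1
Charmrun> adding client 3: "lbbc03", IP:127.0.0.1
"""


def loopback_clients(log_text):
    """Return (client_id, host, ip) for each client listed with a loopback IP."""
    flagged = []
    for match in CLIENT_LINE.finditer(log_text):
        client_id, host, ip = match.groups()
        # ip_address() parses the string; is_loopback is true for 127.0.0.0/8.
        if ipaddress.ip_address(ip).is_loopback:
            flagged.append((int(client_id), host, ip))
    return flagged


if __name__ == "__main__":
    for client_id, host, ip in loopback_clients(SAMPLE):
        print(f"client {client_id}: {host} is registered as {ip}")
```

If a host is flagged, comparing how its name resolves on each node (for
example in /etc/hosts) may be a reasonable next thing to look at.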

Any help is appreciated.

Regards,

Fabio Passetti
