Trouble with NAMD on Myrinet

From: Edward Patrick Obrien (edobrien_at_Glue.umd.edu)
Date: Tue Mar 15 2005 - 14:02:40 CST

Hi All,
   We compiled NAMD for Myrinet. Sometimes it runs, but not correctly,
and other times it dies completely (the output errors are listed at the
end of this message). Has anyone gotten NAMD-2.5 to work with Myrinet?
Here's some info:

I build NAMD as follows (once I've set up all the files describing where
plugins, TCL, etc. are, and edited conv-mach.sh to use our correct MPICH
compilers):

cd charm
./build charm++ mpi-linux -O -DCMK_OPTIMIZE=1
cd ..
./config tcl fftw plugins Linux-i686-MPI

This creates the namd2 executable in the Linux-i686-MPI directory. I then
run it via qsub. One problem seems to be that the assertion on line 815 of
src/arch/mpi/machine.c gets triggered. I broke it into two assertions to
test, and the problem is that (startpe<Cmi_numpes) is FALSE. I had the
program print out startpe (which, based on skimming the source, seems to be
the MPI ID of the root process; Cmi_numpes is the total number of
processes) and startpe is HUGE (16535), which suggests that the value is
getting corrupted somewhere. I tried just resetting it to 0 when that
occurs, just to test, and that didn't help (as I more or less figured).
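
For anyone who wants to repeat the check described above, here is a rough, standalone sketch of splitting a combined range assertion in two and printing the value first. It is not the actual machine.c code: the function names and the enum are made up for illustration, and the lower-bound test (startpe >= 0) is only my assumption about what the other half of the original assertion checks.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical names for illustration only; none of this is taken
 * from the real src/arch/mpi/machine.c. */
enum range_status { RANGE_OK = 0, RANGE_NEGATIVE = 1, RANGE_TOO_LARGE = 2 };

/* Classify startpe against the valid range [0, numpes).
 * ASSUMPTION: the original combined assertion tests both bounds. */
enum range_status classify_startpe(int startpe, int numpes)
{
    if (startpe < 0)
        return RANGE_NEGATIVE;
    if (startpe >= numpes)
        return RANGE_TOO_LARGE;
    return RANGE_OK;
}

/* Print the offending values, then assert each bound separately so the
 * failure message names the bound that was actually violated. */
void check_startpe(int startpe, int numpes)
{
    if (classify_startpe(startpe, numpes) != RANGE_OK) {
        fprintf(stderr, "bad startpe=%d (numpes=%d)\n", startpe, numpes);
        fflush(stderr); /* make sure the message is out before abort() */
    }
    assert(startpe >= 0);
    assert(startpe < numpes);
}
```

Calling check_startpe(16535, 4) would print the values and then abort on the second assertion, which matches what I see with 4 processes.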

This may or may not have to do with the errors I have outlined below, but
maybe someone on the NAMD list knows better.

Here are the 2 types of errors I get:

Error type 1:

FATAL ERROR 17 on MPI node 2 (n13): the GM port on MPI node
0 (n12) is closed, i.e. the process has not started, has
exited or is dead
Small/Ctrl message completion error!
FATAL ERROR 17 on MPI node 3 (n13): the GM port on MPI node
0 (n12) is closed, i.e. the process has not started, has
exited or is dead

Error type 2:

CCS: Unknown CCS handler name '' requested. Ignoring...
CCS: Unknown CCS handler name '' requested. Ignoring...

These errors appear after finishing the startup phase:

"Info: Finished startup with 50830 kB of
memory in use."

System info:

Linux cluster, Myrinet interconnect.

pbs file:

#!/bin/csh
#PBS -r n
#PBS -m b
#PBS -m e
#PBS -k eo
#PBS -l nodes=2:ppn=2
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This jobs runs on the following processors:
echo `cat $PBS_NODEFILE`
cp $PBS_NODEFILE pbs_nodefile
set NPROCS = `wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS nodes

set dir1 = /v/apps/mpich-1.2.5..12_04_01_2004/bin
#set dir1 = /v/apps/mpich-gm-gnu/bin
set dir2 = /v/estor3/home/edobrien/NAMD-tim
set nodelist = /v/estor3/home/edobrien/Projects/nodelist

cd $PBS_O_WORKDIR

$dir1/mpirun -machinefile $nodelist -np 4 $dir2/namd2 myrinet_test.namd >& myrinet_test_13.log

Thanks,
Ed
