Re: Trouble with NAMD on Myrinet

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Tue Mar 15 2005 - 14:14:14 CST

Hi,

Have you tried the steps described on the NAMD wiki?

http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdOnMyrinet

Gengbin

Edward Patrick Obrien wrote:

> Hi All,
> We compiled NAMD for Myrinet. Sometimes it runs but does not behave
> correctly, and other times it dies completely (the error output is
> listed at the end of this message). Has anyone gotten NAMD-2.5 to work
> with Myrinet? Here's some info:
>
> I built NAMD as follows (after setting up all the files describing
> where plugins, TCL, etc. are, and editing conv-mach.sh with our
> correct MPICH compilers):
>
> cd charm
> ./build charm++ mpi-linux -O -DCMK_OPTIMIZE=1
> cd ..
> ./config tcl fftw plugins Linux-i686-MPI
>
> This creates the namd2 executable in the Linux-i686-MPI directory. I
> then run it via qsub. One problem seems to be that the assertion on
> line 815 of src/arch/mpi/machine.c gets triggered. I split it into
> two assertions to test, and the failing condition is
> (startpe<Cmi_numpes). I had the program print out startpe (which, based
> on skimming the source, seems to be the MPI ID of the root process;
> Cmi_numpes is the total number of processes), and startpe is HUGE
> (16535), which suggests that the value is getting corrupted somewhere.
> I tried just resetting it to 0 when that occurs, just to test, and
> that didn't help (as I more or less figured).
>
> This may or may not have to do with the errors I have outlined below,
> but maybe someone on the NAMD list knows better.
>
> Here are the 2 types of errors I get:
>
> Error type 1:
>
> FATAL ERROR 17 on MPI node 2 (n13): the GM port on MPI node 0 (n12) is
> closed, i.e. the process has not started, has exited or is dead
> Small/Ctrl message completion error!
> FATAL ERROR 17 on MPI node 3 (n13): the GM port on MPI node 0 (n12) is
> closed, i.e. the process has not started, has exited or is dead
>
> Error type 2:
>
> CCS: Unknown CCS handler name '' requested. Ignoring...
> CCS: Unknown CCS handler name '' requested. Ignoring...
>
> These errors appear after finishing the startup phase:
>
> "Info: Finished startup with 50830 kB of memory in use."
>
>
> System info:
>
> Linux cluster, Myrinet interconnect.
>
>
> pbs file:
>
> #!/bin/csh
> #PBS -r n
> #PBS -m b
> #PBS -m e
> #PBS -k eo
> #PBS -l nodes=2:ppn=2
> echo Running on host `hostname`
> echo Time is `date`
> echo Directory is `pwd`
> echo This job runs on the following processors:
> echo `cat $PBS_NODEFILE`
> cp $PBS_NODEFILE pbs_nodefile
> set NPROCS = `wc -l < $PBS_NODEFILE`
> echo This job has allocated $NPROCS nodes
>
> set dir1 = /v/apps/mpich-1.2.5..12_04_01_2004/bin
> #set dir1 = /v/apps/mpich-gm-gnu/bin
> set dir2 = /v/estor3/home/edobrien/NAMD-tim
> set nodelist = /v/estor3/home/edobrien/Projects/nodelist
>
> cd $PBS_O_WORKDIR
>
> $dir1/mpirun -machinefile $nodelist -np 4 $dir2/namd2
> myrinet_test.namd >& myrinet_test_13.log
>
>
> Thanks,
> Ed
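
The quoted PBS script passes a fixed machinefile ($nodelist) and a fixed -np 4 to mpirun instead of using $PBS_NODEFILE, so the hosts mpirun contacts may differ from the hosts PBS actually allocated. Whether that has anything to do with the errors above is not established. The following is only a minimal sketch for checking it, written in POSIX sh rather than the csh of the job script. The function name check_machinefile is hypothetical, and the sketch assumes the host name is the first field on each machinefile line.

```shell
#!/bin/sh
# Hypothetical helper; not part of NAMD, Charm++, MPICH or PBS.
# Usage: check_machinefile PBS_NODEFILE MACHINEFILE
# Prints every host listed in MACHINEFILE that does not appear in
# PBS_NODEFILE, and returns non-zero if at least one such host exists.
check_machinefile() {
    pbs_nodes=$1
    machinefile=$2
    status=0
    # Assumption: the host name is the first whitespace-separated field;
    # blank lines and lines starting with '#' are skipped.
    for host in $(awk 'NF && $1 !~ /^#/ { print $1 }' "$machinefile" | sort -u); do
        if ! awk 'NF { print $1 }' "$pbs_nodes" | grep -qxF -- "$host"; then
            echo "host $host is in $machinefile but not allocated by PBS"
            status=1
        fi
    done
    return $status
}
```

With the paths from the job script, this could be run as check_machinefile "$PBS_NODEFILE" /v/estor3/home/edobrien/Projects/nodelist before the mpirun line. A non-zero return means the machinefile names at least one host outside the current allocation.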
