Re: Re: Trouble with NAMD on Myrinet

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Wed Mar 16 2005 - 11:27:19 CST

Building Charm++ on top of MPI/GM incurs the overhead of an extra layer.
Another option is to build Charm++ directly on top of the native GM
communication library, which avoids the MPI overhead:

./build charm++ net-linux gm ...

This is supposed to be faster.
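
For example, the whole sequence might look roughly like this (a sketch
only; I am assuming the usual Linux-i686-g++ arch name, the nodelist
file you already use for the ethernet runs, and that the build can find
your GM installation, so adjust names and paths for your cluster):

./build charm++ net-linux gm -O -DCMK_OPTIMIZE=1
./config tcl fftw plugins Linux-i686-g++
cd Linux-i686-g++ ; make
./charmrun namd2 +p4 ++nodelist nodelist myrinet_test.namd

Note that the net-linux gm build is launched with charmrun and a
nodelist file, not with mpirun.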

You may also want to try a bigger data set (e.g. apoa1) to see the
difference. For small messages, even though Myrinet's latency is much
lower (on the order of microseconds), the software overhead (the CPU
time spent sending a message) could be roughly the same in both cases,
which may be why you see little advantage.
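
As a rough illustration (the numbers below are assumptions for the sake
of the example, not measurements from your cluster): if the host spends
~20 microseconds of CPU time per message and the wire latency is ~10
microseconds on ethernet versus a few microseconds on Myrinet, the
per-message cost only drops from roughly 30 to 25 microseconds, a
difference that is easy to lose in a 0.27 second timestep on a small
system.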

Gengbin

Edward Patrick Obrien wrote:

> Hi Gengbin,
> Thanks, the recommendations on the NAMD Wiki page helped. NAMD on
> myrinet now seems to work.
>
> BUT, a strange thing is occurring: there is no speedup of the
> simulation when going from ethernet parallel computing to Myrinet
> parallel computing.
> For example:
>
>                     myrinet    ethernet
> Seconds per Step    0.266      0.267
>
> The above data is for a system of ~18,000 atoms, run in parallel
> on 2 nodes, each with 2 processors. The exact same compute nodes were
> used.
>
> We checked the traffic between the compute nodes during these
> calculations: the myrinet job was communicating via myrinet and the
> ethernet job via ethernet.
>
> I use charmrun for the ethernet run and mpirun for the myrinet run.
> Some info on the myrinet NAMD installation is given below.
>
> Any ideas what could be going on?
> Thanks,
> Ed
>
>
>
> On Tue, 15 Mar 2005, Edward Patrick Obrien wrote:
>
>> Hi All,
>> We compiled NAMD for myrinet, but it only seems to work sometimes,
>> and then not correctly; other times it dies completely (the output
>> errors are listed at the end of this message). Has anyone gotten
>> NAMD-2.5 to work with Myrinet? Here's some info:
>>
>> I build NAMD as follows (once I've set up all the files describing
>> where plugins, TCL, etc. are and edited conv-mach.sh with our
>> correct MPICH compilers):
>>
>> cd charm
>> ./build charm++ mpi-linux -O -DCMK_OPTIMIZE=1
>> cd ..
>> ./config tcl fftw plugins Linux-i686-MPI
>>
>> This creates the namd2 executable in the Linux-i686-MPI directory. I
>> then run it via qsub. One problem seems to be in the
>> src/arch/mpi/machine.c file: the assertion on line 815 gets
>> triggered. I broke it into two assertions to test, and the failing
>> condition is (startpe < Cmi_numpes). I had the program print out
>> startpe (which, based on skimming the source, seems to be the MPI ID
>> of the root process; Cmi_numpes is the total number of processes),
>> and startpe is HUGE (16535), which suggests the value is getting
>> corrupted somewhere. I tried just resetting it to 0 when that occurs,
>> just to test, and that didn't help (as I more or less figured).
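>>
>> For reference, the split looked roughly like this (a sketch, not the
>> exact machine.c code around line 815; CmiPrintf/CmiAssert are the
>> Charm++ calls I used, and the variable names are from the source):
>>
>>   CmiPrintf("startpe=%d Cmi_numpes=%d\n", startpe, Cmi_numpes);
>>   CmiAssert(startpe >= 0);          /* first half of the original assert */
>>   CmiAssert(startpe < Cmi_numpes);  /* this one fails: startpe == 16535 */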
>>
>> This may or may not have to do with the errors I have outlined below,
>> but maybe someone on the NAMD list knows better.
>>
>> Here are the 2 types of errors I get:
>>
>> Error type 1:
>>
>> FATAL ERROR 17 on MPI node 2 (n13): the GM port on MPI node 0 (n12)
>> is closed, i.e. the process has not started, has exited or is dead
>> Small/Ctrl message completion error!
>> FATAL ERROR 17 on MPI node 3 (n13): the GM port on MPI node 0 (n12)
>> is closed, i.e. the process has not started, has exited or is dead
>>
>> Error type 2:
>>
>> CCS: Unknown CCS handler name '' requested. Ignoring...
>> CCS: Unknown CCS handler name '' requested. Ignoring...
>>
>> These errors appear after finishing the startup phase:
>>
>> "Info: Finished startup with 50830 kB of memory in use."
>>
>>
>> System info:
>>
>> linux cluster, myrinet connections.
>>
>>
>> pbs file:
>>
>> #!/bin/csh
>> #PBS -r n
>> #PBS -m b
>> #PBS -m e
>> #PBS -k eo
>> #PBS -l nodes=2:ppn=2
>> echo Running on host `hostname`
>> echo Time is `date`
>> echo Directory is `pwd`
>> echo This job runs on the following processors:
>> echo `cat $PBS_NODEFILE`
>> cp $PBS_NODEFILE pbs_nodefile
>> set NPROCS = `wc -l < $PBS_NODEFILE`
>> echo This job has allocated $NPROCS processors
>>
>> set dir1 = /v/apps/mpich-1.2.5..12_04_01_2004/bin
>> #set dir1 = /v/apps/mpich-gm-gnu/bin
>> set dir2 = /v/estor3/home/edobrien/NAMD-tim
>> set nodelist = /v/estor3/home/edobrien/Projects/nodelist
>>
>> cd $PBS_O_WORKDIR
>>
>> $dir1/mpirun -machinefile $nodelist -np 4 $dir2/namd2
>> myrinet_test.namd >& myrinet_test_13.log
>>
>>
>> Thanks,
>> Ed
>>
