Re: mpi problems on opteron

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Wed Jul 27 2005 - 00:08:32 CDT

Charm++ can be built on top of MPI, and the NAMD binary in this
version is just like a normal MPI binary. So one first has to
configure the MPI environment so that it can run any MPI job, such as
a simple hello program.
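
For example, compiling and running any minimal MPI hello program is a
quick sanity check (hello.c is just a placeholder name; mpicc and
mpirun are the standard MPICH-style commands, and the p4_error in your
output below suggests you are using MPICH):

mpicc -o hello hello.c
mpirun -np 2 ./hello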

The MPI version of NAMD ignores the nodelist file. Instead, just as
for a normal MPI program, you need to write a machinefile that simply
lists all the compute nodes, and provide it via a command line option:

mpirun -machinefile ./mymachinefile -np 8 ./namd2 ...

You don't have to run with charmrun; in the MPI version, charmrun is
just a simple shell script wrapper that calls mpirun.
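
The machinefile itself is just a plain text file with one hostname per
line, for example (these names are placeholders; use your own node
names):

node001
node002
node003
node004
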
If you plan to run the job under a job scheduler like PBS, you may
have to do it differently; refer to the scheduler's documentation for
how to run an MPI job. A sketch of a PBS script is shown below.
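
For instance, a minimal PBS script might look like this (only a
sketch; the resource request, the input file name, and the namd2 path
are assumptions for illustration):

#!/bin/sh
#PBS -l nodes=4:ppn=2
# PBS writes the list of allocated hosts to $PBS_NODEFILE
cd $PBS_O_WORKDIR
mpirun -machinefile $PBS_NODEFILE -np 8 ./namd2 sim.conf > sim.log

You would then submit the script with qsub.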

One then needs to configure ssh so that you can log in to those
compute nodes without typing a password. To do that, you just need to
put your ssh public key into ~/.ssh/authorized_keys. Try a command
like: ssh compute_node ls
to make sure it actually works.
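
A minimal sketch of the key setup, assuming OpenSSH and a /home that
is shared by all nodes (if it is not shared, append the public key to
~/.ssh/authorized_keys on every node):

ssh-keygen -t rsa                  # press Enter for an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys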

Anyway, this is all just the standard procedure for configuring MPI;
I believe you can find plenty of information on the MPI homepage.

Gengbin

Leandro Martínez wrote:

>No, we are running namd with charm++ and the performance is
>really good, at least for my simulations. We have tried to run
>gromacs with mpi and we could not obtain the same scalability.
>If charm++ uses mpi or something like that and my answer was
>stupid, I'm sorry.
>Leandro.
>
>
>On 7/25/05, Kyle Gustafson <kgustaf_at_umd.edu> wrote:
>
>
>>Leandro,
>>
>>Thanks for the reply. I've not decided between ssh and rsh
>>yet. Do you run with MPI?
>>
>>Kyle
>>
>>---- Original message ----
>>
>>
>>>Date: Mon, 25 Jul 2005 21:00:04 -0300
>>>From: Leandro Martínez <leandromartinez98_at_gmail.com>
>>>Subject: Re: namd-l: mpi problems on opteron
>>>To: Kyle Gustafson <kgustaf_at_umd.edu>
>>>Cc: namd-l_at_ks.uiuc.edu
>>>
>>>Hi Kyle,
>>>We have a cluster similar to yours, but running fedora. Probably the
>>>problem is that you need to set ssh to be used without passwords
>>>between the nodes. We are actually using rsh on our nodes instead
>>>because it was easier to configure. You need to put in your
>>>home directory a file named .rhosts containing
>>>
>>>143.106.51.147 username
>>>127.0.0.1 username
>>>192.168.0.100 username
>>>192.168.0.101 username
>>>192.168.0.102 username
>>>.
>>>.
>>>
>>>and this file should have its permissions changed by
>>>
>>>chmod og-rwx .rhosts
>>>
>>>This file must be in your home directory on all nodes (in our
>>>case all nodes share the same /home, so it was simpler).
>>>
>>>You can search for better documentation on the web on that;
>>>I'm not quite an expert on this subject, I only did what was
>>>necessary to get namd running.
>>>
>>>Leandro.
>>>
>>>
>>>
>>>--------------------------------------------------------------------
>>>Leandro Martinez
>>>Institute of Chemistry
>>>State University of Campinas
>>>http://www.ime.unicamp.br/~martinez/packmol
>>>--------------------------------------------------------------------
>>>
>>>
>>>
>>>On 7/25/05, Kyle Gustafson <kgustaf_at_umd.edu> wrote:
>>>
>>>
>>>>Hi all,
>>>>
>>>>I have an 18 opteron cluster running SuSE 2.4.21-143-numa
>>>>I'm trying to install NAMD, which requires me to install charm++.
>>>>After ./build charm++ mpi-linux-amd64 -nobs -O -DCMK_OPTIMIZE
>>>>I ran megatest. All of the one-processor tests work fine, but
>>>>with +p2 I get the error below, where it looks like
>>>>charmrun is unable to use ssh. I can ssh back and forth from
>>>>any one node to any other, so I don't understand how this
>>>>problem could occur, because I don't know enough about ssh and
>>>>charm++. It seems like charm++ doesn't have access to the ssh
>>>>keys, but this seems crazy. My .nodelist file reads as follows,
>>>>where head is the master and the node00x machines are slaves. The
>>>>nodelist file is located in the HOME/charm directory, but I also
>>>>tried putting .nodelist in the megatest directory.
>>>>
>>>>group main
>>>>host head ++shell ssh
>>>>host node001 ++shell ssh
>>>>host node002 ++shell ssh
>>>>host node003 ++shell ssh
>>>>host node004 ++shell ssh
>>>>host node005 ++shell ssh
>>>>host node006 ++shell ssh
>>>>host node007 ++shell ssh
>>>>host node008 ++shell ssh
>>>>
>>>>
>>>>This is the error I get when I run charmrun.
>>>>
>>>>I greatly appreciate your attention.
>>>>
>>>>
>>>>head:/home/namd2/NAMD_2.5_Source/charm/tests/charm++/megatest
>>>># ./charmrun +p2 ./pgm
>>>>
>>>>Running on 2 processors: ./pgm
>>>>26005: ssh_exchange_identification: Connection closed by
>>>>remote host
>>>>p0_26000: p4_error: Child process exited while making
>>>>connection to remote process on head: 0
>>>>
>>>>Kyle B. Gustafson
>>>>Department of Physics
>>>>University of Maryland
>>>>Box 45
>>>>082 Regents Drive
>>>>College Park, MD 20742
>>>>
>>>>
>>>>
>>Kyle B. Gustafson
>>Department of Physics
>>University of Maryland
>>College Park, MD USA
>>
>>
>>
