Re: scalability problem on linux cluster

From: Giacomo Fiorin (gfiorin_at_seas.upenn.edu)
Date: Wed Nov 12 2008 - 10:30:25 CST

No. First, 127.0.0.1 is the IP address of the local machine
connecting to itself, anyway. Without ++local, charmrun tries a
remote login with rsh (the error message is quite informative!), and
it fails.

Then, if you use ++local, you'll launch as many processES as you
want, but you always stick with the same processORS. Unfortunately,
the message "Info: Running on 4 processors." is misleading, because
it only counts processES; it doesn't know where they are running.

You should probably read notes.txt to see how to make charmrun aware
of the list of nodes in your cluster. But you also have InfiniBand,
so you'd better use it! The *-TCP binaries ignore it and use Ethernet
instead. You definitely need to compile NAMD for your cluster:

http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdOnInfiniBand
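
For reference, notes.txt describes a plain-text nodelist file that
simply lists the machines charmrun is allowed to use. A minimal
sketch, with hypothetical host names standing in for your cluster's
nodes:

  group main
    host node01
    host node02
    host node03
    host node04

With such a file in place, dropping ++local and pointing charmrun at
it spreads the processes over those nodes (charmrun still needs
password-less rsh or ssh to them; ++remote-shell ssh selects ssh).
Again, just a sketch of the syntax:

  ./charmrun ./namd2 +p8 ++nodelist ./nodelist apoa1.namd > apoa1.log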

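Also, since your jobs go through bsub, LSF already tells the job
which hosts it reserved (in the LSB_HOSTS variable), so the job
script can build the nodelist on the fly. A rough, untested sketch
along those lines:

  #!/bin/sh
  # build a charmrun nodelist from the hosts LSF assigned to this job
  echo "group main" > nodelist.$LSB_JOBID
  for h in $LSB_HOSTS; do echo "  host $h" >> nodelist.$LSB_JOBID; done
  ./charmrun ./namd2 +p8 ++nodelist nodelist.$LSB_JOBID apoa1.namd > apoa1.log
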
At 11:14am, Ruchi Sachdeva wrote:

> Hi Giacomo,
>
> If I don't use ++local and run the job on 4 CPUs, then I get the following
> error in the log file:
>
> connect to address 127.0.0.1: No route to host
> connect to address 127.0.0.1: No route to host
> connect to address 127.0.0.1: No route to host
> connect to address 127.0.0.1: No route to host
> trying normal rsh (/usr/bin/rsh)
> connect to address 127.0.0.1: No route to host
> trying normal rsh (/usr/bin/rsh)
> localhost.localdomain: No route to host
> Charmrun> Error 1 returned from rsh (localhost:0)
> No route to host
> localhost.localdomain: No route to host
>
> And with ++local, the log file mentions the number of processors on which I
> launch the job, like this:
>
> Info: Based on Charm++/Converse 50900 for net-linux-tcp-iccstatic
> Info: Built Wed Aug 30 13:00:33 CDT 2006 by jim on verdun.ks.uiuc.edu
> Info: 1 NAMD  2.6  Linux-i686-TCP  4    n98  rsachdeva
> Info: Running on 4 processors.
>
> So that means the job is getting distributed over the right number of
> processors, doesn't it? Am I understanding it correctly?
>
> Well, thanks for your reply
>
> Ruchi
>
> On 11/12/08, Giacomo Fiorin <gfiorin_at_seas.upenn.edu> wrote:
> Hi Ruchi, if you use ++local, you'll keep running only on the first
> node.  You actually create N processes, but they always get distributed
> among only two processors.
>
> Giacomo
>
> On Wed, Nov 12, 2008 at 9:28 AM, Ruchi Sachdeva
> <ruchi.namd_at_gmail.com> wrote:
> > Dear All,
> >
> > I am using NAMD 2.6 (pre-compiled binaries) on a 288-node Linux (x86_64)
> > cluster based on HP Intel Xeon-based ProLiant systems. It has a 10 Gbps
> > InfiniBand cluster interconnect. I ran the apoA1 test job on different
> > numbers of processors as follows:
> >
> > /nfshomen278/rsachdeva/NAMD_2.6_Linux-i686-TCP/charmrun
> > /nfshomen278/rsachdeva/NAMD_2.6_Linux-i686-TCP/namd2 ++local +p2 apoa1.namd
> > > apoa1.log &
> >
> > The jobs were submitted using the bsub command. I got the following
> > speeds:
> >
> > Benchmark time: 1 CPUs 3.12916 s/step 36.2171 days/ns
> >
> > Benchmark time: 2 CPUs 1.62206 s/step 18.7738 days/ns
> >
> > Benchmark time: 4 CPUs 1.65563 s/step 19.1624 days/ns
> >
> > Benchmark time: 8 CPUs 1.64875 s/step 19.0828 days/ns
> >
> > Benchmark time: 16 CPUs 1.67945 s/step 19.4381 days/ns
> >
> > As you can see, the CPU efficiency is not increasing beyond 2 CPUs. With
> > 4 or more CPUs the runtime is not decreasing much; rather, it increases
> > slightly at 4 and 16 CPUs. Can anybody please tell me why I am getting
> > poor performance with a greater number of CPUs?
> >
> > Would I gain better scalability if I compiled NAMD on the cluster rather
> > than using the pre-compiled binaries? And which version of NAMD would be
> > better: the Charm++-based or the MPI-based one?
> >
> > Thanks in advance
> >
> > Ruchi
> >

-- 
---- -----
  Giacomo Fiorin
    Center for Molecular Modeling at
      University of Pennsylvania
    231 S 34th Street, Philadelphia, PA 19104-6323
    phone:   (+1)-215-573-4773
    fax:     (+1)-215-573-6233
    mobile:  (+1)-267-324-7676
    mail:    giacomo.fiorin_<at>_gmail.com
             gfiorin_<at>_seas.upenn.edu
    web:     http://www.cmm.upenn.edu/
---- ----
