Re: scalability problem on linux cluster

From: Ruchi Sachdeva (ruchi.namd_at_gmail.com)
Date: Wed Nov 12 2008 - 10:14:49 CST

Hi Giacomo,

If I don't use ++local and run the job on 4 CPUs, then I get the following
error in the log file:

connect to address 127.0.0.1: No route to host
connect to address 127.0.0.1: No route to host
connect to address 127.0.0.1: No route to host
connect to address 127.0.0.1: No route to host
trying normal rsh (/usr/bin/rsh)
connect to address 127.0.0.1: No route to host
trying normal rsh (/usr/bin/rsh)
localhost.localdomain: No route to host
Charmrun> Error 1 returned from rsh (localhost:0)
No route to host
localhost.localdomain: No route to host
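
If I am reading the Charm++ notes correctly, this looks like charmrun falling
back to rsh on localhost because I never gave it a nodelist file. What it
seems to expect (the host names below are only placeholders for the nodes the
job actually gets) is something like:

    group main
    host n98
    host n99

and the run would then be started with ++nodelist pointing at that file
instead of ++local.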

And with ++local, the log file reports the number of processors on which I
launched the job, like this:

Info: Based on Charm++/Converse 50900 for net-linux-tcp-iccstatic
Info: Built Wed Aug 30 13:00:33 CDT 2006 by jim on verdun.ks.uiuc.edu
Info: 1 NAMD 2.6 Linux-i686-TCP 4 n98 rsachdeva
Info: Running on 4 processors.

So that means the job is being distributed across the right number of
processors, doesn't it? Am I understanding this correctly?

Well, thanks for your reply

Ruchi

On 11/12/08, Giacomo Fiorin <gfiorin_at_seas.upenn.edu> wrote:
>
> Hi Ruchi, if you use ++local, you'll keep running only on the first
> node. You actually create N processes, but they always get distributed
> among only two processors.
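>
> To actually spread the job across nodes you'd drop ++local and give
> charmrun a nodelist built from the hosts LSF assigned to the job. Just a
> rough sketch (assuming LSF exports LSB_HOSTS as it usually does, that your
> nodes use ssh rather than rsh, and with the full paths to charmrun/namd2
> abbreviated):
>
>     #!/bin/bash
>     # build a charmrun nodelist from the hosts LSF allocated to this job
>     echo "group main" > nodelist.$LSB_JOBID
>     for h in $LSB_HOSTS; do
>         echo "host $h" >> nodelist.$LSB_JOBID
>     done
>     # launch namd2 on those hosts instead of on localhost
>     charmrun namd2 +p4 ++nodelist nodelist.$LSB_JOBID ++remote-shell ssh \
>         apoa1.namd > apoa1.log
>
> Without ++local and without a nodelist, charmrun falls back to rsh on
> localhost, which is the "No route to host" error you quoted.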
>
> Giacomo
>
>
> ---- -----
> Giacomo Fiorin
> Center for Molecular Modeling at
> University of Pennsylvania
> 231 S 34th Street, Philadelphia, PA 19104-6323
> phone: (+1)-215-573-4773
> fax: (+1)-215-573-6233
> mobile: (+1)-267-324-7676
> mail: giacomo.fiorin_<at>_gmail.com
> web: http://www.cmm.upenn.edu/
> ---- ----
>
>
>
>
> On Wed, Nov 12, 2008 at 9:28 AM, Ruchi Sachdeva <ruchi.namd_at_gmail.com>
> wrote:
> > Dear All,
> >
> > I am using NAMD 2.6 (precompiled binaries) on a 288-node Linux (x86_64)
> > cluster based on HP Intel Xeon ProLiant systems, with a 10 Gbps InfiniBand
> > interconnect. I ran the apoA1 test job on different numbers of processors
> > as follows:
> >
> > /nfshomen278/rsachdeva/NAMD_2.6_Linux-i686-TCP/charmrun \
> >     /nfshomen278/rsachdeva/NAMD_2.6_Linux-i686-TCP/namd2 ++local +p2 \
> >     apoa1.namd >> apoa1.log &
> >
> > The jobs were submitted using the bsub command. I got the following speeds:
> >
> > Benchmark time: 1 CPUs 3.12916 s/step 36.2171 days/ns
> >
> > Benchmark time: 2 CPUs 1.62206 s/step 18.7738 days/ns
> >
> > Benchmark time: 4 CPUs 1.65563 s/step 19.1624 days/ns
> >
> > Benchmark time: 8 CPUs 1.64875 s/step 19.0828 days/ns
> >
> > Benchmark time: 16 CPUs 1.67945 s/step 19.4381 days/ns
> >
> > As we can see, CPU efficiency does not increase beyond 2 CPUs. With 4 or
> > more CPUs the runtime hardly decreases; in fact it goes up slightly at 4
> > and 16 CPUs. Can anybody please tell me why I am getting such poor
> > performance with larger numbers of CPUs?
> >
> > Would I gain better scalability if I compiled NAMD on the cluster rather
> > than using the precompiled binaries? And which version of NAMD would be
> > better: Charm++-based or MPI-based?
> >
> > Thanks in advance
> >
> > Ruchi
> >
> >
>
