Re: scalability problem on linux cluster

From: Ruchi Sachdeva (ruchi.namd_at_gmail.com)
Date: Sat Nov 15 2008 - 00:31:24 CST

Hi,

Thanks to all three of you for your kind suggestions. You pointed out
correctly that charmrun was not able to connect to the other nodes, and hence
the processes were running on the head node only, whatever the '+p' option was.
So I set up charmrun to connect to the nodes without typing a password.
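
In case it helps anyone with the same problem, the setup looks roughly like
this (node01 and node02 are placeholder host names for my compute nodes, and I
am showing ssh keys plus a charmrun nodelist file in the "group main" format
described in notes.txt; adjust if your cluster uses rsh instead):

   # create a key and copy it to each compute node so logins need no password
   ssh-keygen -t rsa
   ssh-copy-id rsachdeva@node01
   ssh-copy-id rsachdeva@node02

   # contents of a plain-text file called "nodelist", telling charmrun
   # which remote shell to use and which hosts to start processes on
   group main ++shell ssh
   host node01
   host node02

   # then launch namd2 over those nodes (++local is no longer needed)
   ./charmrun ./namd2 ++nodelist nodelist +p8 apoa1.namd > apoa1.log
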
With that in place, I got pretty good scalability with the ApoA1 test job:

Benchmark time: 2 CPUs 1.61969 s/step 18.7464 days/ns
Benchmark time: 4 CPUs 0.820552 s/step 9.49713 days/ns
Benchmark time: 8 CPUs 0.439996 s/step 5.09255 days/ns
Benchmark time: 16 CPUs 0.297472 s/step 3.44297 days/ns
Benchmark time: 32 CPUs 0.19687 s/step 2.27858 days/ns
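
To double-check that the processes really were spread over the compute nodes
this time, rather than just being counted by the "Running on N processors"
line, a quick check like the following works (node01 and node02 are again
placeholder host names):

   # list the namd2 processes running on each compute node
   for n in node01 node02; do
       echo "== $n =="
       ssh $n pgrep -l namd2
   done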

But I didn't get such good scalability when I benchmarked my own system, a
protein in explicit solvent with 13,539 atoms in total, using PBC along with
PME:

Benchmark time: 2 CPUs 0.258035 s/step 1.49326 days/ns
Benchmark time: 4 CPUs 0.133445 s/step 0.772251 days/ns
Benchmark time: 8 CPUs 0.117084 s/step 0.67757 days/ns
Benchmark time: 16 CPUs 0.0951627 s/step 0.55071 days/ns

There was only a slight speed-up with 8 and 16 CPUs: going from 2 to 16 CPUs
reduces the time per step from about 0.258 s to 0.095 s, i.e. roughly a factor
of 2.7 for 8 times as many processors. I have read on the mailing list that
scalability depends on the number of patches into which the system is divided,
which would mean that scalability varies with the system under study. If that
is so, how should I judge the performance of NAMD? Please correct me if I am
getting it wrong.
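
For what it is worth, the patch count for a given run can be read directly
from the NAMD log, and from what I understand on the list the twoAwayX/Y/Z
options can force more patches for small systems. This is only a sketch, with
mysystem.log standing in for my actual log file name:

   # how many patches NAMD created for this system
   grep "PATCH GRID" mysystem.log

   # in the NAMD configuration file, to double the number of patches
   # along x (something to try only for small systems with few patches)
   twoAwayX   on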

Meanwhile, I am trying to compile an MPI-based NAMD on the cluster so that it
uses the InfiniBand interconnect, as Axel has suggested, and will then compare
its performance with the Ethernet-based binaries.
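
The build steps I am following are roughly the ones from notes.txt and the
NamdOnInfiniBand wiki page Giacomo pointed to; the arch and config names below
are what I believe apply to this machine, so please treat them only as a
sketch that may need adjusting:

   # build Charm++ on top of the cluster's MPI library, which should use
   # the InfiniBand interconnect rather than Ethernet
   cd charm
   ./build charm++ mpi-linux-amd64

   # then configure NAMD for the matching architecture and compile
   cd ../NAMD_2.6_Source
   ./config fftw tcl Linux-amd64-MPI
   cd Linux-amd64-MPI
   make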

Thanks

With Regards

Ruchi

Giacomo Fiorin <gfiorin_at_seas.upenn.edu> wrote:
>
> No. First, 127.0.0.1 is the IP address of the local machine connecting to
> itself, anyway. It tries a remote login with rsh (the error message is
> quite informative!), and it fails.
>
> Then, if you use ++local, you'll only launch as many processES as you want,
> but you always stick with the same processORS. Unfortunately, the message
> "Info: Running on 4 processors." is misleading, because it only counts
> processES, without knowing where they are running.
>
> You should probably read notes.txt to see how to make charmrun aware of the
> list of nodes in your cluster. But you also have InfiniBand, so you should
> use it! The *-TCP binaries ignore it and use Ethernet instead. You definitely
> need to compile NAMD for your cluster.
>
> http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdOnInfiniBand
>
>
> On 11:14am, Ruchi Sachdeva wrote me:
>
>> Hi Giacomo,
>>
>> If I don't use ++local and run the job on 4 cpus, then I get the following
>> error in the log file:
>>
>> connect to address 127.0.0.1: No route to host
>> connect to address 127.0.0.1: No route to host
>> connect to address 127.0.0.1: No route to host
>> connect to address 127.0.0.1: No route to host
>> trying normal rsh (/usr/bin/rsh)
>> connect to address 127.0.0.1: No route to host
>> trying normal rsh (/usr/bin/rsh)
>> localhost.localdomain: No route to host
>> Charmrun> Error 1 returned from rsh (localhost:0)
>> No route to host
>> localhost.localdomain: No route to host
>>
>> And with ++local, the log file mentions the number of processors on which I
>> launch the job, like this:
>>
>> Info: Based on Charm++/Converse 50900 for net-linux-tcp-iccstatic
>> Info: Built Wed Aug 30 13:00:33 CDT 2006 by jim on verdun.ks.uiuc.edu
>> Info: 1 NAMD 2.6 Linux-i686-TCP 4 n98 rsachdeva
>> Info: Running on 4 processors.
>>
>> So that means the job is getting distributed on the right number of
>> processors. Isn't it? Am I getting it correct?
>>
>> Well, thanks for your reply
>>
>> Ruchi
>>
>> On 11/12/08, Giacomo Fiorin <gfiorin_at_seas.upenn.edu> wrote:
>> Hi Ruchi, if you use ++local, you'll keep running only on the
>> first
>> node. You actually create N processes, but they get distributed
>> always among two processors only.
>>
>> Giacomo
>>
>>
>> ---- -----
>> Giacomo Fiorin
>> Center for Molecular Modeling at
>> University of Pennsylvania
>> 231 S 34th Street, Philadelphia, PA 19104-6323
>> phone: (+1)-215-573-4773
>> fax: (+1)-215-573-6233
>> mobile: (+1)-267-324-7676
>> mail: giacomo.fiorin_<at>_gmail.com
>> web: http://www.cmm.upenn.edu/
>> ---- ----
>>
>>
>>
>>
>> On Wed, Nov 12, 2008 at 9:28 AM, Ruchi Sachdeva
>> <ruchi.namd_at_gmail.com> wrote:
>> > Dear All,
>> >
>> > I am using NAMD 2.6 (pre-compiled binaries) on a 288-node Linux (x86_64)
>> > cluster based on HP Intel Xeon ProLiant systems. It has a 10 Gbps
>> > InfiniBand cluster interconnect. I ran the apoA1 test job on different
>> > numbers of processors as follows:
>> >
>> > /nfshomen278/rsachdeva/NAMD_2.6_Linux-i686-TCP/charmrun
>> > /nfshomen278/rsachdeva/NAMD_2.6_Linux-i686-TCP/namd2 ++local +p2 apoa1.namd >> apoa1.log &
>> >
>> > The jobs were submitted using the bsub command. I got the following speed:
>> >
>> > Benchmark time: 1 CPUs 3.12916 s/step 36.2171 days/ns
>> >
>> > Benchmark time: 2 CPUs 1.62206 s/step 18.7738 days/ns
>> >
>> > Benchmark time: 4 CPUs 1.65563 s/step 19.1624 days/ns
>> >
>> > Benchmark time: 8 CPUs 1.64875 s/step 19.0828 days/ns
>> >
>> > Benchmark time: 16 CPUs 1.67945 s/step 19.4381 days/ns
>> >
>> > As we can see, CPU efficiency is not increasing beyond 2 CPUs. With 4 or
>> > more CPUs the runtime is not decreasing much; rather, it increases with 4
>> > and 16 CPUs. Can anybody please tell me why I am getting poor performance
>> > with a greater number of CPUs?
>> >
>> > Will I get better scalability if I compile NAMD on the cluster rather
>> > than using the pre-compiled binaries? And which version of NAMD would be
>> > better: Charm++-based or MPI-based?
>> >
>> > Thanks in advance
>> >
>> > Ruchi
>> >
>> >
>>
> --
> ---- -----
> Giacomo Fiorin
> Center for Molecular Modeling at
> University of Pennsylvania
> 231 S 34th Street, Philadelphia, PA 19104-6323
> phone: (+1)-215-573-4773
> fax: (+1)-215-573-6233
> mobile: (+1)-267-324-7676
> mail: giacomo.fiorin_<at>_gmail.com
> gfiorin_<at>_seas.upenn.edu
> web: http://www.cmm.upenn.edu/
> ---- ----
>
