Re: Not getting good speed up

From: Dow_Hurst (dhurst_at_mindspring.com)
Date: Tue Mar 15 2005 - 12:31:11 CST

Well, I have read that some GigE NICs and GigE switches do not have good latency, while some 100Mb switches and NICs can demonstrate better latency than GigE hardware. The low backplane bandwidth of the current Cisco switch, 12Gbps, means that only a few ports can burst to full-duplex wire speed at a time. Maybe you're running half-duplex by default with that switch as well. Some switches can tell you which ports are full-duplex or half-duplex. Becker's tools for analyzing ethernet cards, from the Scyld website, can tell you what your local NICs are set at, and you can force full duplex with driver switches. I don't know enough about how charm++ works yet to really comment more. ;-) It is good you're doing such a thorough job documenting this phenomenon, because how else will anyone know what will work? I sure wish we had a page on the NAMD wiki ranking hardware for good, exceptional, and poor performance. We will be purchasing in a couple of weeks, so I am working hard at identifying what hardware will work and that I can test on. The jumbo packets idea probably requires each node to also have matching MTUs and the proper kernel-level jumbo packet support. I also don't know enough about that to really comment yet. ;-)
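[Editor's sketch, not part of the original mail: one way to check what a NIC actually negotiated on a Linux node. The interface name eth0 is a placeholder, and the exact mii-tool/ethtool flags depend on the driver.]

```shell
# Sketch only: "eth0" is an assumed interface name; substitute your own.
# Becker's mii-tool (from the Scyld site) reports the negotiated MII state:
#   mii-tool -v eth0
# Newer drivers expose the same information through ethtool:
#   ethtool eth0 | grep -E 'Speed|Duplex'
# Forcing gigabit full duplex instead of autonegotiation (driver-dependent):
#   ethtool -s eth0 speed 1000 duplex full autoneg off
# What to look for in the reported state, e.g.:
state="Speed: 1000Mb/s
Duplex: Half"
if echo "$state" | grep -q 'Duplex: Half'; then
  echo "half-duplex link: expect collisions and poor latency under load"
fi
```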

One thing I have learned is that the current NAMD x86 TCP build runs just fine on an Opteron. I haven't yet compiled charm++ or NAMD for the Opteron, so I can't say what kind of performance increase that might yield. The last Cluster Computing magazine had an article about how 32-bit code sometimes runs faster on a 64-bit CPU/kernel due to how the code fits into the registers. It was a bit over my head but was encouraging. My current tests on a two-node, dual 246 2.0GHz Opteron cluster over GigE showed the apoa1 benchmark following, for 1-4 CPUs, the dual 3.06GHz Xeon line on the NAMD performance benchmark graph exactly. I have no other scaling data to add since the cluster is so small.
Best wishes,
Dow

-----Original Message-----
From: Mauricio Carrillo Tripp <trippm_at_gmail.com>
Sent: Mar 15, 2005 11:55 AM
To: Dow Hurst <Dow.Hurst_at_mindspring.com>
Cc: fellers_at_wabash.edu, namd-l_at_ks.uiuc.edu
Subject: Re: namd-l: Not getting good speed up

Thank you for your suggestions,

I completely disabled QoS on the switch (that should take care of the
two queues you mention?). The default MTU is 1500, but I changed this
value to smaller (750) and greater (up to 9198, jumbo size) values.
None of these changes had any effect on the scaling whatsoever; it's still
really bad.
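[Editor's sketch, not part of the original mail: a quick way to confirm whether jumbo frames actually survive the path between two nodes. IPv4 adds a 20-byte header and ICMP an 8-byte header, so the largest unfragmented ping payload is MTU minus 28 bytes; with "don't fragment" set, the ping fails outright if any hop is still at MTU 1500 instead of silently fragmenting. The node name below is made up.]

```shell
# Hypothetical check, assuming Linux ping and a 9000-byte jumbo MTU.
MTU=9000
OVERHEAD=28                 # 20-byte IPv4 header + 8-byte ICMP header
PAYLOAD=$((MTU - OVERHEAD))
echo "probe payload: $PAYLOAD bytes"
# -M do sets "don't fragment"; node02 is a placeholder hostname:
#   ping -c 3 -M do -s "$PAYLOAD" node02
```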
We don't think it's a bandwidth issue, since we did a test with an older
switch (Cisco Catalyst 2900, 100TX) with a much lower bandwidth, and the scaling
was good (See Fig 4 at
http://chem.acad.wabash.edu/~trippm/Clusters/performance.php).

We're still clueless...

On Sun, 13 Mar 2005 15:59:38 -0500, Dow Hurst <Dow.Hurst_at_mindspring.com> wrote:
> You can look to see if QoS is limiting your rate in the Cisco switch. I
> can't see from the link you provided what the defaults are on the Cisco
> QoS settings. Maybe you're getting limited that way. It does have two
> queues for rate limiting so you'd need to check both. Also, the cheaper
> switch has a higher bandwidth from what I can tell. I wonder if you
> could do even better with a jumbo-packet-enabled switch.
> The MTU on the Cisco is 1522. Can't tell from the links on the Dell
> what it supports as max MTU.
> Dow
>
>
> Mauricio Carrillo Tripp wrote:
>
> >Hi again,
> >
> >I've tried a lot of things: turning off the internal firewall on the
> >compute nodes,
> >turning ITR off, compiling my own namd2 versions (tcp, udp, mpi),
> >and nothing helped to considerably improve the scaling.
> >
> >As my last resource, I swapped the switches, and to my surprise,
> >the cheap switch (Dell) gave me the scaling I was expecting, in comparison
> >with the crappy scaling the expensive switch (Cisco) is giving (look
> >at Fig. 3 at
> >http://chem.acad.wabash.edu/~trippm/Clusters/performance.php).
> >So now I know the problem is in the Cisco switch.
> >I've also tried disabling spantree, enabling fastport and setting the ports to
> >be 1000tx all the time instead of auto detect. Nothing changes.
> >
> >Any other thoughts or suggestions??
> >
> >
> >
> >
> >On Thu, 3 Mar 2005 14:44:55 -0500, Mauricio Carrillo Tripp
> ><trippm_at_gmail.com> wrote:
> >
> >
> >>sorry about that stupid mistake, it's building now...
> >>
> >>I tried the option +giga (and also +strategy USE_MESH and +strategy USE_GRID)
> >>All of them improved the scaling a little, though not enough.
> >>It is true that the new cluster is using Rocks, and that's why I want
> >>to compare the
> >>behaviour of all the different versions of charm++...
> >>I'll keep you all posted on my findings...
> >>Thanks.
> >>
> >>
> >>On Thu, 03 Mar 2005 13:34:56 -0600, Gengbin Zheng <gzheng_at_ks.uiuc.edu> wrote:
> >>
> >>
> >>>For UDP NAMD, use the command-line option "+giga", which turns on some
> >>>settings (such as UDP window protocol settings) that could speed up
> >>>performance. You can give it a try.
> >>>
> >>>I noticed that the OS of your new cluster is Rocks. I am not familiar
> >>>with it, but I assume it does something different in the way it launches
> >>>a parallel job.
> >>>
> >>>The error I have when building charm:
> >>>
> >>>make: *** No rule to make target `charmm++'. Stop.
> >>>-------------------------------------------------
> >>>Charm++ NOT BUILT. Either cd into mpi-linux-icc/tmp and try
> >>>
> >>>This is because you misspelled charm++ as charmm++.
> >>>
> >>>Gengbin
> >>>
> >>>Mauricio Carrillo Tripp wrote:
> >>>
> >>>
> >>>
> >>>>Hi Gengbin, thanks for your answer. I did the comparison you recommended,
> >>>>TCP vs UDP (I didn't compile from source though, I used the
> >>>>executables NAMD supplies). The results are on Fig 2 at
> >>>>http://chem.acad.wabash.edu/~trippm/Clusters/performance.php.
> >>>>Indeed, I get an increase in performance but not good enough.
> >>>>Using the TCP version on the old cluster (lg66) did show good scaling,
> >>>>but that's not the case for the new cluster (lgrocks).
> >>>>Any ideas why this is, anybody?
> >>>>
> >>>>I'm trying to compile different versions of charm++ (gcc, intel, tcp,
> >>>>udp, mpi), to compare them using converse/pingpong,
> >>>>although I'm having trouble building the mpi version,
> >>>>I haven't found an example on how to do it, and all I get is:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>./build charmm++ mpi-linux icc --libdir="/opt/lam/intel/lib"
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>--incdir="/opt/lam/intel/include"
> >>>>Selected Compiler: icc
> >>>>Selected Options:
> >>>>Copying src/scripts/Makefile to mpi-linux-icc/tmp
> >>>>Soft-linking over bin
> >>>>Soft-linking over lib
> >>>>Soft-linking over lib_so
> >>>>Soft-linking over include
> >>>>Soft-linking over tmp
> >>>>Generating mpi-linux-icc/tmp/conv-mach-pre.sh
> >>>>Performing 'make charmm++ OPTS=' in mpi-linux-icc/tmp
> >>>>make: *** No rule to make target `charmm++'. Stop.
> >>>>-------------------------------------------------
> >>>>Charm++ NOT BUILT. Either cd into mpi-linux-icc/tmp and try
> >>>>
> >>>>any help will be appreciated!
> >>>>
> >>>>Thanks again.
> >>>>
> >>>>
> >>>>On Wed, 02 Mar 2005 21:42:39 -0600, Gengbin Zheng <gzheng_at_ks.uiuc.edu> wrote:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>Hi Mauricio,
> >>>>>
> >>>>>With the NAMD TCP version, Charm++ is not compiled on top of MPI; the
> >>>>>communication is based on native TCP sockets. That is, Charm++ itself
> >>>>>implements its message-passing functions using TCP sockets.
> >>>>>I cannot provide a reason to explain why the scaling is so bad, because
> >>>>>I don't think it should behave like that.
> >>>>>You can run the Charm++ pingpong tests (available at
> >>>>>charm/pgms/converse/pingpong) and see what the one-way pingpong
> >>>>>latency is, to compare with MPI.
> >>>>>
> >>>>>In fact, I recommend you compile a UDP socket version of Charm++ and
> >>>>>NAMD from source as a comparison (it is the net-linux version of
> >>>>>Charm++, and the Linux-i686 version of NAMD).
> >>>>>We have seen NAMD running with good scaling with gigabit ethernet.
> >>>>>
> >>>>>Gengbin
> >>>>>
> >>>>>Mauricio Carrillo Tripp wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>>Hi,
> >>>>>>
> >>>>>>Some time ago I started using NAMD on a 16 node cluster
> >>>>>>with good results. I downloaded the executables (the tcp version,
> >>>>>>the recommended one for gigabit networks) and everything
> >>>>>>ran smoothly. The speed up was good (see Fig 1 at
> >>>>>>http://chem.acad.wabash.edu/~trippm/Clusters/performance.php),
> >>>>>>although maybe it could be improved, which takes me to the real
> >>>>>>issue: we got a new cluster, I did the same as before, but
> >>>>>>I noticed that the simulations were running a lot slower.
> >>>>>>I did the same analysis as I did with the old cluster
> >>>>>>and I found that the speed up was just terrible. I tried the other
> >>>>>>executable version of NAMD2.5 and things went a little better
> >>>>>>but not quite as good as I know they could've been (see Fig 2 at
> >>>>>>http://chem.acad.wabash.edu/~trippm/Clusters/performance.php).
> >>>>>>I also found a big difference between MPICH and LAM/MPI. The
> >>>>>>latter is the only MPI library installed in the old cluster.
> >>>>>>So, these results clearly show that the problem lies in the communication
> >>>>>>(cpu speed up is good)
> >>>>>>and they suggest that charm++ is behaving like MPICH (or worse),
> >>>>>>but I don't know the details of how charm++ works, i.e., does it
> >>>>>>rely on the MPI libraries? If so, how can I tell it which one to use?
> >>>>>>If not, how can I optimize its performance? (Is there a way to measure
> >>>>>>it in a similar way as NetPIPE does?). I would like to take the maximum
> >>>>>>advantage when running on 32 processors...
> >>>>>>
> >>>>>>Any advice will be appreciated. Thanks.
> >>>>>>
> >>>>>>
> >>>>
> >>--
> >>Mauricio Carrillo Tripp, PhD
> >>Department of Chemistry
> >>Wabash College
> >>trippm_at_wabash.edu
> >>
> >>
> >>
> >
> >
> >
> >
>

-- 
Mauricio Carrillo Tripp, PhD
Department of Chemistry
Wabash College
trippm_at_wabash.edu
http://chem.acad.wabash.edu/~trippm

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:40:36 CST