Re: Not getting good speed up

From: Dow Hurst (Dow.Hurst_at_mindspring.com)
Date: Sun Mar 20 2005 - 02:22:00 CST

Well, the specs on the Cisco 2984 show that it is not wire-speed
non-blocking: it has 12Gbps of bandwidth and can burst to wire speed
only for short durations. The Dell can sustain non-blocking wire speed
on all ports. Plus, UDP packets, if received and not dropped, are
actually a bit more efficient, since NAMD doesn't require a reply packet
unless a packet is lost. So the TCP handshake does cost you a bit more
bandwidth, which you trade for the reliability. I think the Cisco switch
is good for what it is designed for, which is an office environment with
loads that vary quickly in time as people request web pages with lots of
content. It isn't really designed for a steady saturated load on all
ports. I think that is what is happening here. UDP performs better
because the Cisco isn't dropping as many packets, or maybe I should say
that you're using the available bandwidth more efficiently. Anyone have
a different thought on the situation? I've noticed that Ariel, the
latest cluster on the NAMD computing resources page, has an SMC 8624T
TigerSwitch that is wire-speed non-blocking on all ports. It has 48Gbps
of backplane bandwidth and also supports 9K jumbo packets. That might be
an alternative if you can return the Cisco 2984. Why are the locally
compiled binaries not performing as well as the downloaded binary? What
compiler have you used? Would the difference be the compiler itself or
the compiler options?
Dow
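
As a rough check on the backplane figures quoted above, here is a
minimal sketch of the arithmetic. It assumes 24 gigabit ports per switch
(the port count is an assumption, not stated in the thread) and takes
"non-blocking" to mean that every port can send and receive at line rate
at the same time. The function names are made up for illustration.

```python
# Back-of-the-envelope check of switch backplane capacity.
# Assumptions (not from the thread): 24 gigabit ports per switch, and
# "non-blocking" meaning all ports send and receive at line rate at once.

def required_backplane_gbps(ports, port_speed_gbps=1.0):
    """Backplane bandwidth needed for all ports at full-duplex wire speed."""
    return ports * port_speed_gbps * 2  # x2: send and receive simultaneously


def max_wirespeed_ports(backplane_gbps, port_speed_gbps=1.0):
    """How many ports a given backplane can keep at full-duplex wire speed."""
    return int(backplane_gbps // (port_speed_gbps * 2))


PORTS = 24  # assumed port count

needed = required_backplane_gbps(PORTS)
print(f"{PORTS} GigE ports need {needed:.0f} Gbps to be non-blocking")
for name, backplane in [("12 Gbps switch", 12.0), ("48 Gbps switch", 48.0)]:
    n = max_wirespeed_ports(backplane)
    print(f"{name}: {n} ports at sustained full-duplex wire speed")
```

Under those assumptions, a 12Gbps backplane can keep only 6 ports busy
at full-duplex wire speed, while 48Gbps covers all 24. That is in line
with the idea that a steady saturated load on all ports is more than the
Cisco can sustain.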

Mauricio Carrillo Tripp wrote:

>Hi again!
>Here's something to think about during the weekend, another couple of
>pieces to the puzzle:
>
>I ran the ApoA1 Benchmark on my new cluster LGRocks using either the
>Dell switch or the Cisco switch, here are the results:
>http://chem.acad.wabash.edu/~trippm/Clusters/performance.php#fig5
>I'm getting pretty much the same results as everybody else when using the
>Dell switch, no surprise there, but when using the problematic Cisco switch
>the scaling improves, considerably so for the compiled UDP
>version of namd2.
>
>Then what I did was to run the molecular system I've been using for the
>tests (~20K atoms), but using the SAME configuration as the ApoA1
>benchmark (obviously changing the structure, coordinates, and cell and
>PME dimensions; everything else left the same). When using the Dell
>switch I lose a little bit of performance compared to 92K atoms, but
>when I use the Cisco switch everything goes to hell again!
>http://chem.acad.wabash.edu/~trippm/Clusters/performance.php#fig6
>
>So, there is a clear system size dependency of the scaling, most
>notably when using the Cisco switch: the bigger the system, the better
>the scaling...
>
>
>
>On Tue, 15 Mar 2005 13:31:11 -0500 (EST), Dow_Hurst
><dhurst_at_mindspring.com> wrote:
>
>
>>Well, I have read that some GigE NICs and GigE switches do not have
>>good latency, while some 100Mb switches and NICs would demonstrate
>>better latency than GigE hardware. The low bandwidth of the current
>>Cisco switch at 12Gbps would mean that only a few ports could burst to
>>full-duplex wire speed at a time. Maybe you're running half duplex by
>>default with that switch as well. Some switches can tell you which
>>ports are full duplex or half duplex. Becker's tools for analyzing
>>ethernet cards from the Scyld website can tell you what your local NICs
>>are set at, and you can force full duplex with driver switches. I don't
>>know enough about how charm++ works yet to really comment more. ;-) It
>>is good you're doing such a thorough job documenting this phenomenon,
>>because how else will anyone know what will work? I sure wish we had on
>>the NAMD wiki a page of ranked hardware for good, exceptional, and poor
>>performance. We will be purchasing in a couple of weeks, so I am
>>working hard at identifying what hardware will work and that I can test
>>on. The jumbo packets idea probably requires each node to also have
>>matching MTUs and the proper kernel-level jumbo packet support. I also
>>don't know enough about that to really comment yet. ;-)
>>
>>One thing I have learned is that the current NAMD for x86 using TCP
>>runs just fine on an Opteron. I haven't yet compiled charm++ or NAMD
>>for the Opteron, so I can't say what kind of performance increase that
>>might yield. The last Cluster Computing magazine had an article talking
>>about how 32-bit code sometimes runs faster on a 64-bit CPU/kernel due
>>to how the code fits into the registers. It was a bit over my head but
>>was encouraging. My current two-node dual Opteron 246 (2.0GHz) test
>>cluster over GigE showed the apoa1 benchmark following the dual 3.06GHz
>>Xeon line on the NAMD performance benchmark graph exactly for 1-4 CPUs.
>>I have no other scaling data to add since the cluster is so small.
>>Best wishes,
>>Dow
>>
>>
>>-----Original Message-----
>>From: Mauricio Carrillo Tripp <trippm_at_gmail.com>
>>Sent: Mar 15, 2005 11:55 AM
>>To: Dow Hurst <Dow.Hurst_at_mindspring.com>
>>Cc: fellers_at_wabash.edu, namd-l_at_ks.uiuc.edu
>>Subject: Re: namd-l: Not getting good speed up
>>
>>Thank you for your suggestions,
>>
>>I completely disabled QoS on the switch (that would take care of the
>>two queues you mention?). The default MTU is 1500, but I changed this
>>value to smaller (750) or greater (up to 9198, jumbo size) values.
>>None of these changes had any effect on the scaling whatsoever; it is
>>still really bad.
>>We don't think it's a bandwidth issue, since we did a test with an older
>>switch (Cisco Catalyst 2900, 100TX) with a much lower bandwidth, and the scaling
>>was good (See Fig 4 at
>>http://chem.acad.wabash.edu/~trippm/Clusters/performance.php).
>>
>>We're still clueless...
>>
>>On Sun, 13 Mar 2005 15:59:38 -0500, Dow Hurst <Dow.Hurst_at_mindspring.com> wrote:
>>
>>
>>>You can look to see if QoS is limiting your rate in the Cisco switch. I
>>>can't see from the link you provided what the defaults are on the Cisco
>>>QoS settings. Maybe you're getting limited that way. It does have two
>>>queues for rate limiting, so you'd need to check both. Also, the cheaper
>>>switch has a higher bandwidth from what I can tell. I wonder whether you
>>>could do even better with a jumbo-packet-enabled switch.
>>>The MTU on the Cisco is 1522. I can't tell from the links on the Dell
>>>what it supports as max MTU.
>>>Dow
>>>
>>>
>>>Mauricio Carrillo Tripp wrote:
>>>
>>>
>>>
>>>>Hi again,
>>>>
>>>>I've tried a lot of things: turning off the internal firewall on the
>>>>compute nodes, turning ITR off, compiling my own namd2 versions
>>>>(tcp, udp, mpi), and nothing helped to improve the scaling considerably.
>>>>
>>>>As my last resort, I swapped the switches, and to my surprise,
>>>>the cheap switch (Dell) gave me the scaling I was expecting, in comparison
>>>>with the crappy scaling the expensive switch (Cisco) is giving (look
>>>>at Fig. 3 at
>>>>http://chem.acad.wabash.edu/~trippm/Clusters/performance.php).
>>>>So now I know the problem is in the Cisco switch.
>>>>I've also tried disabling spantree, enabling fastport and setting the ports to
>>>>be 1000tx all the time instead of auto detect. Nothing changes.
>>>>
>>>>Any other thoughts or suggestions??
>>>>
>>>>
>>>>
>>>>
>>>>On Thu, 3 Mar 2005 14:44:55 -0500, Mauricio Carrillo Tripp
>>>><trippm_at_gmail.com> wrote:
>>>>
>>>>
>>>>
>>>>
>>>>>sorry about that stupid mistake, it's building now...
>>>>>
>>>>>I tried the option +giga (and also +strategy USE_MESH and +strategy USE_GRID).
>>>>>All of them improved the scaling a little, but not enough.
>>>>>It is true that the new cluster is using Rocks, and that's why I want
>>>>>to compare the behaviour of all the different versions of charm++...
>>>>>I'll keep you all posted on my findings...
>>>>>Thanks.
>>>>>
>>>>>
>>>>>On Thu, 03 Mar 2005 13:34:56 -0600, Gengbin Zheng <gzheng_at_ks.uiuc.edu> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>For UDP namd, use the command line option "+giga", which turns on some
>>>>>>settings (such as UDP window protocol settings) that could speed up the
>>>>>>performance. You can give it a try.
>>>>>>
>>>>>>I noticed that the OS of your new cluster is Rocks. I am not familiar with
>>>>>>it, but I assume it does something different in the way it launches a
>>>>>>parallel job.
>>>>>>
>>>>>>The error I have when building charm:
>>>>>>
>>>>>>make: *** No rule to make target `charmm++'. Stop.
>>>>>>-------------------------------------------------
>>>>>>Charm++ NOT BUILT. Either cd into mpi-linux-icc/tmp and try
>>>>>>
>>>>>>This is because you misspelled charm++ as charmm++.
>>>>>>
>>>>>>Gengbin
>>>>>>
>>>>>>Mauricio Carrillo Tripp wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>Hi Gengbin, thanks for your answer. I did the comparison you recommend,
>>>>>>>TCP vs UDP (I didn't compile from source though, I used the
>>>>>>>executables NAMD supplies). The results are on Fig 2 at
>>>>>>>http://chem.acad.wabash.edu/~trippm/Clusters/performance.php.
>>>>>>>Indeed, I get an increase in performance but not good enough.
>>>>>>>Using the TCP version on the old cluster (lg66) did show good scaling,
>>>>>>>but that's not the case for the new cluster (lgrocks).
>>>>>>>Any ideas why this is, anybody?
>>>>>>>
>>>>>>>I'm trying to compile different versions of charm++ (gcc, intel, tcp,
>>>>>>>udp, mpi), to compare them using converse/pingpong,
>>>>>>>although I'm having trouble building the mpi version,
>>>>>>>I haven't found an example on how to do it, and all I get is:
>>>>>>>
>>>>>>>./build charmm++ mpi-linux icc --libdir="/opt/lam/intel/lib" --incdir="/opt/lam/intel/include"
>>>>>>>Selected Compiler: icc
>>>>>>>Selected Options:
>>>>>>>Copying src/scripts/Makefile to mpi-linux-icc/tmp
>>>>>>>Soft-linking over bin
>>>>>>>Soft-linking over lib
>>>>>>>Soft-linking over lib_so
>>>>>>>Soft-linking over include
>>>>>>>Soft-linking over tmp
>>>>>>>Generating mpi-linux-icc/tmp/conv-mach-pre.sh
>>>>>>>Performing 'make charmm++ OPTS=' in mpi-linux-icc/tmp
>>>>>>>make: *** No rule to make target `charmm++'. Stop.
>>>>>>>-------------------------------------------------
>>>>>>>Charm++ NOT BUILT. Either cd into mpi-linux-icc/tmp and try
>>>>>>>
>>>>>>>any help will be appreciated!
>>>>>>>
>>>>>>>Thanks again.
>>>>>>>
>>>>>>>
>>>>>>>On Wed, 02 Mar 2005 21:42:39 -0600, Gengbin Zheng <gzheng_at_ks.uiuc.edu> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>Hi Mauricio,
>>>>>>>>
>>>>>>>>With the NAMD-tcp version, Charm is not compiled on top of MPI; the
>>>>>>>>communication is based on native TCP sockets, that is, Charm++ itself
>>>>>>>>implements its message passing functions using TCP sockets.
>>>>>>>>I cannot provide a reason to explain why the scaling is so bad, because
>>>>>>>>I don't think it should behave like that.
>>>>>>>>You can run the Charm pingpong tests (available at
>>>>>>>>charm/pgms/converse/pingpong) and see what the pingpong one-way
>>>>>>>>latency is, to compare with MPI.
>>>>>>>>
>>>>>>>>In fact, I recommend you compile a UDP socket version of charm and NAMD
>>>>>>>>from source as a comparison (that is the net-linux version of charm and
>>>>>>>>the Linux-i686 version of NAMD).
>>>>>>>>We have seen NAMD running with good scaling on gigabit ethernet.
>>>>>>>>
>>>>>>>>Gengbin
>>>>>>>>
>>>>>>>>Mauricio Carrillo Tripp wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>Hi,
>>>>>>>>>
>>>>>>>>>Some time ago I started using NAMD on a 16 node cluster
>>>>>>>>>with good results. I downloaded the executables (tcp version,
>>>>>>>>>the recommended one for gigabit networks) and everything
>>>>>>>>>ran smoothly. The speed up was good (see Fig 1 at
>>>>>>>>>http://chem.acad.wabash.edu/~trippm/Clusters/performance.php),
>>>>>>>>>although maybe it could be improved, which brings me to the real
>>>>>>>>>issue: we got a new cluster, I did the same as before, but
>>>>>>>>>I noticed that the simulations were running a lot slower.
>>>>>>>>>I did the same analysis as I did with the old cluster
>>>>>>>>>and I found that the speed up was just terrible. I tried the other
>>>>>>>>>executable version of NAMD2.5 and things went a little better,
>>>>>>>>>but not quite as well as I know they could have (see Fig 2 at
>>>>>>>>>http://chem.acad.wabash.edu/~trippm/Clusters/performance.php).
>>>>>>>>>I also found a big difference between MPICH and LAM/MPI. The
>>>>>>>>>latter is the only MPI library installed on the old cluster.
>>>>>>>>>So, these results clearly show that the problem lies in the communication
>>>>>>>>>(cpu speed up is good)
>>>>>>>>>and they suggest that charm++ is behaving like MPICH (or worse),
>>>>>>>>>but I don't know the details of how charm++ works, i.e., does it
>>>>>>>>>rely on the MPI libraries? If so, how can I tell it which one to use?
>>>>>>>>>If not, how can I optimize its performance? (Is there a way to measure
>>>>>>>>>it in a similar way as NetPIPE does?). I would like to take maximum
>>>>>>>>>advantage when running on 32 processors...
>>>>>>>>>
>>>>>>>>>Any advice will be appreciated. Thanks.
>>>>>>>>>
>>>>>--
>>>>>Mauricio Carrillo Tripp, PhD
>>>>>Department of Chemistry
>>>>>Wabash College
>>>>>trippm_at_wabash.edu
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>--
>>
>>Mauricio Carrillo Tripp, PhD
>>Department of Chemistry
>>Wabash College
>>trippm_at_wabash.edu
>>http://chem.acad.wabash.edu/~trippm
>>
>>No sig.
>>
>>
>>
>
>
>
>
