Re: Re: Re: Running NAMD on Forge (CUDA)

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Mon Jul 16 2012 - 02:11:45 CDT

Hi again,

I use stepspercycle 20 and fullelectfrequency 4.
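A minimal sketch of how these multiple-timestepping settings would appear in a
NAMD configuration file; the 2 fs timestep is illustrative (Norman's own
timestep is not stated in the thread):

   timestep            2.0   ;# 2 fs integration step (illustrative)
   stepspercycle       20    ;# timesteps per cycle (atoms reassigned to patches once per cycle)
   fullElectFrequency  4     ;# full PME electrostatics evaluated every 4 steps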

I also get about a 6x speedup compared to CPU only (two Tesla C2060 cards per
node, each shared by 6 cores).

Also note that 6 GPUs per node is not an optimal configuration, as the PCI-E
bandwidth is shared by all GPUs.

Norman Geist.

> -----Original Message-----
> From: Gianluca Interlandi [mailto:gianluca_at_u.washington.edu]
> Sent: Sunday, July 15, 2012 03:59
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: Re: Re: namd-l: Running NAMD on Forge (CUDA)
>
> Hi Norman,
>
> > OK, then it's 1 I guess. This is bad for GPU simulations, as the
> > electrostatics are done on the CPU. This causes a lot of traffic between
> > CPU and GPU and congests the PCI-E bus. Additionally, 6 GPUs likely also
> > need a lot of PCI-E bandwidth, so it's likely that the performance of the
> > GPUs is not as expected. You should try setting fullelectfrequency to at
> > least 4 and try out the new molly parameter. This should cause less
> > traffic on the PCI-E bus and improve GPU utilization, but it does slightly
> > harm energy conservation, which shows up as a slightly increasing
> > temperature. With the molly parameter it should be OK, I think.
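The "molly" option referred to above (the mollified impulse method) would be
enabled roughly as follows; the tolerance and iteration values are illustrative
and not taken from this thread:

   fullElectFrequency  4        ;# full electrostatics every 4 steps, as suggested above
   molly               on       ;# mollified impulse method to stabilize multiple timestepping
   mollyTolerance      0.00001  ;# illustrative value; see the NAMD User's Guide
   mollyIterations     100      ;# illustrative value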
>
> I followed your recommendation. Now it runs almost twice as fast on 6 CUDA
> devices compared with the configuration without molly and without
> multistepping. I get 0.06 sec/step (versus 0.1 sec/step). On the other hand,
> running on the 16 CPUs with the same configuration takes 0.12 sec/step. So I
> get a speedup of 2x with CUDA (6 CUDA devices vs 16 CPU cores). As a
> comparison, I get 0.08 sec/step on 4 CUDA devices, 0.14 sec/step on 2
> devices, and 0.25 sec/step on 1 device.
>
> To be honest, I was expecting a lot more from CUDA. It seems that one M2070
> (0.25 sec/step) is almost equivalent in performance to one 8-core
> Magny-Cours CPU (0.22 sec/step). Or maybe it's just that CPU manufacturers
> have caught up, as I already mentioned.
>
> Gianluca
>
> >>> How many GPUs are there per node in this cluster?
> >>
> >> 6
> >>
> >>> What kind of interconnect?
> >>
> >> Infiniband.
> >
> > Please make sure, if you are running over multiple nodes, that you make
> > use of the InfiniBand interconnect. For that you need an ibverbs binary of
> > NAMD, or IPoIB must be installed. You can see whether IPoIB is working if
> > there is, for example, an ib0 interface when you run ifconfig. Also, as I
> > have observed, IPoIB should be configured in connected mode with an MTU of
> > about 65520 (cat /sys/class/net/ib0/mode or .../mtu to see the current
> > settings).
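A quick way to check these points from a shell on a compute node, using the
commands named above (ib0 is assumed to be the IPoIB interface name):

   # Is there an IPoIB interface at all?
   ifconfig ib0

   # Connected mode and a large MTU (~65520) are recommended:
   cat /sys/class/net/ib0/mode
   cat /sys/class/net/ib0/mtu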
> >
> >>
> >> Here are all specs:
> >>
> >>
> >> http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/DellNVIDIACluster/TechSummary/index.html
> >>
> >> Thanks,
> >>
> >> Gianluca
> >>
> >>> Norman Geist.
> >>>
> >>>> -----Original Message-----
> >>>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> >>>> Behalf Of Gianluca Interlandi
> >>>> Sent: Friday, July 13, 2012 00:26
> >>>> To: Aron Broom
> >>>> Cc: NAMD list
> >>>> Subject: Re: namd-l: Running NAMD on Forge (CUDA)
> >>>>
> >>>> Yes, I was totally surprised, too. I also ran a non-CUDA job on Forge
> >>>> using 16 CPUs. I got 0.122076 s/step, which is 16% slower than using
> >>>> the 6 GPUs (0.1 s/step) and a bit slower than the 0.10932 s/step that I
> >>>> get on Trestles using 16 cores. This difference might be statistical
> >>>> fluctuation, though (or configuration setup), since Forge and Trestles
> >>>> have exactly the same CPU, i.e., the eight-core 2.4 GHz Magny-Cours.
> >>>>
> >>>> Yes, Forge also uses the NVIDIA M2070.
> >>>>
> >>>> I keep thinking of this guy here in Seattle who works for NVIDIA
> >>>> downtown, and a few years ago he asked me: "How come you don't use
> >>>> CUDA?" Maybe the code still needs some optimization, and CPU
> >>>> manufacturers have been doing everything to catch up.
> >>>>
> >>>> Gianluca
> >>>>
> >>>> On Thu, 12 Jul 2012, Aron Broom wrote:
> >>>>
> >>>>> So your speed for 1 or 2 GPUs (based on what you sent) is about 1.7
> >>>>> ns/day, which seems decent given the system size.  I was getting 2.0
> >>>>> and 2.6 ns/day for a 100k-atom system with roughly those same
> >>>>> parameters (and also 6 CPU cores), so given a scaling of ~nlogn, I
> >>>>> would expect to see ~1.5 to 2.0 ns/day for you.  So in my mind, the
> >>>>> speed you are getting with the GPUs isn't so surprising; it's that you
> >>>>> get such a good speed with only the CPUs that shocks me.  In my case I
> >>>>> didn't see speeds matching my 1 GPU until 48 CPU cores alone.  Seems
> >>>>> like those Magny-Cours are pretty awesome.
> >>>>>
> >>>>> Which GPUs are you using?  I was using mainly the M2070s.
> >>>>>
> >>>>> Also, one thing that might be useful, if you are able to get roughly
> >>>>> the same speed with 6 cores and 2 GPUs as you get with 16 cores alone,
> >>>>> is to test running 3 jobs at once, with 5 cores and 2 GPUs assigned to
> >>>>> each, and see how much slowdown there is.  You might be able to
> >>>>> benefit from various replica techniques more than just hitting a
> >>>>> single job with more power.
> >>>>>
> >>>>> Still, the overall conclusion from what you've got seems to be that it
> >>>>> makes more sense to go with more of those CPUs rather than putting
> >>>>> GPUs in there.
> >>>>>
> >>>>> ~Aron
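A minimal sketch of the three-concurrent-jobs test suggested above, assuming
the multicore-CUDA build mentioned later in the thread; the core/GPU split,
paths, and configuration file names are illustrative:

   # One 16-core, 6-GPU node: three independent jobs, each given
   # 5 CPU cores and 2 dedicated GPUs via +devices.
   ./namd2 +p5 +devices 0,1 +idlepoll job1.conf > job1.log &
   ./namd2 +p5 +devices 2,3 +idlepoll job2.conf > job2.log &
   ./namd2 +p5 +devices 4,5 +idlepoll job3.conf > job3.log &
   wait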
> >>>>>
> >>>>> On Thu, Jul 12, 2012 at 4:58 PM, Gianluca Interlandi
> >>>>> <gianluca_at_u.washington.edu> wrote:
> >>>>> What are your simulation parameters:
> >>>>>
> >>>>> timestep (and also any multistepping values)
> >>>>>
> >>>>> 2 fs, SHAKE, no multistepping
> >>>>>
> >>>>> cutoff (and also the pairlist and PME grid spacing)
> >>>>>
> >>>>> 8-10-12  PME grid spacing ~ 1 A
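Those answers correspond roughly to the following NAMD configuration lines; the
mapping of "8-10-12" to switchdist/cutoff/pairlistdist and the rigidBonds
choice are assumptions for illustration:

   timestep            2.0    ;# 2 fs
   rigidBonds          all    ;# SHAKE-style constraints (assumed to cover all bonds to hydrogen)
   switching           on
   switchdist          8.0    ;# assuming "8-10-12" = switch / cutoff / pairlist distances
   cutoff              10.0
   pairlistdist        12.0
   PMEGridSpacing      1.0    ;# ~1 A PME grid spacing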
> >>>>>
> >>>>> Have you tried giving it just 1 or 2 GPUs alone (using the
> >>>>> +devices)?
> >>>>>
> >>>>>
> >>>>> Yes, this is the benchmark time:
> >>>>>
> >>>>> np 1:  0.48615 s/step
> >>>>> np 2:  0.26105 s/step
> >>>>> np 4:  0.14542 s/step
> >>>>> np 6:  0.10167 s/step
> >>>>>
> >>>>> I also post here part of the log from running on 6 devices (in case
> >>>>> it is helpful for localizing the problem):
> >>>>>
> >>>>> Pe 4 has 57 local and 64 remote patches and 1066 local and 473 remote computes.
> >>>>> Pe 1 has 57 local and 65 remote patches and 1057 local and 482 remote computes.
> >>>>> Pe 5 has 57 local and 56 remote patches and 1150 local and 389 remote computes.
> >>>>> Pe 2 has 57 local and 57 remote patches and 1052 local and 487 remote computes.
> >>>>> Pe 3 has 58 local and 57 remote patches and 1079 local and 487 remote computes.
> >>>>> Pe 0 has 57 local and 57 remote patches and 1144 local and 395 remote computes.
> >>>>>
> >>>>> Gianluca
> >>>>>
> >>>>>       Gianluca
> >>>>>
> >>>>>       On Thu, 12 Jul 2012, Aron Broom wrote:
> >>>>>
> >>>>>             have you tried the multicore build?  I wonder if the
> >>>>>             prebuilt smp one is just not working for you.
> >>>>>
> >>>>>             On Thu, Jul 12, 2012 at 3:21 PM, Gianluca Interlandi
> >>>>>             <gianluca_at_u.washington.edu> wrote:
> >>>>>                         are other people also using those GPUs?
> >>>>>
> >>>>>
> >>>>>             I don't think so since I reserved the entire node.
> >>>>>
> >>>>>                   What are the benchmark timings that you are given
> >>>>>                   after ~1000 steps?
> >>>>>
> >>>>>
> >>>>>             The benchmark time with 6 processes is 101 sec for 1000
> >>>>>             steps. This is only slightly faster than Trestles, where I
> >>>>>             get 109 sec for 1000 steps running on 16 CPUs. So, yes, 6
> >>>>>             GPUs on Forge are much faster than 6 cores on Trestles,
> >>>>>             but in terms of SUs it makes no difference, since on Forge
> >>>>>             I still have to reserve the entire node (16 cores).
> >>>>>
> >>>>>             Gianluca
> >>>>>
> >>>>>                   is some setup time.
> >>>>>
> >>>>>                   I often run a system of ~100,000 atoms, and I
> >>>>>                   generally see an order of magnitude improvement in
> >>>>>                   speed compared to the same number of cores without
> >>>>>                   the GPUs.  I would test the non-CUDA precompiled
> >>>>>                   code on your Forge system and see how that compares;
> >>>>>                   it might be the fault of something other than CUDA.
> >>>>>
> >>>>>                   ~Aron
> >>>>>
> >>>>>                   On Thu, Jul 12, 2012 at 2:41 PM, Gianluca
> >>>>>                   Interlandi <gianluca_at_u.washington.edu> wrote:
> >>>>>                         Hi Aron,
> >>>>>
> >>>>>                         Thanks for the explanations. I don't know
> >>>>>                         whether I'm doing everything right. I don't
> >>>>>                         see any speed advantage running on the CUDA
> >>>>>                         cluster (Forge) versus running on a non-CUDA
> >>>>>                         cluster.
> >>>>>
> >>>>>                         I did the following benchmarks on Forge (the
> >>>>>                         system has 127,000 atoms and ran for 1000
> >>>>>                         steps):
> >>>>>
> >>>>>                         np 1:  506 sec
> >>>>>                         np 2:  281 sec
> >>>>>                         np 4:  163 sec
> >>>>>                         np 6:  136 sec
> >>>>>                         np 12: 218 sec
> >>>>>
> >>>>>                         On the other hand, running the same system on
> >>>>>                         16 cores of Trestles (AMD Magny-Cours) takes
> >>>>>                         129 sec. It seems that I'm not really making
> >>>>>                         good use of SUs by running on the CUDA
> >>>>>                         cluster. Or maybe I'm doing something wrong?
> >>>>>                         I'm using the ibverbs-smp-CUDA pre-compiled
> >>>>>                         version of NAMD 2.9.
> >>>>>
> >>>>>                         Thanks,
> >>>>>
> >>>>>                              Gianluca
> >>>>>
> >>>>>                         On Tue, 10 Jul 2012, Aron Broom wrote:
> >>>>>
> >>>>>                               if it is truly just one node, you can
> >>>>>                               use the multicore-CUDA version and
> >>>>>                               avoid the MPI charmrun stuff.  Still,
> >>>>>                               it boils down to much the same thing I
> >>>>>                               think.  If you do what you've done
> >>>>>                               below, you are running one job with 12
> >>>>>                               CPU cores and all GPUs.  If you don't
> >>>>>                               specify the +devices, NAMD will
> >>>>>                               automatically find the available GPUs,
> >>>>>                               so I think the main benefit of
> >>>>>                               specifying them is when you are running
> >>>>>                               more than one job and don't want the
> >>>>>                               jobs sharing GPUs.
> >>>>>
> >>>>>                               I'm not sure you'll see great scaling
> >>>>>                               across 6 GPUs for a single job, but
> >>>>>                               that would be great if you did.
> >>>>>
> >>>>>                               ~Aron
> >>>>>
> >>>>>                               On Tue, Jul 10, 2012 at 1:14 PM,
> >>>>>                               Gianluca Interlandi
> >>>>>                               <gianluca_at_u.washington.edu> wrote:
> >>>>>                                     Hi,
> >>>>>
> >>>>>                                     I have a question concerning
> >>>>>                                     running NAMD on a CUDA cluster.
> >>>>>
> >>>>>                                     NCSA Forge has, for example, 6
> >>>>>                                     CUDA devices and 16 CPU cores per
> >>>>>                                     node. If I want to use all 6 CUDA
> >>>>>                                     devices in a node, how many
> >>>>>                                     processes is it recommended to
> >>>>>                                     spawn? Do I need to specify
> >>>>>                                     "+devices"?
> >>>>>
> >>>>>                                     So, if for example I want to
> >>>>>                                     spawn 12 processes, do I need to
> >>>>>                                     specify:
> >>>>>
> >>>>>                                     charmrun +p12 -machinefile $PBS_NODEFILE +devices 0,1,2,3,4,5 namd2 +idlepoll
> >>>>>
> >>>>>                                     Thanks,
> >>>>>
> >>>>>                                          Gianluca
> >>>>>
> >>>>> --
> >>>>> Aron Broom M.Sc
> >>>>> PhD Student
> >>>>> Department of Chemistry
> >>>>> University of Waterloo
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>> -----------------------------------------------------
> >>>> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> >>>> +1 (206) 685 4435
> >>>> http://artemide.bioeng.washington.edu/
> >>>>
> >>>> Research Scientist at the Department of Bioengineering
> >>>> at the University of Washington, Seattle WA U.S.A.
> >>>> -----------------------------------------------------
> >>>
> >>>
> >>
> >> -----------------------------------------------------
> >> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> >> +1 (206) 685 4435
> >> http://artemide.bioeng.washington.edu/
> >>
> >> Research Scientist at the Department of Bioengineering
> >> at the University of Washington, Seattle WA U.S.A.
> >> -----------------------------------------------------
> >
> >
>
> -----------------------------------------------------
> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> +1 (206) 685 4435
> http://artemide.bioeng.washington.edu/
>
> Research Scientist at the Department of Bioengineering
> at the University of Washington, Seattle WA U.S.A.
> -----------------------------------------------------

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:21:47 CST