Re: Re: Running NAMD on Forge (CUDA)

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Jul 13 2012 - 02:09:04 CDT

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Gianluca Interlandi
> Sent: Friday, July 13, 2012 08:36
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: Re: namd-l: Running NAMD on Forge (CUDA)
>
> Hi Norman,

Hi,

>
> > What value do you use for fullElectFrequency?
>
> The default. I haven't set it.

Ok, then it defaults to 1. That is bad for GPU simulations, because the full
electrostatics (PME) is computed on the CPU. Evaluating it every step causes a
lot of traffic between CPU and GPU and congests the PCI-E bus. On top of that,
I would expect 6 GPUs to need a lot of PCI-E bandwidth by themselves, so it is
likely that the GPUs cannot perform as expected. You should try setting
fullElectFrequency to at least 4 and also try the molly parameter. This
reduces the PCI-E traffic and improves GPU utilization, at the cost of
slightly worse energy conservation, which shows up as a slowly increasing
temperature. With the molly parameter enabled it should be fine, I think.
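
Just to illustrate, a minimal sketch of how this could look in the NAMD
configuration file (these are the standard NAMD multiple timestepping
keywords; the exact values are only a suggested starting point, not
something I have tested on Forge):

   # multiple timestepping: evaluate the full (PME) electrostatics
   # only every 4th step instead of every step
   timestep            2.0
   nonbondedFreq       1
   fullElectFrequency  4
   # mollified impulse method (MOLLY) to keep the longer
   # full electrostatics interval stable
   molly               on
   mollyTolerance      0.00001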

>
> > How many GPUs are there per node in this cluster?
>
> 6
>
> > What kind of interconnect?
>
> Infiniband.

If you are running over multiple nodes, please make sure that you actually
use the InfiniBand interconnect. For that you either need an ibverbs build of
NAMD, or IPoIB must be installed. You can check whether IPoIB is working by
looking for an ib0 interface, for example in the output of ifconfig. Also,
from what I have observed, IPoIB should be configured in connected mode with
an MTU of about 65520 (cat /sys/class/net/ib0/mode or .../mtu shows the
current settings).
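
In case it helps, this is roughly how checking (and, with root rights,
changing) these settings would look on a node, assuming the IPoIB interface
is really called ib0:

   # is there an IPoIB interface at all, and what MTU does it use?
   ifconfig ib0
   # current transport mode ("datagram" or "connected") and MTU
   cat /sys/class/net/ib0/mode
   cat /sys/class/net/ib0/mtu
   # as root: switch to connected mode and raise the MTU
   echo connected > /sys/class/net/ib0/mode
   ifconfig ib0 mtu 65520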

>
> Here are all specs:
>
> http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/DellNVIDIACluster/TechSummary/index.html
>
> Thanks,
>
> Gianluca
>
> > Norman Geist.
> >
> >> -----Original Message-----
> >> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> >> Behalf Of Gianluca Interlandi
> >> Sent: Friday, July 13, 2012 00:26
> >> To: Aron Broom
> >> Cc: NAMD list
> >> Subject: Re: namd-l: Running NAMD on Forge (CUDA)
> >>
> >> Yes, I was totally surprised, too. I also ran a non-CUDA job on Forge
> >> using 16 CPUs. I got 0.122076 s/step, which is 16% slower than using
> >> the 6 GPUs (0.1 s/step) and a bit slower than 0.10932 s/step that I
> >> get on Trestles using 16 cores. This difference might be statistical
> >> fluctuations though (or configuration setup) since Forge and Trestles
> >> have the exact same CPU, i.e., eight-core 2.4 GHz Magny-Cours.
> >>
> >> Yes, Forge also uses NVIDIA M2070.
> >>
> >> I keep thinking of this guy here in Seattle who works for NVIDIA
> >> downtown and a few years ago he asked me: "How come you don't use
> >> CUDA?" Maybe the code still needs some optimization, and CPU
> >> manufacturers have been doing everything to catch up.
> >>
> >> Gianluca
> >>
> >> On Thu, 12 Jul 2012, Aron Broom wrote:
> >>
> >>> So your speed for 1 or 2 GPUs (based on what you sent) is about 1.7
> >>> ns/day, which seems decent given the system size.  I was getting 2.0
> >>> and 2.6 ns/day for a 100k atom system with roughly those same
> >>> parameters (and also 6 cpu cores), so given a scaling of ~nlogn, I
> >>> would expect to see ~1.5 to 2.0 ns/day for you.  So in my mind, the
> >>> speed you are getting with the GPUs isn't so surprising, it's that
> >>> you get such a good speed with only the CPUs that shocks me.  In my
> >>> case I didn't see speeds matching my 1 GPU until 48 CPU cores alone.
> >>> Seems like those Magny Cours are pretty awesome.
> >>>
> >>> Which GPUs are you using?  I was using mainly the M2070s.
> >>>
> >>> Also, one thing that might be useful, if you are able to get roughly
> >>> the same speed with 6 cores and 2 GPUs as you get with 16 cores
> >>> alone, is to test running 3 jobs at once, with 5 cores and 2 GPUs
> >>> assigned to each, and see how much slowdown there is.  You might be
> >>> able to benefit from various replica techniques more than just
> >>> hitting a single job with more power.
> >>>
> >>> Still, the overall conclusion from what you've got seems to be that
> >>> it makes more sense to go with more of those CPUs rather than
> >>> putting GPUs in there.
> >>>
> >>> ~Aron
> >>>
> >>> On Thu, Jul 12, 2012 at 4:58 PM, Gianluca Interlandi
> >>> <gianluca_at_u.washington.edu> wrote:
> >>> What are your simulation parameters:
> >>>
> >>> timestep (and also any multistepping values)
> >>>
> >>> 2 fs, SHAKE, no multistepping
> >>>
> >>> cutoff (and also the pairlist and PME grid spacing)
> >>>
> >>> 8-10-12  PME grid spacing ~ 1 A
> >>>
> >>> Have you tried giving it just 1 or 2 GPUs alone (using the
> >>> +devices)?
> >>>
> >>>
> >>> Yes, this is the benchmark time:
> >>>
> >>> np 1:  0.48615 s/step
> >>> np 2:  0.26105 s/step
> >>> np 4:  0.14542 s/step
> >>> np 6:  0.10167 s/step
> >>>
> >>> I post here also part of the log running on 6 devices (in case it is
> >>> helpful to localize the problem):
> >>>
> >>> Pe 4 has 57 local and 64 remote patches and 1066 local and 473 remote computes.
> >>> Pe 1 has 57 local and 65 remote patches and 1057 local and 482 remote computes.
> >>> Pe 5 has 57 local and 56 remote patches and 1150 local and 389 remote computes.
> >>> Pe 2 has 57 local and 57 remote patches and 1052 local and 487 remote computes.
> >>> Pe 3 has 58 local and 57 remote patches and 1079 local and 487 remote computes.
> >>> Pe 0 has 57 local and 57 remote patches and 1144 local and 395 remote computes.
> >>>
> >>> Gianluca
> >>>
> >>>       Gianluca
> >>>
> >>>       On Thu, 12 Jul 2012, Aron Broom wrote:
> >>>
> >>>             have you tried the multicore build?  I wonder if the
> >>>             prebuilt smp one is just not working for you.
> >>>
> >>>             On Thu, Jul 12, 2012 at 3:21 PM, Gianluca Interlandi
> >>>             <gianluca_at_u.washington.edu> wrote:
> >>>                         are other people also using those GPUs?
> >>>
> >>>             I don't think so since I reserved the entire node.
> >>>
> >>>                   What are the benchmark timings that you are given
> >>>                   after ~1000 steps?
> >>>
> >>>             The benchmark time with 6 processes is 101 sec for 1000
> >>>             steps. This is only slightly faster than Trestles where
> >>>             I get 109 sec for 1000 steps running on 16 CPUs. So, yes
> >>>             6 GPUs on Forge are much faster than 6 cores on
> >>>             Trestles, but in terms of SUs it makes no difference,
> >>>             since on Forge I still have to reserve the entire node
> >>>             (16 cores).
> >>>
> >>>             Gianluca
> >>>
> >>>                   is some setup time.
> >>>
> >>>                   I often run a system of ~100,000 atoms, and I
> >>>                   generally see an order of magnitude improvement in
> >>>                   speed compared to the same number of cores without
> >>>                   the GPUs.  I would test the non-CUDA precompiled
> >>>                   code on your Forge system and see how that
> >>>                   compares, it might be the fault of something other
> >>>                   than CUDA.
> >>>
> >>>                   ~Aron
> >>>
> >>>                   On Thu, Jul 12, 2012 at 2:41 PM, Gianluca
> >>>                   Interlandi <gianluca_at_u.washington.edu> wrote:
> >>>                         Hi Aron,
> >>>
> >>>                         Thanks for the explanations. I don't know
> >>>                         whether I'm doing everything right. I don't
> >>>                         see any speed advantage running on the CUDA
> >>>                         cluster (Forge) versus running on a non-CUDA
> >>>                         cluster.
> >>>
> >>>                         I did the following benchmarks on Forge (the
> >>>                         system has 127,000 atoms and ran for 1000
> >>>                         steps):
> >>>
> >>>                         np 1:  506 sec
> >>>                         np 2:  281 sec
> >>>                         np 4:  163 sec
> >>>                         np 6:  136 sec
> >>>                         np 12: 218 sec
> >>>
> >>>                         On the other hand, running the same system
> >>>                         on 16 cores of Trestles (AMD Magny Cours)
> >>>                         takes 129 sec. It seems that I'm not really
> >>>                         making good use of SUs by running on the
> >>>                         CUDA cluster. Or, maybe I'm doing something
> >>>                         wrong? I'm using the ibverbs-smp-CUDA
> >>>                         pre-compiled version of NAMD 2.9.
> >>>
> >>>                         Thanks,
> >>>
> >>>                              Gianluca
> >>>
> >>>                         On Tue, 10 Jul 2012, Aron Broom wrote:
> >>>
> >>>                               if it is truly just one node, you can
> >>>                               use the multicore-CUDA version and
> >>>                               avoid the MPI charmrun stuff.  Still,
> >>>                               it boils down to much the same thing I
> >>>                               think.  If you do what you've done
> >>>                               below, you are running one job with 12
> >>>                               CPU cores and all GPUs.  If you don't
> >>>                               specify the +devices, NAMD will
> >>>                               automatically find the available GPUs,
> >>>                               so I think the main benefit of
> >>>                               specifying them is when you are
> >>>                               running more than one job and don't
> >>>                               want the jobs sharing GPUs.
> >>>
> >>>                               I'm not sure you'll see great scaling
> >>>                               across 6 GPUs for a single job, but
> >>>                               that would be great if you did.
> >>>
> >>>                               ~Aron
> >>>
> >>>                               On Tue, Jul 10, 2012 at 1:14 PM,
> >>>                               Gianluca Interlandi
> >>>                               <gianluca_at_u.washington.edu> wrote:
> >>>                                     Hi,
> >>>
> >>>                                     I have a question concerning
> >>>                                     running NAMD on a CUDA cluster.
> >>>
> >>>                                     NCSA Forge has for example 6
> >>>                                     CUDA devices and 16 CPU cores
> >>>                                     per node. If I want to use all
> >>>                                     6 CUDA devices in a node, how
> >>>                                     many processes is it recommended
> >>>                                     to spawn? Do I need to specify
> >>>                                     "+devices"?
> >>>
> >>>                                     So, if for example I want to
> >>>                                     spawn 12 processes, do I need to
> >>>                                     specify:
> >>>
> >>>                                     charmrun +p12 -machinefile
> >>>                                     $PBS_NODEFILE +devices
> >>>                                     0,1,2,3,4,5 namd2 +idlepoll
> >>>
> >>>                                     Thanks,
> >>>
> >>>                                          Gianluca
> >>>
> >>> --
> >>> Aron Broom M.Sc
> >>> PhD Student
> >>> Department of Chemistry
> >>> University of Waterloo
> >>>
> >>>
> >>>
> >>
> >> -----------------------------------------------------
> >> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> >> +1 (206) 685 4435
> >> http://artemide.bioeng.washington.edu/
> >>
> >> Research Scientist at the Department of Bioengineering
> >> at the University of Washington, Seattle WA U.S.A.
> >> -----------------------------------------------------
> >
> >
>
> -----------------------------------------------------
> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> +1 (206) 685 4435
> http://artemide.bioeng.washington.edu/
>
> Research Scientist at the Department of Bioengineering
> at the University of Washington, Seattle WA U.S.A.
> -----------------------------------------------------

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:21:47 CST