From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Jul 13 2012 - 02:09:04 CDT
> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Gianluca Interlandi
> Sent: Friday, July 13, 2012 08:36
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: AW: namd-l: Running NAMD on Forge (CUDA)
> 
> Hi Norman,
Hi,
> 
> > What value do you use for fullElectFrequency?
> 
> The default. I haven't set it.
Ok, then it's 1 I guess. That is bad for GPU simulations because the full
electrostatics (PME) is still done on the CPU. Evaluating it every step
causes a lot of traffic between CPU and GPU and clogs up the PCIe bus. On
top of that, 6 GPUs by themselves already need a lot of PCIe bandwidth, so
it's likely that the GPUs don't perform as well as expected. You should try
setting fullElectFrequency to at least 4 and also try out the molly
parameter. This should cause less traffic on the PCIe bus and improve GPU
utilization. It does slightly harm energy conservation, which shows up as a
slowly increasing temperature, but with molly enabled it should be ok I
think.
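
For illustration, a minimal sketch of the relevant lines in the NAMD
configuration file could look like this (the values are only an example
and not tuned for your system, so check them against the NAMD User's
Guide):

  fullElectFrequency  4        ;# full electrostatics (PME) only every 4 steps
  molly               on       ;# mollified impulse method, for stability with the larger PME interval
  mollyTolerance      0.00001  ;# default value, shown only for completeness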
> 
> > How many GPUs are there per node in this cluster?
> 
> 6
> 
> > What kind of interconnect?
> 
> Infiniband.
If you are running over multiple nodes, please make sure that you actually
use the InfiniBand interconnect. For that you need either an ibverbs binary
of NAMD, or IPoIB must be installed. You can see whether IPoIB is available
by checking for an ib0 interface, for example with ifconfig. Also, as I
have observed, IPoIB should be configured in connected mode with an MTU of
about 65520 (cat /sys/class/net/ib0/mode or .../mtu to see the current
settings).
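
For example, the checks on one of the compute nodes would be (assuming the
interface is really called ib0):

  ifconfig ib0                  # does an IPoIB interface exist at all?
  cat /sys/class/net/ib0/mode   # should print "connected"
  cat /sys/class/net/ib0/mtu    # around 65520 in connected mode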
> 
> Here are all specs:
> 
> http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/DellNVIDIACluster/TechSummary/index.html
> 
> Thanks,
> 
>       Gianluca
> 
> > Norman Geist.
> >
> >> -----Original Message-----
> >> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> >> Behalf Of Gianluca Interlandi
> >> Sent: Friday, July 13, 2012 00:26
> >> To: Aron Broom
> >> Cc: NAMD list
> >> Subject: Re: namd-l: Running NAMD on Forge (CUDA)
> >>
> >> Yes, I was totally surprised, too. I also ran a non-CUDA job on Forge
> >> using 16 CPUs. I got 0.122076 s/step, which is 16% slower than using
> >> the 6 GPUs (0.1 s/step) and a bit slower than 0.10932 s/step that I get
> >> on Trestles using 16 cores. This difference might be statistical
> >> fluctuations though (or configuration setup) since Forge and Trestles
> >> have the exact same CPU, i.e., eight-core 2.4 GHz Magny-Cours.
> >>
> >> Yes, Forge also uses NVIDIA M2070.
> >> I keep thinking of this guy here in Seattle who works for NVIDIA
> >> downtown and a few years ago he asked me: "How come you don't use
> >> CUDA?" Maybe the code still needs some optimization, and CPU
> >> manufacturers have been doing everything to catch up.
> >>
> >> Gianluca
> >>
> >> On Thu, 12 Jul 2012, Aron Broom wrote:
> >>
> >>> So your speed for 1 or 2 GPUs (based on what you sent) is about 1.7
> >>> ns/day, which seems decent given the system size.  I was getting 2.0
> >>> and 2.6 ns/day for a 100k atom system with roughly those same
> >>> parameters (and also 6 CPU cores), so given a scaling of ~nlogn, I
> >>> would expect to see ~1.5 to 2.0 ns/day for you.  So in my mind, the
> >>> speed you are getting with the GPUs isn't so surprising, it's that
> >>> you get such a good speed with only the CPUs that shocks me.  In my
> >>> case I didn't see speeds matching my 1 GPU until 48 CPU cores alone.
> >>> Seems like those Magny Cours are pretty awesome.
> >>>
> >>> Which GPUs are you using?  I was using mainly the M2070s.
> >>>
> >>> Also, one thing that might be useful, if you are able to get roughly
> >>> the same speed with 6 cores and 2 GPUs as you get with 16 cores alone,
> >>> is to test running 3 jobs at once, with 5 cores and 2 GPUs assigned to
> >>> each, and see how much slowdown there is.  You might be able to
> >>> benefit from various replica techniques more than just hitting a
> >>> single job with more power.
> >>>
> >>> Still, the overall conclusion from what you've got seems to be that
> >>> it makes more sense to go with more of those CPUs rather than putting
> >>> GPUs in there.
> >>>
> >>> ~Aron
> >>>
> >>> On Thu, Jul 12, 2012 at 4:58 PM, Gianluca Interlandi
> >>> <gianluca_at_u.washington.edu> wrote:
> >>>             What are your simulation parameters:
> >>>
> >>>             timestep (and also any multistepping values)
> >>>
> >>> 2 fs, SHAKE, no multistepping
> >>>
> >>>       cutoff (and also the pairlist and PME grid spacing)
> >>>
> >>> 8-10-12  PME grid spacing ~ 1 A
> >>>
> >>>       Have you tried giving it just 1 or 2 GPUs alone (using the
> >>>       +devices)?
> >>>
> >>>
> >>> Yes, this is the benchmark time:
> >>>
> >>> np 1:  0.48615 s/step
> >>> np 2:  0.26105 s/step
> >>> np 4:  0.14542 s/step
> >>> np 6:  0.10167 s/step
> >>>
> >>> I post here also part of the log running on 6 devices (in case it is
> >>> helpful to localize the problem):
> >>>
> >>> Pe 4 has 57 local and 64 remote patches and 1066 local and 473 remote computes.
> >>> Pe 1 has 57 local and 65 remote patches and 1057 local and 482 remote computes.
> >>> Pe 5 has 57 local and 56 remote patches and 1150 local and 389 remote computes.
> >>> Pe 2 has 57 local and 57 remote patches and 1052 local and 487 remote computes.
> >>> Pe 3 has 58 local and 57 remote patches and 1079 local and 487 remote computes.
> >>> Pe 0 has 57 local and 57 remote patches and 1144 local and 395 remote computes.
> >>>
> >>> Gianluca
> >>>
> >>>             Gianluca
> >>>
> >>>             On Thu, 12 Jul 2012, Aron Broom wrote:
> >>>
> >>>                   have you tried the multicore build?  I wonder if
> >>>                   the prebuilt smp one is just not working for you.
> >>>
> >>>                   On Thu, Jul 12, 2012 at 3:21 PM, Gianluca Interlandi
> >>>                   <gianluca_at_u.washington.edu> wrote:
> >>>                               are other people also using those GPUs?
> >>>
> >>>                   I don't think so since I reserved the entire node.
> >>>
> >>>                         What are the benchmark timings that you are
> >>>                         given after ~1000 steps?
> >>>
> >>>                   The benchmark time with 6 processes is 101 sec for
> >>>                   1000 steps. This is only slightly faster than
> >>>                   Trestles where I get 109 sec for 1000 steps running
> >>>                   on 16 CPUs. So, yes 6 GPUs on Forge are much faster
> >>>                   than 6 cores on Trestles, but in terms of SUs it
> >>>                   makes no difference, since on Forge I still have to
> >>>                   reserve the entire node (16 cores).
> >>>
> >>>                   Gianluca
> >>>
> >>>                         is some setup time.
> >>>
> >>>                         I often run a system of ~100,000 atoms, and I
> >>>                         generally see an order of magnitude
> >>>                         improvement in speed compared to the same
> >>>                         number of cores without the GPUs.  I would
> >>>                         test the non-CUDA precompiled code on your
> >>>                         Forge system and see how that compares, it
> >>>                         might be the fault of something other than
> >>>                         CUDA.
> >>>
> >>>                         ~Aron
> >>>
> >>>                         On Thu, Jul 12, 2012 at 2:41 PM, Gianluca
> >>>                         Interlandi <gianluca_at_u.washington.edu>
> >>>                         wrote:
> >>>                               Hi Aron,
> >>>
> >>>                               Thanks for the explanations. I don't
> >>>                               know whether I'm doing everything
> >>>                               right. I don't see any speed advantage
> >>>                               running on the CUDA cluster (Forge)
> >>>                               versus running on a non-CUDA cluster.
> >>>
> >>>                               I did the following benchmarks on Forge
> >>>                               (the system has 127,000 atoms and ran
> >>>                               for 1000 steps):
> >>>
> >>>                               np 1:  506 sec
> >>>                               np 2:  281 sec
> >>>                               np 4:  163 sec
> >>>                               np 6:  136 sec
> >>>                               np 12: 218 sec
> >>>
> >>>                               On the other hand, running the same
> >>>                               system on 16 cores of Trestles (AMD
> >>>                               Magny Cours) takes 129 sec. It seems
> >>>                               that I'm not really making good use of
> >>>                               SUs by running on the CUDA cluster. Or,
> >>>                               maybe I'm doing something wrong? I'm
> >>>                               using the ibverbs-smp-CUDA pre-compiled
> >>>                               version of NAMD 2.9.
> >>>
> >>>                               Thanks,
> >>>
> >>>                                    Gianluca
> >>>
> >>>                               On Tue, 10 Jul 2012, Aron Broom wrote:
> >>>
> >>>                                     if it is truly just one node, you
> >>>                                     can use the multicore-CUDA
> >>>                                     version and avoid the MPI
> >>>                                     charmrun stuff.  Still, it boils
> >>>                                     down to much the same thing I
> >>>                                     think.  If you do what you've
> >>>                                     done below, you are running one
> >>>                                     job with 12 CPU cores and all
> >>>                                     GPUs.  If you don't specify the
> >>>                                     +devices, NAMD will automatically
> >>>                                     find the available GPUs, so I
> >>>                                     think the main benefit of
> >>>                                     specifying them is when you are
> >>>                                     running more than one job and
> >>>                                     don't want the jobs sharing GPUs.
> >>>
> >>>                                     I'm not sure you'll see great
> >>>                                     scaling across 6 GPUs for a
> >>>                                     single job, but that would be
> >>>                                     great if you did.
> >>>
> >>>                                     ~Aron
> >>>
> >>>                                     On Tue, Jul 10, 2012 at 1:14 PM,
> >>>                                     Gianluca Interlandi
> >>>                                     <gianluca_at_u.washington.edu>
> >>>                                     wrote:
> >>>                                           Hi,
> >>>
> >>>                                           I have a question
> >>>                                           concerning running NAMD on
> >>>                                           a CUDA cluster.
> >>>
> >>>                                           NCSA Forge has for example
> >>>                                           6 CUDA devices and 16 CPU
> >>>                                           cores per node. If I want
> >>>                                           to use all 6 CUDA devices
> >>>                                           in a node, how many
> >>>                                           processes is it
> >>>                                           recommended to spawn? Do I
> >>>                                           need to specify "+devices"?
> >>>
> >>>                                           So, if for example I want
> >>>                                           to spawn 12 processes, do I
> >>>                                           need to specify:
> >>>
> >>>                                           charmrun +p12 -machinefile
> >>>                                           $PBS_NODEFILE +devices
> >>>                                           0,1,2,3,4,5 namd2 +idlepoll
> >>>
> >>>                                           Thanks,
> >>>
> >>>                                                Gianluca
> >>>
> >>>
> >>>
> >>>       -----------------------------------------------------
> >>>                                           Gianluca Interlandi, PhD
> >>>                         gianluca_at_u.washington.edu
> >>>                                                               +1
> >> (206)
> >>>       685 4435
> >>>
> >>>
> >>>       http://artemide.bioeng.washington.edu/
> >>>
> >>>                                           Research Scientist at the
> >>>       Department
> >>>                   of
> >>>                         Bioengineering
> >>>                                           at the University of
> >>>       Washington,
> >>>                   Seattle WA
> >>>                         U.S.A.
> >>>
> >>>
> >>>       -----------------------------------------------------
> >>>
> >>>
> >>>
> >>>
> >>>                                     --
> >>>                                     Aron Broom M.Sc
> >>>                                     PhD Student
> >>>                                     Department of Chemistry
> >>>                                     University of Waterloo
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>                   -------------------------------------------------
> --
> >> --
> >>>                               Gianluca Interlandi, PhD
> >>>                   gianluca_at_u.washington.edu
> >>>                                                   +1 (206) 685 4435
> >>>
> >>>                   http://artemide.bioeng.washington.edu/
> >>>
> >>>                               Research Scientist at the Department
> of
> >>>                   Bioengineering
> >>>                               at the University of Washington,
> >> Seattle WA
> >>>                   U.S.A.
> >>>
> >>>                   -------------------------------------------------
> --
> >> --
> >>>
> >>>
> >>>
> >>>
> >>>                         --
> >>>                         Aron Broom M.Sc
> >>>                         PhD Student
> >>>                         Department of Chemistry
> >>>                         University of Waterloo
> >>>
> >>>
> >>>
> >>>
> >>>                   -------------------------------------------------
> --
> >> --
> >>>                   Gianluca Interlandi, PhD
> gianluca_at_u.washington.edu
> >>>                                       +1 (206) 685 4435
> >>>
> >>>       http://artemide.bioeng.washington.edu/
> >>>
> >>>                   Research Scientist at the Department of
> >> Bioengineering
> >>>                   at the University of Washington, Seattle WA
> U.S.A.
> >>>                   -------------------------------------------------
> --
> >> --
> >>>
> >>>
> >>>
> >>>
> >>>                   --
> >>>                   Aron Broom M.Sc
> >>>                   PhD Student
> >>>                   Department of Chemistry
> >>>                   University of Waterloo
> >>>
> >>>
> >>>
> >>>
> >>>             -----------------------------------------------------
> >>>             Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> >>>                                 +1 (206) 685 4435
> >>>
> >> http://artemide.bioeng.washington.edu/
> >>>
> >>>             Research Scientist at the Department of Bioengineering
> >>>             at the University of Washington, Seattle WA U.S.A.
> >>>             -----------------------------------------------------
> >>>
> >>>
> >>>
> >>>
> >>>       --
> >>>       Aron Broom M.Sc
> >>>       PhD Student
> >>>       Department of Chemistry
> >>>       University of Waterloo
> >>>
> >>>
> >>>
> >>>
> >>> -----------------------------------------------------
> >>> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> >>>                     +1 (206) 685 4435
> >>>                     http://artemide.bioeng.washington.edu/
> >>>
> >>> Research Scientist at the Department of Bioengineering
> >>> at the University of Washington, Seattle WA U.S.A.
> >>> -----------------------------------------------------
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Aron Broom M.Sc
> >>> PhD Student
> >>> Department of Chemistry
> >>> University of Waterloo
> >>>
> >>>
> >>>
> >>
> >> -----------------------------------------------------
> >> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> >>                      +1 (206) 685 4435
> >>                      http://artemide.bioeng.washington.edu/
> >>
> >> Research Scientist at the Department of Bioengineering
> >> at the University of Washington, Seattle WA U.S.A.
> >> -----------------------------------------------------
> >
> >
> 
> -----------------------------------------------------
> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
>                      +1 (206) 685 4435
>                      http://artemide.bioeng.washington.edu/
> 
> Research Scientist at the Department of Bioengineering
> at the University of Washington, Seattle WA U.S.A.
> -----------------------------------------------------