Re: Running NAMD on Forge (CUDA)

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Jul 13 2012 - 01:28:49 CDT

Hi!

What value do you use for fullElectFrequency?
How many GPUs are there per node in this cluster?
What kind of interconnect?
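
For reference, I mean the multiple-timestepping settings in the config
file, along these lines (the values here are only a common example, not
a recommendation):

   timestep            2.0   ;# fs
   nonbondedFreq       1     ;# short-range nonbonded forces every step
   fullElectFrequency  2     ;# full (PME) electrostatics every 2nd step

This matters for the CPU/GPU balance, since PME is still evaluated on
the CPU in the CUDA build.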

Norman Geist.

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
> On Behalf Of Gianluca Interlandi
> Sent: Friday, July 13, 2012, 00:26
> To: Aron Broom
> Cc: NAMD list
> Subject: Re: namd-l: Running NAMD on Forge (CUDA)
>
> Yes, I was totally surprised, too. I also ran a non-CUDA job on Forge
> using 16 CPUs. I got 0.122076 s/step, which is 16% slower than using the
> 6 GPUs (0.1 s/step) and a bit slower than the 0.10932 s/step that I get
> on Trestles using 16 cores. This difference might be statistical
> fluctuations though (or configuration setup), since Forge and Trestles
> have the exact same CPU, i.e., the eight-core 2.4 GHz Magny-Cours.
>
> Yes, Forge also uses NVIDIA M2070.
>
> I keep thinking of this guy here in Seattle who works for NVIDIA
> downtown; a few years ago he asked me: "How come you don't use CUDA?"
> Maybe the code still needs some optimization, and CPU manufacturers
> have been doing everything they can to catch up.
>
> Gianluca
>
> On Thu, 12 Jul 2012, Aron Broom wrote:
>
> > So your speed for 1 or 2 GPUs (based on what you sent) is about 1.7
> > ns/day, which seems decent given the system size.  I was getting 2.0
> > and 2.6 ns/day for a 100k atom system with roughly those same
> > parameters (and also 6 CPU cores), so given a scaling of ~n log n, I
> > would expect to see ~1.5 to 2.0 ns/day for you.  So in my mind, the
> > speed you are getting with the GPUs isn't so surprising; it's that you
> > get such a good speed with only the CPUs that shocks me.  In my case I
> > didn't see speeds matching my 1 GPU until 48 CPU cores alone.  Seems
> > like those Magny-Cours are pretty awesome.
> >
> > Which GPUs are you using?  I was using mainly the M2070s.
> >
> > Also, one thing that might be useful, if you are able to get roughly
> > the same speed with 6 cores and 2 GPUs as you get with 16 cores alone,
> > is to test running 3 jobs at once, with 5 cores and 2 GPUs assigned to
> > each, and see how much slowdown there is. You might be able to benefit
> > from various replica techniques more than just hitting a single job
> > with more power.
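> >
> > As a sketch (assuming the GPUs are numbered 0-5 and the config file
> > names are just placeholders), the three jobs could be pinned to
> > disjoint devices like this:
> >
> >    charmrun +p5 namd2 +idlepoll +devices 0,1 job1.namd
> >    charmrun +p5 namd2 +idlepoll +devices 2,3 job2.namd
> >    charmrun +p5 namd2 +idlepoll +devices 4,5 job3.namd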
> >
> > Still, the overall conclusion from what you've got seems to be that it
> > makes more sense to go with more of those CPUs rather than putting
> > GPUs in there.
> >
> > ~Aron
> >
> > On Thu, Jul 12, 2012 at 4:58 PM, Gianluca Interlandi
> > <gianluca_at_u.washington.edu> wrote:
> > What are your simulation parameters:
> >
> > timestep (and also any multistepping values)
> >
> > 2 fs, SHAKE, no multistepping
> >
> > cutoff (and also the pairlist and PME grid spacing)
> >
> > 8-10-12  PME grid spacing ~ 1 A
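> >
> > In config-file terms that corresponds roughly to the lines below
> > (assuming "8-10-12" means switchdist/cutoff/pairlistdist):
> >
> >    timestep        2.0
> >    rigidBonds      all      ;# SHAKE/SETTLE on bonds to hydrogen
> >    switching       on
> >    switchdist      8.0
> >    cutoff          10.0
> >    pairlistdist    12.0
> >    PME             yes
> >    PMEGridSpacing  1.0      ;# ~1 A grid spacing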
> >
> > Have you tried giving it just 1 or 2 GPUs alone (using the
> > +devices)?
> >
> >
> > Yes, this is the benchmark time:
> >
> > np 1:  0.48615 s/step
> > np 2:  0.26105 s/step
> > np 4:  0.14542 s/step
> > np 6:  0.10167 s/step
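> >
> > (For scale: at a 2 fs timestep, 0.10167 s/step works out to about
> > 86400 / 0.10167 ~ 850,000 steps per day, i.e. roughly 1.7 ns/day.)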
> >
> > I post here also part of the log running on 6 devices (in case it is
> > helpful to localize the problem):
> >
> > Pe 4 has 57 local and 64 remote patches and 1066 local and 473 remote computes.
> > Pe 1 has 57 local and 65 remote patches and 1057 local and 482 remote computes.
> > Pe 5 has 57 local and 56 remote patches and 1150 local and 389 remote computes.
> > Pe 2 has 57 local and 57 remote patches and 1052 local and 487 remote computes.
> > Pe 3 has 58 local and 57 remote patches and 1079 local and 487 remote computes.
> > Pe 0 has 57 local and 57 remote patches and 1144 local and 395 remote computes.
> >
> > Gianluca
> >
> >       Gianluca
> >
> >       On Thu, 12 Jul 2012, Aron Broom wrote:
> >
> >             have you tried the multicore build?  I wonder if the
> >             prebuilt smp one is just not working for you.
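> >
> >             With the multicore-CUDA build there is no charmrun step;
> >             on a single node it is launched along the lines of (the
> >             config file name is just a placeholder):
> >
> >                namd2 +p6 +idlepoll myrun.namd
> >
> >             (+devices can still be added to restrict it to particular
> >             GPUs.)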
> >
> >             On Thu, Jul 12, 2012 at 3:21 PM, Gianluca Interlandi
> >             <gianluca_at_u.washington.edu> wrote:
> >                         are other people also using those GPUs?
> >
> >
> >             I don't think so since I reserved the entire node.
> >
> >                   What are the benchmark timings that you are given
> >                   after ~1000 steps?
> >
> >             The benchmark time with 6 processes is 101 sec for 1000
> >             steps. This is only slightly faster than Trestles, where
> >             I get 109 sec for 1000 steps running on 16 CPUs. So, yes,
> >             6 GPUs on Forge are much faster than 6 cores on Trestles,
> >             but in terms of SUs it makes no difference, since on Forge
> >             I still have to reserve the entire node (16 cores).
> >
> >             Gianluca
> >
> >                   is some setup time.
> >
> >                   I often run a system of ~100,000 atoms, and I
> >                   generally see an order of magnitude improvement in
> >                   speed compared to the same number of cores without
> >                   the GPUs.  I would test the non-CUDA precompiled
> >                   code on your Forge system and see how that compares;
> >                   it might be the fault of something other than CUDA.
> >
> >                   ~Aron
> >
> >                   On Thu, Jul 12, 2012 at 2:41 PM, Gianluca Interlandi
> >                   <gianluca_at_u.washington.edu> wrote:
> >                         Hi Aron,
> >
> >                         Thanks for the explanations. I don't know
> >                         whether I'm doing everything right. I don't
> >                         see any speed advantage running on the CUDA
> >                         cluster (Forge) versus running on a non-CUDA
> >                         cluster.
> >
> >                         I did the following benchmarks on Forge (the
> >                         system has 127,000 atoms and ran for 1000
> >                         steps):
> >
> >                         np 1:  506 sec
> >                         np 2:  281 sec
> >                         np 4:  163 sec
> >                         np 6:  136 sec
> >                         np 12: 218 sec
> >
> >                         On the other hand, running the same system
> >                         on 16 cores of Trestles (AMD Magny-Cours)
> >                         takes 129 sec. It seems that I'm not really
> >                         making good use of SUs by running on the CUDA
> >                         cluster. Or maybe I'm doing something wrong?
> >                         I'm using the ibverbs-smp-CUDA pre-compiled
> >                         version of NAMD 2.9.
> >
> >                         Thanks,
> >
> >                              Gianluca
> >
> >                         On Tue, 10 Jul 2012, Aron Broom wrote:
> >
> >                               if it is truly just one node, you can
> >                               use the multicore-CUDA version and avoid
> >                               the MPI charmrun stuff.  Still, it boils
> >                               down to much the same thing I think.  If
> >                               you do what you've done below, you are
> >                               running one job with 12 CPU cores and
> >                               all GPUs.  If you don't specify the
> >                               +devices, NAMD will automatically find
> >                               the available GPUs, so I think the main
> >                               benefit of specifying them is when you
> >                               are running more than one job and don't
> >                               want the jobs sharing GPUs.
> >
> >                               I'm not sure you'll see great scaling
> >                               across 6 GPUs for a single job, but that
> >                               would be great if you did.
> >
> >                               ~Aron
> >
> >                               On Tue, Jul 10, 2012 at 1:14 PM, Gianluca
> >                               Interlandi <gianluca_at_u.washington.edu>
> >                               wrote:
> >                                     Hi,
> >
> >                                     I have a question concerning
> >                                     running NAMD on a CUDA cluster.
> >
> >                                     NCSA Forge has for example 6 CUDA
> >                                     devices and 16 CPU cores per node.
> >                                     If I want to use all 6 CUDA devices
> >                                     in a node, how many processes is it
> >                                     recommended to spawn? Do I need to
> >                                     specify "+devices"?
> >
> >                                     So, if for example I want to spawn
> >                                     12 processes, do I need to specify:
> >
> >                                     charmrun +p12 -machinefile $PBS_NODEFILE +devices 0,1,2,3,4,5 namd2 +idlepoll
> >
> >                                     Thanks,
> >
> >                                          Gianluca
> >
> >
> > -----------------------------------------------------
> > Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> >                     +1 (206) 685 4435
> >                     http://artemide.bioeng.washington.edu/
> >
> > Research Scientist at the Department of Bioengineering
> > at the University of Washington, Seattle WA U.S.A.
> > -----------------------------------------------------
> >
> >
> >
> >
> > --
> > Aron Broom M.Sc
> > PhD Student
> > Department of Chemistry
> > University of Waterloo
> >
> >
> >
>
> -----------------------------------------------------
> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> +1 (206) 685 4435
> http://artemide.bioeng.washington.edu/
>
> Research Scientist at the Department of Bioengineering
> at the University of Washington, Seattle WA U.S.A.
> -----------------------------------------------------
