From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Jul 13 2012 - 01:28:49 CDT
Hi!
What value do you use for fullElectFrequency?
How many GPUs are there per node in this cluster?
What kind of interconnect?
Norman Geist.
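
For context, fullElectFrequency is set in the NAMD configuration file
together with the other multiple-timestepping parameters. A minimal
illustrative snippet (the values shown are placeholders, not a
recommendation for this system):

    timestep            2.0   ;# integration step in fs
    nonbondedFreq       1     ;# short-range nonbonded forces every step
    fullElectFrequency  2     ;# full PME electrostatics every 2 steps
    stepspercycle       20    ;# steps per patch-migration cycle

In the CUDA builds of NAMD 2.9 the PME reciprocal sum still runs on the
CPU, so this setting has a direct effect on how well GPU runs scale.
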
> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
> On Behalf Of Gianluca Interlandi
> Sent: Friday, July 13, 2012 00:26
> To: Aron Broom
> Cc: NAMD list
> Subject: Re: namd-l: Running NAMD on Forge (CUDA)
>
> Yes, I was totally surprised, too. I also ran a non-CUDA job on Forge
> using 16 CPUs. I got 0.122076 s/step, which is 16% slower than using
> the 6 GPUs (0.1 s/step) and a bit slower than 0.10932 s/step that I
> get on Trestles using 16 cores. This difference might be statistical
> fluctuations though (or configuration setup) since Forge and Trestles
> have the exact same CPU, i.e., eight-core 2.4 GHz Magny-Cours.
>
> Yes, Forge also uses NVIDIA M2070.
>
> I keep thinking of this guy here in Seattle who works for NVIDIA
> downtown and a few years ago he asked me: "How come you don't use
> CUDA?" Maybe the code still needs some optimization, and CPU
> manufacturers have been doing everything to catch up.
>
> Gianluca
>
> On Thu, 12 Jul 2012, Aron Broom wrote:
>
> > So your speed for 1 or 2 GPUs (based on what you sent) is about
> > 1.7 ns/day, which seems decent given the system size. I was getting
> > 2.0 and 2.6 ns/day for a 100k atom system with roughly those same
> > parameters (and also 6 CPU cores), so given a scaling of ~nlogn, I
> > would expect to see ~1.5 to 2.0 ns/day for you. So in my mind, the
> > speed you are getting with the GPUs isn't so surprising; it's that
> > you get such a good speed with only the CPUs that shocks me. In my
> > case I didn't see speeds matching my 1 GPU until 48 CPU cores
> > alone. Seems like those Magny Cours are pretty awesome.
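
(For the unit conversion behind these figures: ns/day = 86400 x
timestep[fs] x 1e-6 / (seconds per step), so at a 2 fs timestep a
benchmark of 0.10 s/step works out to roughly 1.7 ns/day.)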
> >
> > Which GPUs are you using? I was using mainly the M2070s.
> >
> > Also, one thing that might be useful, if you are able to get
> > roughly the same speed with 6 cores and 2 GPUs as you get with 16
> > cores alone, is to test running 3 jobs at once, with 5 cores and 2
> > GPUs assigned to each, and see how much slowdown there is. You
> > might be able to benefit from various replica techniques more than
> > just hitting a single job with more power.
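
A sketch of that three-job test on one 16-core, 6-GPU node, assuming
the multicore-CUDA build so that each job can be started directly
without charmrun (the config and log file names are placeholders):

    namd2 +p5 +devices 0,1 +idlepoll job1.namd > job1.log &
    namd2 +p5 +devices 2,3 +idlepoll job2.namd > job2.log &
    namd2 +p5 +devices 4,5 +idlepoll job3.namd > job3.log &
    wait

Giving each job its own +devices list keeps the three jobs from
sharing GPUs.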
> >
> > Still, the overall conclusion from what you've got seems to be that
> > it makes more sense to go with more of those CPUs rather than
> > putting GPUs in there.
> >
> > ~Aron
> >
> > On Thu, Jul 12, 2012 at 4:58 PM, Gianluca Interlandi
> > <gianluca_at_u.washington.edu> wrote:
> > What are your simulation parameters:
> >
> > timestep (and also any multistepping values)
> >
> > 2 fs, SHAKE, no multistepping
> >
> > cutoff (and also the pairlist and PME grid spacing)
> >
> > 8-10-12 (switching/cutoff/pairlist distances), PME grid spacing ~ 1 Å
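
A sketch of those settings as NAMD configuration keywords; the exact
values are inferred from the description above:

    timestep            2.0
    rigidBonds          all    ;# SHAKE on bonds involving hydrogen
    nonbondedFreq       1      ;# no multistepping
    fullElectFrequency  1
    switching           on
    switchdist          8.0
    cutoff              10.0
    pairlistdist        12.0
    PME                 on
    PMEGridSpacing      1.0    ;# ~1 A grid spacing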
> >
> > Have you tried giving it just 1 or 2 GPUs alone (using the
> > +devices)?
> >
> >
> > Yes, this is the benchmark time:
> >
> > np 1: 0.48615 s/step
> > np 2: 0.26105 s/step
> > np 4: 0.14542 s/step
> > np 6: 0.10167 s/step
> >
> > I am also posting part of the log from the run on 6 devices here
> > (in case it helps localize the problem):
> >
> > Pe 4 has 57 local and 64 remote patches and 1066 local and 473 remote computes.
> > Pe 1 has 57 local and 65 remote patches and 1057 local and 482 remote computes.
> > Pe 5 has 57 local and 56 remote patches and 1150 local and 389 remote computes.
> > Pe 2 has 57 local and 57 remote patches and 1052 local and 487 remote computes.
> > Pe 3 has 58 local and 57 remote patches and 1079 local and 487 remote computes.
> > Pe 0 has 57 local and 57 remote patches and 1144 local and 395 remote computes.
> >
> > Gianluca
> >
> > On Thu, 12 Jul 2012, Aron Broom wrote:
> >
> > have you tried the multicore build? I wonder if the prebuilt smp
> > one is just not working for you.
> >
> > On Thu, Jul 12, 2012 at 3:21 PM, Gianluca Interlandi
> > <gianluca_at_u.washington.edu> wrote:
> > are other people also using those GPUs?
> >
> >
> > I don't think so since I reserved the entire node.
> >
> > What are the benchmark timings that you are given after ~1000
> > steps?
> >
> > The benchmark time with 6 processes is 101 sec for 1000 steps. This
> > is only slightly faster than Trestles where I get 109 sec for 1000
> > steps running on 16 CPUs. So, yes 6 GPUs on Forge are much faster
> > than 6 cores on Trestles, but in terms of SUs it makes no
> > difference, since on Forge I still have to reserve the entire node
> > (16 cores).
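
(As a rough check on the SU argument: 1000 steps cost about 16
reserved cores x 101 s = 1,616 core-seconds on Forge versus 16 cores x
109 s = 1,744 core-seconds on Trestles, so the charged cost per 1000
steps is nearly the same on both machines.)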
> >
> > Gianluca
> >
> > is some setup time.
> >
> > I often run a system of ~100,000 atoms, and I generally see an
> > order of magnitude improvement in speed compared to the same number
> > of cores without the GPUs. I would test the non-CUDA precompiled
> > code on your Forge system and see how that compares; it might be
> > the fault of something other than CUDA.
> >
> > ~Aron
> >
> > On Thu, Jul 12, 2012 at 2:41 PM, Gianluca Interlandi
> > <gianluca_at_u.washington.edu> wrote:
> > Hi Aron,
> >
> > Thanks for the explanations. I don't know whether I'm doing
> > everything right. I don't see any speed advantage running on the
> > CUDA cluster (Forge) versus running on a non-CUDA cluster.
> >
> > I did the following benchmarks on Forge (the system has 127,000
> > atoms and ran for 1000 steps):
> >
> > np 1: 506 sec
> > np 2: 281 sec
> > np 4: 163 sec
> > np 6: 136 sec
> > np 12: 218 sec
> >
> > On the other hand, running the same system on 16 cores of Trestles
> > (AMD Magny Cours) takes 129 sec. It seems that I'm not really
> > making good use of SUs by running on the CUDA cluster. Or, maybe
> > I'm doing something wrong? I'm using the ibverbs-smp-CUDA
> > pre-compiled version of NAMD 2.9.
> >
> > Thanks,
> >
> > Gianluca
> >
> > On Tue, 10 Jul 2012, Aron Broom wrote:
> >
> > if it is truly just one node, you can use the multicore-CUDA
> > version and avoid the MPI charmrun stuff. Still, it boils down to
> > much the same thing I think. If you do what you've done below, you
> > are running one job with 12 CPU cores and all GPUs. If you don't
> > specify the +devices, NAMD will automatically find the available
> > GPUs, so I think the main benefit of specifying them is when you
> > are running more than one job and don't want the jobs sharing GPUs.
> >
> > I'm not sure you'll see great scaling across 6 GPUs for a single
> > job, but that would be great if you did.
> >
> > ~Aron
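
For reference, a single-node run with the multicore-CUDA build that
Aron mentions would be launched without charmrun, roughly like this
(the config and log file names are placeholders):

    namd2 +p12 +idlepoll +devices 0,1,2,3,4,5 equil.namd > equil.log

The ibverbs-smp-CUDA build goes through charmrun instead, as in the
command quoted below.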
> >
> > On Tue, Jul 10, 2012 at 1:14 PM, Gianluca Interlandi
> > <gianluca_at_u.washington.edu> wrote:
> >
> > Hi,
> >
> > I have a question concerning running NAMD on a CUDA cluster.
> >
> > NCSA Forge has for example 6 CUDA devices and 16 CPU cores per
> > node. If I want to use all 6 CUDA devices in a node, how many
> > processes is it recommended to spawn? Do I need to specify
> > "+devices"?
> >
> > So, if for example I want to spawn 12 processes, do I need to
> > specify:
> >
> > charmrun +p12 -machinefile $PBS_NODEFILE +devices 0,1,2,3,4,5
> > namd2 +idlepoll
> >
> > Thanks,
> >
> > Gianluca
> >
> >
> >
>
> -----------------------------------------------------
> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> +1 (206) 685 4435
> http://artemide.bioeng.washington.edu/
>
> Research Scientist at the Department of Bioengineering
> at the University of Washington, Seattle WA U.S.A.
> -----------------------------------------------------
This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:21:47 CST