From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Jul 13 2012 - 02:09:04 CDT
> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
> On Behalf Of Gianluca Interlandi
> Sent: Friday, July 13, 2012 08:36
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: AW: namd-l: Running NAMD on Forge (CUDA)
>
> Hi Norman,
Hi,
>
> > What value do you use for fullElectFrequency?
>
> The default. I haven't set it.
Ok, then it defaults to 1, I guess. This is bad for GPU simulations because
the full (PME) electrostatics is still evaluated on the CPU every step. That
causes a lot of traffic between CPU and GPU and congests the PCIe bus. In
addition, 6 GPUs presumably need plenty of PCIe bandwidth on their own, so it
is likely that the GPUs do not perform as expected. You should try setting
fullElectFrequency to at least 4 and also try out the molly parameter. This
reduces the traffic over PCIe and improves GPU utilization, but it slightly
hurts energy conservation, which shows up as a slowly increasing temperature.
With the molly parameter enabled it should be ok, I think.
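For concreteness, the relevant part of a NAMD configuration file might look
roughly like the sketch below. The values are only an illustration of this
suggestion, not a recommendation for your system; check them against the
NAMD User's Guide and watch energies and temperature in a test run first.

   timestep            2.0   ;# fs, with rigid bonds (SHAKE) as in the runs below
   nonbondedFreq       1     ;# short-range nonbonded forces every step
   fullElectFrequency  4     ;# full (PME) electrostatics only every 4th step
   stepspercycle       20    ;# must be a multiple of fullElectFrequency
   # The mollified impulse method mentioned above is enabled with "molly on";
   # see the User's Guide for its restrictions before switching it on.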
>
> > How many GPUs are there per node in this cluster?
>
> 6
>
> > What kind of interconnect?
>
> Infiniband.
If you are running over multiple nodes, please make sure that you actually
use the InfiniBand interconnect. For that you need either an ibverbs binary
of NAMD, or IPoIB must be installed. You can tell whether IPoIB is available
if an ib0 interface shows up, for example when you run ifconfig. Also, as I
have observed, IPoIB should be configured in connected mode with an MTU of
about 65520 (cat /sys/class/net/ib0/mode or /sys/class/net/ib0/mtu to see
the current settings).
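For example, a quick check, and the change if it is needed, could look like
this from a shell on a compute node. This is only a sketch: ib0 is assumed
to be the IPoIB interface, and the last two commands need root and are not
persistent across reboots.

   ifconfig ib0                     # does the IPoIB interface exist at all?
   cat /sys/class/net/ib0/mode      # "datagram" or "connected"
   cat /sys/class/net/ib0/mtu       # e.g. 2044 (datagram) vs. 65520 (connected)

   # as root: switch to connected mode and raise the MTU
   echo connected > /sys/class/net/ib0/mode
   ip link set ib0 mtu 65520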
>
> Here are all specs:
>
> http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/DellNVIDIACluster/TechSummary/index.html
>
> Thanks,
>
> Gianluca
>
> > Norman Geist.
> >
> >> -----Original Message-----
> >> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
> >> On Behalf Of Gianluca Interlandi
> >> Sent: Friday, July 13, 2012 00:26
> >> To: Aron Broom
> >> Cc: NAMD list
> >> Subject: Re: namd-l: Running NAMD on Forge (CUDA)
> >>
> >> Yes, I was totally surprised, too. I also ran a non-CUDA job on Forge
> >> using 16 CPUs. I got 0.122076 s/step, which is 16% slower than using
> >> the 6 GPUs (0.1 s/step) and a bit slower than 0.10932 s/step that I get
> >> on Trestles using 16 cores. This difference might be statistical
> >> fluctuations though (or configuration setup) since Forge and Trestles
> >> have the exact same CPU, i.e., eight-core 2.4 GHz Magny-Cours.
> >>
> >> Yes, Forge also uses NVIDIA M2070.
> >>
> >> I keep thinking of this guy here in Seattle who works for NVIDIA
> >> downtown and a few years ago he asked me: "How come you don't use
> >> CUDA?" Maybe the code still needs some optimization, and CPU
> >> manufacturers have been doing everything to catch up.
> >>
> >> Gianluca
> >>
> >> On Thu, 12 Jul 2012, Aron Broom wrote:
> >>
> >>> So your speed for 1 or 2 GPUs (based on what you sent) is about
> >>> 1.7 ns/day, which seems decent given the system size. I was getting
> >>> 2.0 and 2.6 ns/day for a 100k atom system with roughly those same
> >>> parameters (and also 6 CPU cores), so given a scaling of ~nlogn, I
> >>> would expect to see ~1.5 to 2.0 ns/day for you. So in my mind, the
> >>> speed you are getting with the GPUs isn't so surprising; it's that
> >>> you get such a good speed with only the CPUs that shocks me. In my
> >>> case I didn't see speeds matching my 1 GPU until 48 CPU cores alone.
> >>> Seems like those Magny Cours are pretty awesome.
> >>>
> >>> Which GPUs are you using? I was using mainly the M2070s.
> >>>
> >>> Also, one thing that might be useful, if you are able to get roughly
> >>> the same speed with 6 cores and 2 GPUs as you get with 16 cores
> >>> alone, is to test running 3 jobs at once, with 5 cores and 2 GPUs
> >>> assigned to each, and see how much slowdown there is. You might be
> >>> able to benefit from various replica techniques more than just
> >>> hitting a single job with more power.
> >>>
> >>> Still, the overall conclusion from what you've got seems to be that
> >>> it makes more sense to go with more of those CPUs rather than
> >>> putting GPUs in there.
> >>>
> >>> ~Aron
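To make that concrete, three independent runs on one node, each pinned to
its own pair of GPUs via +devices, could be started roughly as below with a
single-node CUDA build of NAMD; the +p5 core count and the file names are
placeholders only.

   namd2 +p5 +devices 0,1 +idlepoll job1.conf > job1.log &
   namd2 +p5 +devices 2,3 +idlepoll job2.conf > job2.log &
   namd2 +p5 +devices 4,5 +idlepoll job3.conf > job3.log &
   wait    # block until all three background runs have finished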
> >>>
> >>> On Thu, Jul 12, 2012 at 4:58 PM, Gianluca Interlandi
> >>> <gianluca_at_u.washington.edu> wrote:
> >>> What are your simulation parameters:
> >>>
> >>> timestep (and also any multistepping values)
> >>>
> >>> 2 fs, SHAKE, no multistepping
> >>>
> >>> cutoff (and also the pairlist and PME grid spacing)
> >>>
> >>> 8-10-12 PME grid spacing ~ 1 A
> >>>
> >>> Have you tried giving it just 1 or 2 GPUs alone (using the +devices)?
> >>>
> >>>
> >>> Yes, this is the benchmark time:
> >>>
> >>> np 1: 0.48615 s/step
> >>> np 2: 0.26105 s/step
> >>> np 4: 0.14542 s/step
> >>> np 6: 0.10167 s/step
> >>>
> >>> I post here also part of the log running on 6 devices (in case it is
> >>> helpful to localize the problem):
> >>>
> >>> Pe 4 has 57 local and 64 remote patches and 1066 local and 473 remote computes.
> >>> Pe 1 has 57 local and 65 remote patches and 1057 local and 482 remote computes.
> >>> Pe 5 has 57 local and 56 remote patches and 1150 local and 389 remote computes.
> >>> Pe 2 has 57 local and 57 remote patches and 1052 local and 487 remote computes.
> >>> Pe 3 has 58 local and 57 remote patches and 1079 local and 487 remote computes.
> >>> Pe 0 has 57 local and 57 remote patches and 1144 local and 395 remote computes.
> >>>
> >>> Gianluca
> >>>
> >>> On Thu, 12 Jul 2012, Aron Broom wrote:
> >>>
> >>> have you tried the multicore build? I wonder if the prebuilt smp
> >>> one is just not working for you.
> >>>
> >>> On Thu, Jul 12, 2012 at 3:21 PM, Gianluca Interlandi
> >>> <gianluca_at_u.washington.edu> wrote:
> >>>
> >>> are other people also using those GPUs?
> >>>
> >>> I don't think so since I reserved the entire node.
> >>>
> >>> What are the benchmark timings that you are given after ~1000 steps?
> >>>
> >>>
> >>> The benchmark time with 6 processes is 101 sec for 1000 steps. This
> >>> is only slightly faster than Trestles where I get 109 sec for 1000
> >>> steps running on 16 CPUs. So, yes, 6 GPUs on Forge are much faster
> >>> than 6 cores on Trestles, but in terms of SUs it makes no difference,
> >>> since on Forge I still have to reserve the entire node (16 cores).
> >>>
> >>> Gianluca
> >>>
> >>> is some setup time.
> >>>
> >>> I often run a system of ~100,000 atoms, and I generally see an order
> >>> of magnitude improvement in speed compared to the same number of
> >>> cores without the GPUs. I would test the non-CUDA precompiled code
> >>> on your Forge system and see how that compares; it might be the
> >>> fault of something other than CUDA.
> >>>
> >>> ~Aron
> >>>
> >>> On Thu, Jul 12, 2012 at 2:41 PM, Gianluca Interlandi
> >>> <gianluca_at_u.washington.edu> wrote:
> >>> Hi Aron,
> >>>
> >>> Thanks for the explanations. I don't know whether I'm doing
> >>> everything right. I don't see any speed advantage running on the
> >>> CUDA cluster (Forge) versus running on a non-CUDA cluster.
> >>>
> >>> I did the following benchmarks on Forge (the system has 127,000
> >>> atoms and ran for 1000 steps):
> >>>
> >>> np 1: 506 sec
> >>> np 2: 281 sec
> >>> np 4: 163 sec
> >>> np 6: 136 sec
> >>> np 12: 218 sec
> >>>
> >>> On the other hand, running the same system on 16 cores of Trestles
> >>> (AMD Magny Cours) takes 129 sec. It seems that I'm not really making
> >>> good use of SUs by running on the CUDA cluster. Or, maybe I'm doing
> >>> something wrong? I'm using the ibverbs-smp-CUDA pre-compiled version
> >>> of NAMD 2.9.
> >>>
> >>> Thanks,
> >>>
> >>> Gianluca
> >>>
> >>> On Tue, 10 Jul 2012, Aron Broom wrote:
> >>>
> >>> if it is truly just one node, you can use the multicore-CUDA
> >>> version and avoid the MPI charmrun stuff. Still, it boils down to
> >>> much the same thing I think. If you do what you've done below, you
> >>> are running one job with 12 CPU cores and all GPUs. If you don't
> >>> specify the +devices, NAMD will automatically find the available
> >>> GPUs, so I think the main benefit of specifying them is when you are
> >>> running more than one job and don't want the jobs sharing GPUs.
> >>>
> >>> I'm not sure you'll see great scaling across 6 GPUs for a single
> >>> job, but that would be great if you did.
> >>>
> >>> ~Aron
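As a concrete sketch of the single-node alternative described above, the
multicore-CUDA build is launched directly, without charmrun; the +p12 thread
count, the explicit device list and the file names are placeholders only.

   # one process with 12 worker threads using all six GPUs; the +devices
   # list can be dropped if the job is allowed to use every GPU on the node
   namd2 +p12 +devices 0,1,2,3,4,5 +idlepoll run.conf > run.log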
> >>>
> >>> On Tue, Jul 10, 2012 at 1:14 PM, Gianluca Interlandi
> >>> <gianluca_at_u.washington.edu> wrote:
> >>> Hi,
> >>>
> >>> I have a question concerning running NAMD on a CUDA cluster.
> >>>
> >>> NCSA Forge has for example 6 CUDA devices and 16 CPU cores per node.
> >>> If I want to use all 6 CUDA devices in a node, how many processes is
> >>> it recommended to spawn? Do I need to specify "+devices"?
> >>>
> >>> So, if for example I want to spawn 12 processes, do I need to
> >>> specify:
> >>>
> >>> charmrun +p12 -machinefile $PBS_NODEFILE +devices 0,1,2,3,4,5 namd2 +idlepoll
> >>>
> >>> Thanks,
> >>>
> >>> Gianluca
> >>>
> >>> -----------------------------------------------------
> >>> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> >>> +1 (206) 685 4435
> >>> http://artemide.bioeng.washington.edu/
> >>>
> >>> Research Scientist at the Department of Bioengineering
> >>> at the University of Washington, Seattle WA U.S.A.
> >>> -----------------------------------------------------
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Aron Broom M.Sc
> >>> PhD Student
> >>> Department of Chemistry
> >>> University of Waterloo
> >>>
> >>>
> >>>
> >>
> >> -----------------------------------------------------
> >> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> >> +1 (206) 685 4435
> >> http://artemide.bioeng.washington.edu/
> >>
> >> Research Scientist at the Department of Bioengineering
> >> at the University of Washington, Seattle WA U.S.A.
> >> -----------------------------------------------------
> >
> >
>
> -----------------------------------------------------
> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> +1 (206) 685 4435
> http://artemide.bioeng.washington.edu/
>
> Research Scientist at the Department of Bioengineering
> at the University of Washington, Seattle WA U.S.A.
> -----------------------------------------------------
This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:21:47 CST