Re: AW: AW: Running NAMD on Forge (CUDA)

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Mon Jul 16 2012 - 02:28:25 CDT

On Mon, Jul 16, 2012 at 9:11 AM, Norman Geist
<norman.geist_at_uni-greifswald.de> wrote:
> Hi again,
>
> I use stepspercycle 20 and fullElectFrequency 4.
>
> Also, I get about a 6x speedup compared to CPU only (two Tesla C2060 per
> node, each shared by 6 cores).
>
> Also note that 6 GPUs per node is not an optimal configuration, as the
> PCI-E bandwidth is shared by all GPUs.

The situation on Forge is even worse. The GPUs sit in a Dell "GPU
enclosure", so that multiple GPUs (I think it is configured as 2) are
behind a bridge chip that connects them to a single PCI-E slot, similar
to multi-GPU cards. On top of that, you have AMD Opteron CPUs, which
have a 2-channel memory controller, unlike recent Intel Xeons with 3
(Westmere) or 4 (Sandy Bridge) channels. And you have a CPU-to-GPU
imbalance: one CPU carries two pairs of GPUs, while the other has one
pair plus the InfiniBand controller. Originally the machine was meant
to be configured with 8 GPUs/node, but that was changed to give more
PCI-E bandwidth to the IB HCA.

In short, Forge is best suited for GPU codes that run entirely on the
GPU and only need occasional communication, which is not exactly what
NAMD needs to run well. For running NAMD it would have been more
effective not to spend any money on GPUs and to buy more CPUs
instead. :-(
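
If you want to check this sharing on a compute node yourself, and assuming
lspci and nvidia-smi are available there, something like the following
should do (look for the NVIDIA devices hanging off the same bridge in the
PCI tree):

  lspci -tv                          # PCI device tree; GPUs under one bridge share a single upstream slot
  nvidia-smi -q | grep -i "bus id"   # PCI bus IDs of the GPUs the driver sees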

axel.

>
>
> Norman Geist.
>
>> -----Original Message-----
>> From: Gianluca Interlandi [mailto:gianluca_at_u.washington.edu]
>> Sent: Sunday, July 15, 2012 03:59
>> To: Norman Geist
>> Cc: Namd Mailing List
>> Subject: Re: AW: AW: namd-l: Running NAMD on Forge (CUDA)
>>
>> Hi Norman,
>>
>> > Ok, then it's 1 I guess. This is bad for GPU simulations, as the
>> > electrostatics is done on the CPU. This causes a lot of traffic between
>> > CPU and GPU and congests the PCI-E bus. Additionally, I could imagine
>> > that 6 GPUs also need a lot of PCI-E bandwidth, so it's likely that the
>> > performance of the GPUs is not as expected. You should try to set
>> > fullElectFrequency to at least 4 and try out the molly parameter. This
>> > should cause less traffic on the PCI-E bus and improve GPU utilization,
>> > but it slightly harms energy conservation, which shows up as a slowly
>> > increasing temperature. With the molly parameter it should be ok, I
>> > think.
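
For reference, in a NAMD 2.9 config file this advice would translate into
something like the lines below (a minimal sketch; parameter names as in the
NAMD user guide, values just the ones mentioned in this thread):

  timestep            2.0   ;# 2 fs time step, with rigid bonds (SHAKE)
  rigidBonds          all
  stepspercycle       20
  fullElectFrequency  4     ;# evaluate full PME electrostatics only every 4th step
  molly               on    ;# mollified impulse method (damps resonance from the longer PME interval)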
>>
>> I followed your recommendation. Now it runs almost twice as fast on 6
>> CUDA devices compared with the configuration without molly and no
>> multistepping: I get 0.06 sec/step (versus 0.1 sec/step). On the other
>> hand, running on the 16 CPUs with the same configuration takes 0.12
>> sec/step. So, I get a speedup of 2x with CUDA (6 CUDA devices vs 16 CPU
>> cores). As a comparison, I get 0.08 sec/step on 4 CUDA devices, 0.14
>> sec/step on 2 devices, and 0.25 sec/step on 1 device.
>>
>> To be honest, I was expecting a lot more from CUDA. It seems that one
>> M2070 (0.25 sec/step) is almost equivalent in performance to one 8-core
>> Magny-Cours CPU (0.22 sec/step). Or maybe it's just because CPU
>> manufacturers have caught up, as I already mentioned.
>>
>> Gianluca
>>
>> >>> How many GPUs are there per node in this cluster?
>> >>
>> >> 6
>> >>
>> >>> What kind of interconnect?
>> >>
>> >> Infiniband.
>> >
>> > If you are running over multiple nodes, please make sure that you
>> > actually use the InfiniBand interconnect. For that you either need an
>> > ibverbs binary of NAMD or IPoIB must be installed. You can see whether
>> > IPoIB is available if an ib0 interface shows up when you run ifconfig.
>> > Also, as far as I have observed, IPoIB should be configured in connected
>> > mode with an MTU of about 65520 (cat /sys/class/net/ib0/mode or .../mtu
>> > to see the current settings).
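
For reference, a minimal check/setup sketch along those lines, assuming a
standard Linux IPoIB stack (the last two commands need root):

  cat /sys/class/net/ib0/mode                 # should say "connected"
  cat /sys/class/net/ib0/mtu                  # should say 65520
  echo connected > /sys/class/net/ib0/mode    # switch from datagram to connected mode
  ifconfig ib0 mtu 65520                      # raise the MTU accordingly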
>> >
>> >>
>> >> Here are all specs:
>> >>
>> >>
>> >> http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/DellNVIDIACluster/TechSummary/index.html
>> >>
>> >> Thanks,
>> >>
>> >> Gianluca
>> >>
>> >>> Norman Geist.
>> >>>
>> >>>> -----Original Message-----
>> >>>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
>> >>>>   On Behalf Of Gianluca Interlandi
>> >>>> Sent: Friday, July 13, 2012 00:26
>> >>>> To: Aron Broom
>> >>>> Cc: NAMD list
>> >>>> Subject: Re: namd-l: Running NAMD on Forge (CUDA)
>> >>>>
>> >>>> Yes, I was totally surprised, too. I also ran a non-CUDA job on Forge
>> >>>> using 16 CPUs. I got 0.122076 s/step, which is 16% slower than using
>> >>>> the 6 GPUs (0.1 s/step) and a bit slower than the 0.10932 s/step that
>> >>>> I get on Trestles using 16 cores. This difference might be statistical
>> >>>> fluctuations though (or configuration setup), since Forge and Trestles
>> >>>> have exactly the same CPU, i.e., the eight-core 2.4 GHz Magny-Cours.
>> >>>>
>> >>>> Yes, Forge also uses NVIDIA M2070.
>> >>>>
>> >>>> I keep thinking of this guy here in Seattle who works for NVIDIA
>> >>>> downtown, and a few years ago he asked me: "How come you don't use
>> >>>> CUDA?" Maybe the code still needs some optimization, and CPU
>> >>>> manufacturers have been doing everything to catch up.
>> >>>>
>> >>>> Gianluca
>> >>>>
>> >>>> On Thu, 12 Jul 2012, Aron Broom wrote:
>> >>>>
>> >>>>> So your speed for 1 or 2 GPUs (based on what you sent) is about
>> >>>>> 1.7 ns/day, which seems decent given the system size. I was getting
>> >>>>> 2.0 and 2.6 ns/day for a 100k atom system with roughly those same
>> >>>>> parameters (and also 6 CPU cores), so given a scaling of ~nlogn, I
>> >>>>> would expect to see ~1.5 to 2.0 ns/day for you. So in my mind, the
>> >>>>> speed you are getting with the GPUs isn't so surprising; it's that
>> >>>>> you get such a good speed with only the CPUs that shocks me. In my
>> >>>>> case I didn't see speeds matching my 1 GPU until 48 CPU cores alone.
>> >>>>> Seems like those Magny-Cours are pretty awesome.
>> >>>>>
>> >>>>> Which GPUs are you using? I was using mainly the M2070s.
>> >>>>>
>> >>>>> Also, one thing that might be useful, if you are able to get roughly
>> >>>>> the same speed with 6 cores and 2 GPUs as you get with 16 cores
>> >>>>> alone, is to test running 3 jobs at once, with 5 cores and 2 GPUs
>> >>>>> assigned to each, and see how much slowdown there is. You might be
>> >>>>> able to benefit from various replica techniques more than just
>> >>>>> hitting a single job with more power.
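
Splitting a Forge node that way with the multicore-CUDA build would look
roughly like the lines below; the .conf/.log file names are made up, and
the core/GPU pairing is just one plausible assignment:

  # three independent runs, each on 5 cores and its own pair of GPUs
  namd2 +p5 +idlepoll +devices 0,1 job1.conf > job1.log &
  namd2 +p5 +idlepoll +devices 2,3 job2.conf > job2.log &
  namd2 +p5 +idlepoll +devices 4,5 job3.conf > job3.log &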
>> >>>>>
>> >>>>> Still, the overall conclusion from what you've got seems to be that
>> >>>>> it makes more sense to go with more of those CPUs rather than
>> >>>>> putting GPUs in there.
>> >>>>> ~Aron
>> >>>>>
>> >>>>> On Thu, Jul 12, 2012 at 4:58 PM, Gianluca Interlandi
>> >>>>> <gianluca_at_u.washington.edu> wrote:
>> >>>>>
>> >>>>>         What are your simulation parameters:
>> >>>>>
>> >>>>>         timestep (and also any multistepping values)
>> >>>>>
>> >>>>>     2 fs, SHAKE, no multistepping
>> >>>>>
>> >>>>>         cutoff (and also the pairlist and PME grid spacing)
>> >>>>>
>> >>>>>     8-10-12; PME grid spacing ~ 1 A
>> >>>>>
>> >>>>>         Have you tried giving it just 1 or 2 GPUs alone (using the
>> >>>>>         +devices)?
>> >>>>>
>> >>>>>     Yes, these are the benchmark times:
>> >>>>>
>> >>>>>     np 1: 0.48615 s/step
>> >>>>>     np 2: 0.26105 s/step
>> >>>>>     np 4: 0.14542 s/step
>> >>>>>     np 6: 0.10167 s/step
>> >>>>>
>> >>>>>     I post here also part of the log from running on 6 devices (in
>> >>>>>     case it is helpful to localize the problem):
>> >>>>>
>> >>>>>     Pe 4 has 57 local and 64 remote patches and 1066 local and 473 remote computes.
>> >>>>>     Pe 1 has 57 local and 65 remote patches and 1057 local and 482 remote computes.
>> >>>>>     Pe 5 has 57 local and 56 remote patches and 1150 local and 389 remote computes.
>> >>>>>     Pe 2 has 57 local and 57 remote patches and 1052 local and 487 remote computes.
>> >>>>>     Pe 3 has 58 local and 57 remote patches and 1079 local and 487 remote computes.
>> >>>>>     Pe 0 has 57 local and 57 remote patches and 1144 local and 395 remote computes.
>> >>>>>
>> >>>>>     Gianluca
>> >>>>>
>> >>>>>     On Thu, 12 Jul 2012, Aron Broom wrote:
>> >>>>>
>> >>>>>         have you tried the multicore build? I wonder if the prebuilt
>> >>>>>         smp one is just not working for you.
>> >>>>>
>> >>>>>         On Thu, Jul 12, 2012 at 3:21 PM, Gianluca Interlandi
>> >>>>>         <gianluca_at_u.washington.edu> wrote:
>> >>>>>
>> >>>>>                 are other people also using those GPUs?
>> >>>>>
>> >>>>>             I don't think so, since I reserved the entire node.
>> >>>>>
>> >>>>>                 What are the benchmark timings that you are given
>> >>>>>                 after ~1000 steps?
>> >>>>>
>> >>>>>             The benchmark time with 6 processes is 101 sec for 1000
>> >>>>>             steps. This is only slightly faster than Trestles, where
>> >>>>>             I get 109 sec for 1000 steps running on 16 CPUs. So, yes,
>> >>>>>             6 GPUs on Forge are much faster than 6 cores on Trestles,
>> >>>>>             but in terms of SUs it makes no difference, since on
>> >>>>>             Forge I still have to reserve the entire node (16 cores).
>> >>>>>
>> >>>>>             Gianluca
>> >>>>>
>> >>>>>                 is some setup time.
>> >>>>>
>> >>>>>                 I often run a system of ~100,000 atoms, and I
>> >>>>>                 generally see an order of magnitude improvement in
>> >>>>>                 speed compared to the same number of cores without
>> >>>>>                 the GPUs. I would test the non-CUDA precompiled code
>> >>>>>                 on your Forge system and see how that compares; it
>> >>>>>                 might be the fault of something other than CUDA.
>> >>>>>
>> >>>>>                 ~Aron
>> >>>>>
>> >>>>>                 On Thu, Jul 12, 2012 at 2:41 PM, Gianluca Interlandi
>> >>>>>                 <gianluca_at_u.washington.edu> wrote:
>> >>>>>
>> >>>>>                     Hi Aron,
>> >>>>>
>> >>>>>                     Thanks for the explanations. I don't know whether
>> >>>>>                     I'm doing everything right. I don't see any speed
>> >>>>>                     advantage running on the CUDA cluster (Forge)
>> >>>>>                     versus running on a non-CUDA cluster.
>> >>>>>
>> >>>>>                     I did the following benchmarks on Forge (the
>> >>>>>                     system has 127,000 atoms and ran for 1000 steps):
>> >>>>>
>> >>>>>                     np 1:  506 sec
>> >>>>>                     np 2:  281 sec
>> >>>>>                     np 4:  163 sec
>> >>>>>                     np 6:  136 sec
>> >>>>>                     np 12: 218 sec
>> >>>>>
>> >>>>>                     On the other hand, running the same system on 16
>> >>>>>                     cores of Trestles (AMD Magny-Cours) takes 129
>> >>>>>                     sec. It seems that I'm not really making good use
>> >>>>>                     of SUs by running on the CUDA cluster. Or maybe
>> >>>>>                     I'm doing something wrong? I'm using the
>> >>>>>                     ibverbs-smp-CUDA pre-compiled version of NAMD 2.9.
>> >>>>>
>> >>>>>                     Thanks,
>> >>>>>
>> >>>>>                     Gianluca
>> >>>>>
>> >>>>>                     On Tue, 10 Jul 2012, Aron Broom wrote:
>> >>>>>
>> >>>>>                         if it is truly just one node, you can use
>> >>>>>                         the multicore-CUDA version and avoid the MPI
>> >>>>>                         charmrun stuff. Still, it boils down to much
>> >>>>>                         the same thing, I think. If you do what
>> >>>>>                         you've done below, you are running one job
>> >>>>>                         with 12 CPU cores and all GPUs. If you don't
>> >>>>>                         specify +devices, NAMD will automatically
>> >>>>>                         find the available GPUs, so I think the main
>> >>>>>                         benefit of specifying them is when you are
>> >>>>>                         running more than one job and don't want the
>> >>>>>                         jobs sharing GPUs.
>> >>>>>
>> >>>>>                         I'm not sure you'll see great scaling across
>> >>>>>                         6 GPUs for a single job, but that would be
>> >>>>>                         great if you did.
>> >>>>>
>> >>>>>                         ~Aron
>> >>>>>
>> >>>>>                         On Tue, Jul 10, 2012 at 1:14 PM, Gianluca
>> >>>>>                         Interlandi <gianluca_at_u.washington.edu>
>> >>>>>                         wrote:
>> >>>>>                             Hi,
>> >>>>>
>> >>>>>                             I have a question concerning running
>> >>>>>                             NAMD on a CUDA cluster.
>> >>>>>
>> >>>>>                             NCSA Forge has for example 6 CUDA
>> >>>>>                             devices and 16 CPU cores per node. If I
>> >>>>>                             want to use all 6 CUDA devices in a
>> >>>>>                             node, how many processes is it
>> >>>>>                             recommended to spawn? Do I need to
>> >>>>>                             specify "+devices"?
>> >>>>>
>> >>>>>                             So, if for example I want to spawn 12
>> >>>>>                             processes, do I need to specify:
>> >>>>>
>> >>>>>                             charmrun +p12 -machinefile $PBS_NODEFILE
>> >>>>>                                 +devices 0,1,2,3,4,5 namd2 +idlepoll
>> >>>>>
>> >>>>>                             Thanks,
>> >>>>>
>> >>>>>                             Gianluca
>> >
>> >
>>
>> -----------------------------------------------------
>> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
>> +1 (206) 685 4435
>> http://artemide.bioeng.washington.edu/
>>
>> Research Scientist at the Department of Bioengineering
>> at the University of Washington, Seattle WA U.S.A.
>> -----------------------------------------------------
>
>

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
International Centre for Theoretical Physics, Trieste. Italy.

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:21:47 CST