From: Gianluca Interlandi (gianluca_at_u.washington.edu)
Date: Sat Jul 14 2012 - 20:59:05 CDT
Hi Norman,
> Ok, then it's 1 I guess. This is bad for GPU simulations, as the
> electrostatics are done on the CPU. This causes a lot of traffic between
> CPU and GPU and congests the PCI-E bus. Additionally, I would imagine that
> 6 GPUs also need a lot of PCI-E bandwidth, so it's likely that the GPUs do
> not perform as expected. You should try setting fullElectFrequency to at
> least 4 and try out the new molly parameter. This should cause less traffic
> on PCI-E and improve GPU utilization, at the cost of slightly worse energy
> conservation, which shows up as a slowly increasing temperature. But with
> the molly parameter it should be ok, I think.
I followed your recommendation. Now it runs almost twice as fast on 6 CUDA
devices compared with the configuration without molly and without
multistepping: I get 0.06 sec/step (versus 0.1 sec/step). On the other hand,
running on the 16 CPUs with the same configuration takes 0.12 sec/step. So, I
get a speed-up of 2x with CUDA (6 CUDA devices vs 16 CPU cores). For
comparison, I get 0.08 sec/step on 4 CUDA devices, 0.14 sec/step on 2
devices, and 0.25 sec/step on 1 device.
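For reference, the multistepping-related lines in my configuration file now
look roughly like this (just a sketch of how I understood your suggestion;
the molly tolerance and iteration values are the documented defaults, not
something I tuned):

    timestep            2.0      # 2 fs, with SHAKE as before
    nonbondedFreq       1        # short-range nonbonded every step
    fullElectFrequency  4        # full (PME) electrostatics every 4 steps
    molly               on       # mollified impulse method for stability
    mollyTolerance      0.00001  # assumed default
    mollyIterations     100      # assumed default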
To be honest, I was expecting a lot more from CUDA. It seems that one
M2070 (0.25 sec/step) is almost equivalent in performance to one 8-core
Magny-Cours CPU (0.22 sec/step). Or maybe it's just that CPU
manufacturers have caught up, as I already mentioned.
Gianluca
>>> How many GPUs are there per node in this cluster?
>>
>> 6
>>
>>> What kind of interconnect?
>>
>> Infiniband.
>
> Please make sure, if you are running over multiple nodes, that you make use
> of the InfiniBand interconnect. For that you either need an ibverbs binary
> of NAMD or IPoIB must be installed. You can see whether IPoIB is working by
> checking whether there is an ib0 interface, for example when you run
> ifconfig. Also, as I have observed, IPoIB should be configured in connected
> mode and with an MTU of about 65520 (cat /sys/class/net/ib0/mode or mtu to
> see the current settings).
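(Spelling out the check you suggest, and assuming the InfiniBand interface
on the nodes really is called ib0, the quick test would be something like:

    cat /sys/class/net/ib0/mode    # expect: connected
    cat /sys/class/net/ib0/mtu     # expect: 65520

where the expected values are the ones you recommend above.)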
>
>>
>> Here are all specs:
>>
>> http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/DellNVIDIACluster/TechSummary/index.html
>>
>> Thanks,
>>
>> Gianluca
>>
>>> Norman Geist.
>>>
>>>> -----Original Message-----
>>>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
>>>> On Behalf Of Gianluca Interlandi
>>>> Sent: Friday, July 13, 2012, 00:26
>>>> To: Aron Broom
>>>> Cc: NAMD list
>>>> Subject: Re: namd-l: Running NAMD on Forge (CUDA)
>>>>
>>>> Yes, I was totally surprised, too. I also ran a non-CUDA job on Forge
>>>> using 16 CPUs. I got 0.122076 s/step, which is 16% slower than using
>>>> the 6 GPUs (0.1 s/step) and a bit slower than the 0.10932 s/step that
>>>> I get on Trestles using 16 cores. This difference might be statistical
>>>> fluctuations though (or configuration setup) since Forge and Trestles
>>>> have the exact same CPU, i.e., eight-core 2.4 GHz Magny-Cours.
>>>>
>>>> Yes, Forge also uses NVIDIA M2070.
>>>> I keep thinking of this guy here in Seattle who works for NVIDIA
>>>> downtown, and a few years ago he asked me: "How come you don't use
>>>> CUDA?" Maybe the code still needs some optimization, and CPU
>>>> manufacturers have been doing everything to catch up.
>>>>
>>>> Gianluca
>>>>
>>>> On Thu, 12 Jul 2012, Aron Broom wrote:
>>>>
>>>>> So your speed for 1 or 2 GPUs (based on what you sent) is about 1.7
>>>>> ns/day, which seems decent given the system size. I was getting 2.0
>>>>> and 2.6 ns/day for a 100k atom system with roughly those same
>>>>> parameters (and also 6 CPU cores), so given a scaling of ~nlogn, I
>>>>> would expect to see ~1.5 to 2.0 ns/day for you. So in my mind, the
>>>>> speed you are getting with the GPUs isn't so surprising, it's that
>>>>> you get such a good speed with only the CPUs that shocks me. In my
>>>>> case I didn't see speeds matching my 1 GPU until 48 CPU cores alone.
>>>>> Seems like those Magny Cours are pretty awesome.
>>>>>
>>>>> Which GPUs are you using? I was using mainly the M2070s.
>>>>>
>>>>> Also, one thing that might be useful, if you are able to get roughly
>>>>> the same speed with 6 cores and 2 GPUs as you get with 16 cores
>>>>> alone, is to test running 3 jobs at once, with 5 cores and 2 GPUs
>>>>> assigned to each, and see how much slowdown there is. You might be
>>>>> able to benefit from various replica techniques more than just
>>>>> hitting a single job with more power.
>>>>>
>>>>> Still, the overall conclusion from what you've got seems to be that
>>>>> it makes more sense to go with more of those CPUs rather than
>>>>> putting GPUs in there.
>>>>>
>>>>> ~Aron
>>>>>
>>>>> On Thu, Jul 12, 2012 at 4:58 PM, Gianluca Interlandi
>>>>> <gianluca_at_u.washington.edu> wrote:
>>>>> What are your simulation parameters:
>>>>>
>>>>> timestep (and also any multistepping values)
>>>>>
>>>>> 2 fs, SHAKE, no multistepping
>>>>>
>>>>> cutoff (and also the pairlist and PME grid spacing)
>>>>>
>>>>> 8-10-12 PME grid spacing ~ 1 A
>>>>>
>>>>> Have you tried giving it just 1 or 2 GPUs alone (using the
>>>>> +devices)?
>>>>>
>>>>>
>>>>> Yes, this is the benchmark time:
>>>>>
>>>>> np 1: 0.48615 s/step
>>>>> np 2: 0.26105 s/step
>>>>> np 4: 0.14542 s/step
>>>>> np 6: 0.10167 s/step
>>>>>
>>>>> I post here also part of the log running on 6 devices (in case it is
>>>>> helpful to localize the problem):
>>>>>
>>>>> Pe 4 has 57 local and 64 remote patches and 1066 local and 473 remote computes.
>>>>> Pe 1 has 57 local and 65 remote patches and 1057 local and 482 remote computes.
>>>>> Pe 5 has 57 local and 56 remote patches and 1150 local and 389 remote computes.
>>>>> Pe 2 has 57 local and 57 remote patches and 1052 local and 487 remote computes.
>>>>> Pe 3 has 58 local and 57 remote patches and 1079 local and 487 remote computes.
>>>>> Pe 0 has 57 local and 57 remote patches and 1144 local and 395 remote computes.
>>>>>
>>>>> Gianluca
>>>>>
>>>>> Gianluca
>>>>>
>>>>> On Thu, 12 Jul 2012, Aron Broom wrote:
>>>>>
>>>>> have you tried the multicore build? I wonder if the prebuilt
>>>>> smp one is just not working for you.
>>>>>
>>>>> On Thu, Jul 12, 2012 at 3:21 PM, Gianluca Interlandi
>>>>> <gianluca_at_u.washington.edu> wrote:
>>>>> are other people also using those GPUs?
>>>>>
>>>>> I don't think so since I reserved the entire node.
>>>>>
>>>>> What are the benchmark timings that you are given after ~1000
>>>>> steps?
>>>>>
>>>>> The benchmark time with 6 processes is 101 sec for 1000 steps. This
>>>>> is only slightly faster than Trestles where I get 109 sec for 1000
>>>>> steps running on 16 CPUs. So, yes 6 GPUs on Forge are much faster
>>>>> than 6 cores on Trestles, but in terms of SUs it makes no difference,
>>>>> since on Forge I still have to reserve the entire node (16 cores).
>>>>>
>>>>> Gianluca
>>>>>
>>>>> is some setup time.
>>>>>
>>>>> I often run a system of ~100,000 atoms, and I generally see an
>>>>> order of magnitude improvement in speed compared to the same number
>>>>> of cores without the GPUs. I would test the non-CUDA precompiled
>>>>> code on your Forge system and see how that compares, it might be
>>>>> the fault of something other than CUDA.
>>>>>
>>>>> ~Aron
>>>>>
>>>>> On Thu, Jul 12, 2012 at 2:41 PM, Gianluca Interlandi
>>>>> <gianluca_at_u.washington.edu> wrote:
>>>>> Hi Aron,
>>>>>
>>>>> Thanks for the explanations. I don't know whether I'm doing
>>>>> everything right. I don't see any speed advantage running on the
>>>>> CUDA cluster (Forge) versus running on a non-CUDA cluster.
>>>>>
>>>>> I did the following benchmarks on Forge (the system has 127,000
>>>>> atoms and ran for 1000 steps):
>>>>>
>>>>> np 1: 506 sec
>>>>> np 2: 281 sec
>>>>> np 4: 163 sec
>>>>> np 6: 136 sec
>>>>> np 12: 218 sec
>>>>>
>>>>> On the other hand, running the same system on 16 cores of Trestles
>>>>> (AMD Magny Cours) takes 129 sec. It seems that I'm not really making
>>>>> good use of SUs by running on the CUDA cluster. Or, maybe I'm doing
>>>>> something wrong? I'm using the ibverbs-smp-CUDA pre-compiled version
>>>>> of NAMD 2.9.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Gianluca
>>>>>
>>>>> On Tue, 10 Jul 2012, Aron Broom wrote:
>>>>>
>>>>> if it is truly just one node, you can use the multicore-CUDA
>>>>> version and avoid the MPI charmrun stuff. Still, it boils down to
>>>>> much the same thing I think. If you do what you've done below, you
>>>>> are running one job with 12 CPU cores and all GPUs. If you don't
>>>>> specify the +devices, NAMD will automatically find the available
>>>>> GPUs, so I think the main benefit of specifying them is when you
>>>>> are running more than one job and don't want the jobs sharing GPUs.
>>>>>
>>>>> I'm not sure you'll see great scaling across 6 GPUs for a single
>>>>> job, but that would be great if you did.
>>>>>
>>>>> ~Aron
>>>>>
>>>>> On Tue, Jul 10, 2012 at 1:14 PM, Gianluca Interlandi
>>>>> <gianluca_at_u.washington.edu> wrote:
>>>>> Hi,
>>>>>
>>>>> I have a question concerning running NAMD on a CUDA cluster.
>>>>>
>>>>> NCSA Forge has for example 6 CUDA devices and 16 CPU cores per
>>>>> node. If I want to use all 6 CUDA devices in a node, how many
>>>>> processes is it recommended to spawn? Do I need to specify
>>>>> "+devices"?
>>>>>
>>>>> So, if for example I want to spawn 12 processes, do I need to
>>>>> specify:
>>>>>
>>>>> charmrun +p12 -machinefile $PBS_NODEFILE +devices 0,1,2,3,4,5
>>>>> namd2 +idlepoll
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Gianluca
>>>>>
>>>>
>>>> -----------------------------------------------------
>>>> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
>>>> +1 (206) 685 4435
>>>> http://artemide.bioeng.washington.edu/
>>>>
>>>> Research Scientist at the Department of Bioengineering
>>>> at the University of Washington, Seattle WA U.S.A.
>>>> -----------------------------------------------------
>>>
>>>
>>
>> -----------------------------------------------------
>> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
>> +1 (206) 685 4435
>> http://artemide.bioeng.washington.edu/
>>
>> Research Scientist at the Department of Bioengineering
>> at the University of Washington, Seattle WA U.S.A.
>> -----------------------------------------------------
>
>
-----------------------------------------------------
Gianluca Interlandi, PhD gianluca_at_u.washington.edu
+1 (206) 685 4435
http://artemide.bioeng.washington.edu/
Research Scientist at the Department of Bioengineering
at the University of Washington, Seattle WA U.S.A.
-----------------------------------------------------