Re: AW: AW: Running NAMD on Forge (CUDA)

From: Gianluca Interlandi (gianluca_at_u.washington.edu)
Date: Sat Jul 14 2012 - 20:59:05 CDT

Hi Norman,

> Ok, then it's 1 I guess. This is bad for GPU simulations because the
> electrostatics are done on the CPU. This causes a lot of traffic between
> CPU and GPU and congests the PCI-E bus. In addition, I would imagine that
> 6 GPUs also need a lot of PCI-E bandwidth, so it's likely that the
> performance of the GPUs is not what you would expect. You should try
> setting fullElectFrequency to at least 4 and try out the new molly
> parameter. This should cause less traffic on PCI-E and improve GPU
> utilization; it does slightly harm energy conservation, which shows up as
> a slowly increasing temperature, but with the molly parameter it should
> be ok I think.
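
(For reference, a multiple-timestepping block along the lines Norman suggests
would look roughly like the following in the NAMD config file; the values are
only an illustration of his suggestion, not the exact settings I benchmarked:)

     timestep            2.0     ;# 2 fs, with rigid bonds (SHAKE)
     nonbondedFreq       2       ;# short-range nonbonded every 2 steps
     fullElectFrequency  4       ;# PME / full electrostatics every 4 steps
     molly               on      ;# mollified impulse for MTS stability
     mollyTolerance      0.00001 ;# NAMD's default tolerance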

I followed your recommendation. It now runs almost twice as fast on 6 CUDA
devices as with the configuration without molly and without multistepping: I
get 0.06 sec/step (versus 0.1 sec/step). On the other hand, running on 16 CPU
cores with the same configuration takes 0.12 sec/step, so I get a speedup of
2x with CUDA (6 CUDA devices vs 16 CPU cores). As a comparison, I get 0.08
sec/step on 4 CUDA devices, 0.14 sec/step on 2 devices, and 0.25 sec/step on
1 device.
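
(For comparison in ns/day: with a 2 fs timestep,

     ns/day = 86400 * timestep[fs] / (sec/step) / 1e6

so 0.06 sec/step is ~2.9 ns/day, 0.10 is ~1.7, 0.12 is ~1.4, and 0.25 is
~0.7 ns/day.)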

To be honest, I was expecting a lot more from CUDA. It seems that one M2070
(0.25 sec/step) is almost equivalent in performance to one 8-core
Magny-Cours CPU (0.22 sec/step). Or maybe it's just that CPU manufacturers
have caught up, as I already mentioned.

Gianluca

>>> How many GPUs are there per node in this cluster?
>>
>> 6
>>
>>> What kind of interconnect?
>>
>> Infiniband.
>
> If you are running over multiple nodes, please make sure that you actually
> use the InfiniBand interconnect. For that you need an ibverbs binary of
> NAMD, or IPoIB must be installed. You can tell that IPoIB is working if
> there is an ib0 interface, for example when you run ifconfig. Also, as I
> have observed, IPoIB should be configured in connected mode with an MTU of
> about 65520 (cat /sys/class/net/ib0/mode or .../mtu to show the current
> settings).
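
(For the record, the checks Norman describes boil down to something like the
commands below; changing the mode or MTU requires root and the exact sysfs
paths may differ between systems.)

     ifconfig ib0                  # the IPoIB interface should show up
     cat /sys/class/net/ib0/mode   # should print "connected"
     cat /sys/class/net/ib0/mtu    # should be close to 65520

     # as root, roughly:
     echo connected > /sys/class/net/ib0/mode
     ifconfig ib0 mtu 65520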
>
>>
>> Here are all specs:
>>
>> http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/DellNVIDIACluster/TechSummary/index.html
>>
>> Thanks,
>>
>> Gianluca
>>
>>> Norman Geist.
>>>
>>>> -----Original Message-----
>>>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
>>>> On Behalf Of Gianluca Interlandi
>>>> Sent: Friday, July 13, 2012 00:26
>>>> To: Aron Broom
>>>> Cc: NAMD list
>>>> Subject: Re: namd-l: Running NAMD on Forge (CUDA)
>>>>
>>>> Yes, I was totally surprised, too. I also ran a non-CUDA job on Forge
>>>> using 16 CPUs. I got 0.122076 s/step, which is 16% slower than using
>>>> the 6 GPUs (0.1 s/step) and a bit slower than the 0.10932 s/step that I
>>>> get on Trestles using 16 cores. This difference might be statistical
>>>> fluctuations though (or configuration setup), since Forge and Trestles
>>>> have the exact same CPU, i.e., the eight-core 2.4 GHz Magny-Cours.
>>>>
>>>> Yes, Forge also uses NVIDIA M2070.
>>>>
>>>> I keep thinking of this guy here in Seattle who works for NVIDIA
>>>> downtown and a few years ago he asked me: "How come you don't use
>>>> CUDA?" Maybe the code still needs some optimization, and CPU
>>>> manufacturers have been doing everything to catch up.
>>>>
>>>> Gianluca
>>>>
>>>> On Thu, 12 Jul 2012, Aron Broom wrote:
>>>>
>>>>> So your speed for 1 or 2 GPUs (based on what you sent) is about 1.7
>>>>> ns/day, which seems decent given the system size.  I was getting 2.0
>>>>> and 2.6 ns/day for a 100k atom system with roughly those same
>>>>> parameters (and also 6 cpu cores), so given a scaling of ~nlogn, I
>>>>> would expect to see ~1.5 to 2.0 ns/day for you.  So in my mind, the
>>>>> speed you are getting with the GPUs isn't so surprising; it's that you
>>>>> get such a good speed with only the CPUs that shocks me.  In my case I
>>>>> didn't see speeds matching my 1 GPU until 48 CPU cores alone.  Seems
>>>>> like those Magny-Cours are pretty awesome.
>>>>>
>>>>> Which GPUs are you using?  I was using mainly the M2070s.
>>>>>
>>>>> Also, one thing that might be useful, if you are able to get roughly
>>>>> the same speed with 6 cores and 2 GPUs as you get with 16 cores alone,
>>>>> is to test running 3 jobs at once, with 5 cores and 2 GPUs assigned to
>>>>> each, and see how much slowdown there is.  You might be able to benefit
>>>>> from various replica techniques more than just hitting a single job
>>>>> with more power.
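
(To make this concrete: three independent runs pinned to disjoint GPUs on one
node could be launched roughly as below; the config file names are just
placeholders.)

     charmrun +p5 namd2 +idlepoll +devices 0,1 job1.conf > job1.log &
     charmrun +p5 namd2 +idlepoll +devices 2,3 job2.conf > job2.log &
     charmrun +p5 namd2 +idlepoll +devices 4,5 job3.conf > job3.log &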
>>>>>
>>>>> Still, the overall conclusion from what you've got seems to be that it
>>>>> makes more sense to go with more of those CPUs rather than putting
>>>>> GPUs in there.
>>>>>
>>>>> ~Aron
>>>>>
>>>>> On Thu, Jul 12, 2012 at 4:58 PM, Gianluca Interlandi
>>>>> <gianluca_at_u.washington.edu> wrote:
>>>>> What are your simulation parameters:
>>>>>
>>>>> timestep (and also any multistepping values)
>>>>>
>>>>> 2 fs, SHAKE, no multistepping
>>>>>
>>>>> cutoff (and also the pairlist and PME grid spacing)
>>>>>
>>>>> 8-10-12  PME grid spacing ~ 1 A
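
(For completeness: 8-10-12 with a ~1 A PME grid corresponds roughly to the
NAMD config lines below; this is the generic form, not a paste from my input
file:)

     switching           on
     switchdist          8.0     ;# switching function starts here (A)
     cutoff              10.0    ;# nonbonded cutoff (A)
     pairlistdist        12.0    ;# pairlist distance (A)
     PME                 yes
     PMEGridSpacing      1.0     ;# ~1 A grid spacing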
>>>>>
>>>>> Have you tried giving it just 1 or 2 GPUs alone (using the
>>>>> +devices)?
>>>>>
>>>>>
>>>>> Yes, this is the benchmark time:
>>>>>
>>>>> np 1:  0.48615 s/step
>>>>> np 2:  0.26105 s/step
>>>>> np 4:  0.14542 s/step
>>>>> np 6:  0.10167 s/step
>>>>>
>>>>> I post here also part of the log running on 6 devices (in case it is
>>>>> helpful to localize the problem):
>>>>>
>>>>> Pe 4 has 57 local and 64 remote patches and 1066 local and 473 remote computes.
>>>>> Pe 1 has 57 local and 65 remote patches and 1057 local and 482 remote computes.
>>>>> Pe 5 has 57 local and 56 remote patches and 1150 local and 389 remote computes.
>>>>> Pe 2 has 57 local and 57 remote patches and 1052 local and 487 remote computes.
>>>>> Pe 3 has 58 local and 57 remote patches and 1079 local and 487 remote computes.
>>>>> Pe 0 has 57 local and 57 remote patches and 1144 local and 395 remote computes.
>>>>>
>>>>> Gianluca
>>>>>
>>>>>       Gianluca
>>>>>
>>>>>       On Thu, 12 Jul 2012, Aron Broom wrote:
>>>>>
>>>>>             have you tried the multicore build?  I wonder if the
>>>>>             prebuilt smp one is just not working for you.
>>>>>
>>>>>             On Thu, Jul 12, 2012 at 3:21 PM, Gianluca Interlandi
>>>>>             <gianluca_at_u.washington.edu> wrote:
>>>>>                         are other people also using those GPUs?
>>>>>
>>>>>             I don't think so since I reserved the entire node.
>>>>>
>>>>>                   What are the benchmark timings that you are given
>>>>>                   after ~1000 steps?
>>>>>
>>>>>             The benchmark time with 6 processes is 101 sec for 1000
>>>>>             steps. This is only slightly faster than Trestles, where I
>>>>>             get 109 sec for 1000 steps running on 16 CPUs. So, yes, 6
>>>>>             GPUs on Forge are much faster than 6 cores on Trestles, but
>>>>>             in terms of SUs it makes no difference, since on Forge I
>>>>>             still have to reserve the entire node (16 cores).
>>>>>
>>>>>             Gianluca
>>>>>
>>>>>                   is some setup time.
>>>>>
>>>>>                   I often run a system of ~100,000 atoms, and I
>>>>>                   generally see an order of magnitude improvement in
>>>>>                   speed compared to the same number of cores without
>>>>>                   the GPUs.  I would test the non-CUDA precompiled code
>>>>>                   on your Forge system and see how that compares; it
>>>>>                   might be the fault of something other than CUDA.
>>>>>
>>>>>                   ~Aron
>>>>>
>>>>>                   On Thu, Jul 12, 2012 at 2:41 PM, Gianluca Interlandi
>>>>>                   <gianluca_at_u.washington.edu> wrote:
>>>>>                         Hi Aron,
>>>>>
>>>>>                         Thanks for the explanations. I don't know
>>>>>                         whether I'm doing everything right. I don't see
>>>>>                         any speed advantage running on the CUDA cluster
>>>>>                         (Forge) versus running on a non-CUDA cluster.
>>>>>
>>>>>                         I did the following benchmarks on Forge (the
>>>>>                         system has 127,000 atoms and ran for 1000
>>>>>                         steps):
>>>>>
>>>>>                         np 1:  506 sec
>>>>>                         np 2:  281 sec
>>>>>                         np 4:  163 sec
>>>>>                         np 6:  136 sec
>>>>>                         np 12: 218 sec
>>>>>
>>>>>                         On the other hand, running the same system on
>>>>>                         16 cores of Trestles (AMD Magny-Cours) takes
>>>>>                         129 sec. It seems that I'm not really making
>>>>>                         good use of SUs by running on the CUDA cluster.
>>>>>                         Or maybe I'm doing something wrong? I'm using
>>>>>                         the ibverbs-smp-CUDA pre-compiled version of
>>>>>                         NAMD 2.9.
>>>>>
>>>>>                         Thanks,
>>>>>
>>>>>                              Gianluca
>>>>>
>>>>>                         On Tue, 10 Jul 2012, Aron Broom wrote:
>>>>>
>>>>>                               if it is truly just one node, you can use
>>>>>                               the multicore-CUDA version and avoid the
>>>>>                               MPI charmrun stuff.  Still, it boils down
>>>>>                               to much the same thing I think.  If you
>>>>>                               do what you've done below, you are
>>>>>                               running one job with 12 CPU cores and all
>>>>>                               GPUs.  If you don't specify the +devices,
>>>>>                               NAMD will automatically find the
>>>>>                               available GPUs, so I think the main
>>>>>                               benefit of specifying them is when you
>>>>>                               are running more than one job and don't
>>>>>                               want the jobs sharing GPUs.
>>>>>
>>>>>                               I'm not sure you'll see great scaling
>>>>>                               across 6 GPUs for a single job, but that
>>>>>                               would be great if you did.
>>>>>
>>>>>                               ~Aron
>>>>>
>>>>>                               On Tue, Jul 10, 2012 at 1:14 PM, Gianluca
>>>>>                               Interlandi <gianluca_at_u.washington.edu>
>>>>>                               wrote:
>>>>>                                     Hi,
>>>>>
>>>>>                                     I have a question concerning
>>>>>                                     running NAMD on a CUDA cluster.
>>>>>
>>>>>                                     NCSA Forge has for example 6 CUDA
>>>>>                                     devices and 16 CPU cores per node.
>>>>>                                     If I want to use all 6 CUDA devices
>>>>>                                     in a node, how many processes is it
>>>>>                                     recommended to spawn? Do I need to
>>>>>                                     specify "+devices"?
>>>>>
>>>>>                                     So, if for example I want to spawn
>>>>>                                     12 processes, do I need to specify:
>>>>>
>>>>>                                     charmrun +p12 -machinefile
>>>>>                                     $PBS_NODEFILE +devices 0,1,2,3,4,5
>>>>>                                     namd2 +idlepoll
>>>>>
>>>>>                                     Thanks,
>>>>>
>>>>>                                          Gianluca
>>>>>

-----------------------------------------------------
Gianluca Interlandi, PhD gianluca_at_u.washington.edu
                     +1 (206) 685 4435
                     http://artemide.bioeng.washington.edu/

Research Scientist at the Department of Bioengineering
at the University of Washington, Seattle WA U.S.A.
-----------------------------------------------------

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:21:47 CST