Re: AW: Running NAMD on Forge (CUDA)

From: Gianluca Interlandi (gianluca_at_u.washington.edu)
Date: Fri Jul 13 2012 - 01:35:49 CDT

Hi Norman,

> What value do you use for fullElectFrequency?

The default. I haven't set it.
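If I were to set it, it would go together with the other multiple
timestepping parameters in the NAMD config file, something like this
(the values below are only an illustration, not what I am running):

      timestep            2.0   ;# 2 fs
      nonbondedFreq       1     ;# short-range nonbonded every step
      fullElectFrequency  2     ;# full electrostatics every other step
      stepspercycle       10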

> How many GPUs are there per node in this cluster?

6

> What kind of interconnect?

Infiniband.

Here are all specs:

http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/DellNVIDIACluster/TechSummary/index.html

Thanks,

      Gianluca

> Norman Geist.
>
>> -----Original Message-----
>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
>> Behalf Of Gianluca Interlandi
>> Sent: Friday, July 13, 2012 00:26
>> To: Aron Broom
>> Cc: NAMD list
>> Subject: Re: namd-l: Running NAMD on Forge (CUDA)
>>
>> Yes, I was totally surprised, too. I also ran a non-CUDA job on Forge
>> using 16 CPUs. I got 0.122076 s/step, which is 16% slower than using
>> the 6 GPUs (0.1 s/step) and a bit slower than the 0.10932 s/step that
>> I get on Trestles using 16 cores. This difference might be statistical
>> fluctuations though (or configuration setup), since Forge and Trestles
>> have the exact same CPU, i.e., the eight-core 2.4 GHz Magny-Cours.
>>
>> Yes, Forge also uses NVIDIA M2070.
>> I keep thinking of this guy here in Seattle who works for NVIDIA
>> downtown; a few years ago he asked me: "How come you don't use CUDA?"
>> Maybe the code still needs some optimization, and CPU manufacturers
>> have been doing everything to catch up.
>>
>> Gianluca
>>
>> On Thu, 12 Jul 2012, Aron Broom wrote:
>>
>>> So your speed for 1 or 2 GPUs (based on what you sent) is about 1.7
>>> ns/day, which seems decent given the system size.  I was getting 2.0
>>> and 2.6 ns/day for a 100k atom system with roughly those same
>>> parameters (and also 6 CPU cores), so given a scaling of ~n log n, I
>>> would expect to see ~1.5 to 2.0 ns/day for you.  So in my mind, the
>>> speed you are getting with the GPUs isn't so surprising; it's that
>>> you get such a good speed with only the CPUs that shocks me.  In my
>>> case I didn't see speeds matching my 1 GPU until 48 CPU cores alone.
>>> Seems like those Magny Cours are pretty awesome.
>>>
>>> Which GPUs are you using?  I was using mainly the M2070s.
>>>
>>> Also, one thing that might be useful, if you are able to get roughly
>>> the same speed with 6 cores and 2 GPUs as you get with 16 cores
>>> alone, is to test running 3 jobs at once, with 5 cores and 2 GPUs
>>> assigned to each, and see how much slowdown there is.  You might be
>>> able to benefit from various replica techniques more than just
>>> hitting a single job with more power.
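>>>
>>> Something along these lines, assuming the multicore-CUDA build (the
>>> config and log file names are just placeholders):
>>>
>>>    namd2 +p5 +idlepoll +devices 0,1 run1.conf > run1.log &
>>>    namd2 +p5 +idlepoll +devices 2,3 run2.conf > run2.log &
>>>    namd2 +p5 +idlepoll +devices 4,5 run3.conf > run3.log &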
>>>
>>> Still, the overall conclusion from what you've got seems to be that
>>> it makes more sense to go with more of those CPUs rather than
>>> putting GPUs in there.
>>>
>>> ~Aron
>>>
>>> On Thu, Jul 12, 2012 at 4:58 PM, Gianluca Interlandi
>>> <gianluca_at_u.washington.edu> wrote:
>>> What are your simulation parameters:
>>>
>>> timestep (and also any multistepping values)
>>>
>>> 2 fs, SHAKE, no multistepping
>>>
>>> cutoff (and also the pairlist and PME grid spacing)
>>>
>>> 8-10-12; PME grid spacing ~1 Å
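>>>
>>> In config-file terms that corresponds to roughly the following (a
>>> sketch; taking 8-10-12 to mean switchdist/cutoff/pairlistdist, with
>>> the exact keyword values only illustrative):
>>>
>>>    timestep         2.0   ;# 2 fs
>>>    rigidBonds       all   ;# SHAKE/RATTLE on bonds to hydrogen
>>>    switching        on
>>>    switchdist       8.0
>>>    cutoff           10.0
>>>    pairlistdist     12.0
>>>    PME              yes
>>>    PMEGridSpacing   1.0   ;# ~1 Å grid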
>>>
>>> Have you tried giving it just 1 or 2 GPUs alone (using the
>>> +devices)?
>>>
>>>
>>> Yes, this is the benchmark time:
>>>
>>> np 1:  0.48615 s/step
>>> np 2:  0.26105 s/step
>>> np 4:  0.14542 s/step
>>> np 6:  0.10167 s/step
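>>>
>>> (At the 2 fs timestep, 0.10167 s/step works out to about
>>> 86400 / 0.10167 ~ 850,000 steps/day, i.e. ~1.7 ns/day.)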
>>>
>>> I post here also part of the log from running on 6 devices (in case
>>> it helps localize the problem):
>>>
>>> Pe 4 has 57 local and 64 remote patches and 1066 local and 473 remote computes.
>>> Pe 1 has 57 local and 65 remote patches and 1057 local and 482 remote computes.
>>> Pe 5 has 57 local and 56 remote patches and 1150 local and 389 remote computes.
>>> Pe 2 has 57 local and 57 remote patches and 1052 local and 487 remote computes.
>>> Pe 3 has 58 local and 57 remote patches and 1079 local and 487 remote computes.
>>> Pe 0 has 57 local and 57 remote patches and 1144 local and 395 remote computes.
>>>
>>> Gianluca
>>>
>>>       Gianluca
>>>
>>>       On Thu, 12 Jul 2012, Aron Broom wrote:
>>>
>>>             have you tried the multicore build?  I wonder if the
>>>             prebuilt smp one is just not working for you.
>>>
>>>             On Thu, Jul 12, 2012 at 3:21 PM, Gianluca Interlandi
>>>             <gianluca_at_u.washington.edu> wrote:
>>>                         are other people also using those GPUs?
>>>
>>>
>>>             I don't think so since I reserved the entire node.
>>>
>>>                   What are the benchmark timings that you are given
>>>                   after ~1000 steps?
>>>
>>>
>>>             The benchmark time with 6 processes is 101 sec for 1000
>>>             steps. This is only slightly faster than Trestles, where
>>>             I get 109 sec for 1000 steps running on 16 CPUs. So, yes,
>>>             6 GPUs on Forge are much faster than 6 cores on Trestles,
>>>             but in terms of SUs it makes no difference, since on
>>>             Forge I still have to reserve the entire node (16 cores).
>>>
>>>             Gianluca
>>>
>>>                   is some setup time.
>>>
>>>                   I often run a system of ~100,000 atoms, and I
>>>                   generally see an order of magnitude improvement in
>>>                   speed compared to the same number of cores without
>>>                   the GPUs.  I would test the non-CUDA precompiled
>>>                   code on your Forge system and see how that
>>>                   compares; it might be the fault of something other
>>>                   than CUDA.
>>>
>>>                   ~Aron
>>>
>>>                   On Thu, Jul 12, 2012 at 2:41 PM, Gianluca Interlandi
>>>                   <gianluca_at_u.washington.edu> wrote:
>>>                         Hi Aron,
>>>
>>>                         Thanks for the explanations. I don't know
>>>                         whether I'm doing everything right. I don't
>>>                         see any speed advantage running on the CUDA
>>>                         cluster (Forge) versus running on a non-CUDA
>>>                         cluster.
>>>
>>>                         I did the following benchmarks on Forge (the
>>>                         system has 127,000 atoms and ran for 1000
>>>                         steps):
>>>
>>>                         np 1:  506 sec
>>>                         np 2:  281 sec
>>>                         np 4:  163 sec
>>>                         np 6:  136 sec
>>>                         np 12: 218 sec
>>>
>>>                         On the other hand, running the same system
>>>                         on 16 cores of Trestles (AMD Magny Cours)
>>>                         takes 129 sec. It seems that I'm not really
>>>                         making good use of SUs by running on the CUDA
>>>                         cluster. Or, maybe I'm doing something wrong?
>>>                         I'm using the ibverbs-smp-CUDA pre-compiled
>>>                         version of NAMD 2.9.
>>>
>>>                         Thanks,
>>>
>>>                              Gianluca
>>>
>>>                         On Tue, 10 Jul 2012, Aron Broom wrote:
>>>
>>>                               if it is truly just one node, you can
>>>                               use the multicore-CUDA version and
>>>                               avoid the MPI charmrun stuff.  Still,
>>>                               it boils down to much the same thing I
>>>                               think.  If you do what you've done
>>>                               below, you are running one job with 12
>>>                               CPU cores and all GPUs.  If you don't
>>>                               specify the +devices, NAMD will
>>>                               automatically find the available GPUs,
>>>                               so I think the main benefit of
>>>                               specifying them is when you are running
>>>                               more than one job and don't want the
>>>                               jobs sharing GPUs.
>>>
>>>                               I'm not sure you'll see great scaling
>>>                               across 6 GPUs for a single job, but
>>>                               that would be great if you did.
>>>
>>>                               ~Aron
>>>
>>>                               On Tue, Jul 10, 2012 at 1:14 PM,
>>>                               Gianluca Interlandi
>>>                               <gianluca_at_u.washington.edu> wrote:
>>>                                     Hi,
>>>
>>>                                     I have a question concerning
>>>                                     running NAMD on a CUDA cluster.
>>>
>>>                                     NCSA Forge has for example 6 CUDA
>>>                                     devices and 16 CPU cores per
>>>                                     node. If I want to use all 6 CUDA
>>>                                     devices in a node, how many
>>>                                     processes is it recommended to
>>>                                     spawn? Do I need to specify
>>>                                     "+devices"?
>>>
>>>                                     So, if for example I want to
>>>                                     spawn 12 processes, do I need to
>>>                                     specify:
>>>
>>>                                     charmrun +p12 -machinefile
>>>                                     $PBS_NODEFILE +devices
>>>                                     0,1,2,3,4,5 namd2 +idlepoll
>>>
>>>                                     Thanks,
>>>
>>>                                          Gianluca
>>>
>>>
>>>
>>>                                     -----------------------------------------------------
>>>                                     Gianluca Interlandi, PhD gianluca_at_u.washington.edu
>>>                                                         +1 (206) 685 4435
>>>                                     http://artemide.bioeng.washington.edu/
>>>
>>>                                     Research Scientist at the Department of Bioengineering
>>>                                     at the University of Washington, Seattle WA U.S.A.
>>>                                     -----------------------------------------------------
>>>
>>>
>>>
>>>
>>>                               --
>>>                               Aron Broom M.Sc
>>>                               PhD Student
>>>                               Department of Chemistry
>>>                               University of Waterloo
>>>
>>>
>>>
>>>
>>>
>>>                         -----------------------------------------------------
>>>                         Gianluca Interlandi, PhD gianluca_at_u.washington.edu
>>>                                             +1 (206) 685 4435
>>>                         http://artemide.bioeng.washington.edu/
>>>
>>>                         Research Scientist at the Department of Bioengineering
>>>                         at the University of Washington, Seattle WA U.S.A.
>>>                         -----------------------------------------------------
>>>
>>>
>>>
>>>
>>>                   --
>>>                   Aron Broom M.Sc
>>>                   PhD Student
>>>                   Department of Chemistry
>>>                   University of Waterloo
>>>
>>>
>>>
>>>
>>>             -----------------------------------------------------
>>>             Gianluca Interlandi, PhD gianluca_at_u.washington.edu
>>>                                 +1 (206) 685 4435
>>>             http://artemide.bioeng.washington.edu/
>>>
>>>             Research Scientist at the Department of Bioengineering
>>>             at the University of Washington, Seattle WA U.S.A.
>>>             -----------------------------------------------------
>>>
>>>
>>>
>>>
>>>             --
>>>             Aron Broom M.Sc
>>>             PhD Student
>>>             Department of Chemistry
>>>             University of Waterloo
>>>
>>>
>>>
>>>
>>>       -----------------------------------------------------
>>>       Gianluca Interlandi, PhD gianluca_at_u.washington.edu
>>>                           +1 (206) 685 4435
>>>                           http://artemide.bioeng.washington.edu/
>>>
>>>       Research Scientist at the Department of Bioengineering
>>>       at the University of Washington, Seattle WA U.S.A.
>>>       -----------------------------------------------------
>>>
>>>
>>>
>>>
>>> --
>>> Aron Broom M.Sc
>>> PhD Student
>>> Department of Chemistry
>>> University of Waterloo
>>>
>>>
>>>
>>>
>>> -----------------------------------------------------
>>> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
>>>                     +1 (206) 685 4435
>>>                     http://artemide.bioeng.washington.edu/
>>>
>>> Research Scientist at the Department of Bioengineering
>>> at the University of Washington, Seattle WA U.S.A.
>>> -----------------------------------------------------
>>>
>>>
>>>
>>>
>>> --
>>> Aron Broom M.Sc
>>> PhD Student
>>> Department of Chemistry
>>> University of Waterloo
>>>
>>>
>>>
>>
>> -----------------------------------------------------
>> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
>> +1 (206) 685 4435
>> http://artemide.bioeng.washington.edu/
>>
>> Research Scientist at the Department of Bioengineering
>> at the University of Washington, Seattle WA U.S.A.
>> -----------------------------------------------------
>
>

-----------------------------------------------------
Gianluca Interlandi, PhD gianluca_at_u.washington.edu
                     +1 (206) 685 4435
                     http://artemide.bioeng.washington.edu/

Research Scientist at the Department of Bioengineering
at the University of Washington, Seattle WA U.S.A.
-----------------------------------------------------

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:21:47 CST