From: Gianluca Interlandi (gianluca_at_u.washington.edu)
Date: Fri Jul 13 2012 - 14:08:57 CDT
Hi Norman,
>>> What value do you use for fullElectFrequency?
> Ok, then it's 1 I guess.
Do you recommend setting 'stepspercycle' to 12? Currently I have it set to
10, but 'fullElectFrequency' needs to be a factor of 'stepspercycle', and 4
does not divide 10.
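
To make the constraint concrete, here is a minimal Python sketch that lists
which fullElectFrequency values fit a given stepspercycle. The helper name
valid_full_elect_frequencies is made up for illustration and is not part of
NAMD; it only encodes the "must be a factor" rule mentioned above.

```python
def valid_full_elect_frequencies(steps_per_cycle):
    """Return every value that divides steps_per_cycle evenly.

    Assumption: the only rule checked here is the one from this thread,
    i.e. fullElectFrequency has to be a factor of stepspercycle.
    """
    return [f for f in range(1, steps_per_cycle + 1)
            if steps_per_cycle % f == 0]


# Compare the current setting (10) with the candidate setting (12).
for spc in (10, 12):
    print(spc, valid_full_elect_frequencies(spc))
```

With stepspercycle 10 the value 4 is not in the list, whereas with 12 it is,
which is why I am asking about 12.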
Thanks,
Gianluca
>> -----Original Message-----
>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
>> Behalf Of Gianluca Interlandi
>> Sent: Friday, 13 July 2012 08:36
>> To: Norman Geist
>> Cc: Namd Mailing List
>> Subject: Re: AW: namd-l: Running NAMD on Forge (CUDA)
>>
>> Hi Norman,
>
> Hi,
>
>>
>>> What value do you use for fullElectFrequency?
>>
>> The default. I haven't set it.
>
ons as the
> electrostatic is done on the CPU. This causes a lot of traffic between CPU
> and GPU and saturates the PCI-E bus. Additionally, I could imagine that 6
> GPUs also need a lot of PCI-E bandwidth, so it's likely that the performance
> of the GPUs is not as expected. You should try setting fullElectFrequency to
> at least 4 and try out the new molly parameter. This should cause less
> traffic on PCI-E and improve GPU utilization, but it slightly harms energy
> conservation, which shows up as a slightly increasing temperature. With the
> molly parameter it should be OK, I think.
>
>>
>>> How many GPUs are there per node in this cluster?
>>
>> 6
>>
>>> What kind of interconnect?
>>
>> Infiniband.
>
> If you are running over multiple nodes, please make sure that you actually
> use the InfiniBand interconnect. For that you need either an ibverbs binary
> of NAMD or IPoIB installed. You can tell that IPoIB is working if, for
> example, an ib0 interface shows up when you run ifconfig. Also, as I
> observed, IPoIB should be configured in connected mode with an MTU of about
> 65520 (cat /sys/class/net/ib0/mode or mtu to see the current settings).
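
The check Norman describes can be scripted. Below is a rough Python sketch
that reads the same two sysfs files he mentions. The function names and the
"connected with MTU of at least 65520" test are my own assumptions for
illustration, not an official NAMD or OFED tool.

```python
from pathlib import Path


def ipoib_settings(iface="ib0", sysfs_root="/sys/class/net"):
    """Return (mode, mtu) for an IPoIB interface, or None if it is absent.

    sysfs_root is a parameter only so the function can be pointed at a
    different directory; on a real node the default is the one to use.
    """
    base = Path(sysfs_root) / iface
    mode_file = base / "mode"
    mtu_file = base / "mtu"
    if not (mode_file.is_file() and mtu_file.is_file()):
        return None
    return mode_file.read_text().strip(), int(mtu_file.read_text().strip())


def looks_tuned(settings, wanted_mode="connected", wanted_mtu=65520):
    """Assumed rule of thumb from this thread: connected mode, MTU ~65520."""
    if settings is None:
        return False
    mode, mtu = settings
    return mode == wanted_mode and mtu >= wanted_mtu


if __name__ == "__main__":
    found = ipoib_settings()
    if found is None:
        print("no ib0 IPoIB interface found")
    else:
        print("mode=%s mtu=%d tuned=%s" % (found[0], found[1], looks_tuned(found)))
```

If the interface is missing, the ibverbs build of NAMD is the other option
mentioned above.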
>
>>
>> Here are all specs:
>>
>> http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/DellNVIDIACluster/TechSummary/index.html
>>
>> Thanks,
>>
>>       Gianluca
>>
>>> Norman Geist.
>>>
>>>> -----Original Message-----
>>>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
>>>> Behalf Of Gianluca Interlandi
>>>> Sent: Friday, 13 July 2012 00:26
>>>> To: Aron Broom
>>>> Cc: NAMD list
>>>> Subject: Re: namd-l: Running NAMD on Forge (CUDA)
>>>>
>>>> Yes, I was totally surprised, too. I also ran a non-CUDA job on Forge
>>>> using 16 CPUs. I got 0.122076 s/step, which is 16% slower than using the
>>>> 6 GPUs (0.1 s/step) and a bit slower than 0.10932 s/step that I get on
>>>> Trestles using 16 cores. This difference might be statistical
>>>> fluctuations though (or configuration setup) since Forge and Trestles
>>>> have the exact same CPU, i.e., eight-core 2.4 GHz Magny-Cours.
>>>>
>>>> Yes, Forge also uses NVIDIA M2070.
>>>>
>>>> I keep thinking of this guy here in Seattle who works for NVIDIA downtown
>>>> and a few years ago he asked me: "How come you don't use CUDA?" Maybe the
>>>> code still needs some optimization, and CPU manufacturers have been doing
>>>> everything to catch up.
>>>>
>>>> Gianluca
>>>>
>>>> On Thu, 12 Jul 2012, Aron Broom wrote:
>>>>
>>>>> So your speed for 1 or 2 GPUs (based on what you sent) is about 1.7
>>>>> ns/day, which seems decent given the system size.  I was getting 2.0 and
>>>>> 2.6 ns/day for a 100k atom system with roughly those same parameters (and
>>>>> also 6 CPU cores), so given a scaling of ~n log n, I would expect to see
>>>>> ~1.5 to 2.0 ns/day for you.  So in my mind, the speed you are getting
>>>>> with the GPUs isn't so surprising; it's that you get such a good speed
>>>>> with only the CPUs that shocks me.  In my case I didn't see speeds
>>>>> matching my 1 GPU until 48 CPU cores alone.  Seems like those Magny-Cours
>>>>> are pretty awesome.
>>>>>
>>>>> Which GPUs are you using?  I was using mainly the M2070s.
>>>>>
>>>>> Also, one thing that might be useful, if you are able to get roughly the
>>>>> same speed with 6 cores and 2 GPUs as you get with 16 cores alone, is to
>>>>> test running 3 jobs at once, with 5 cores and 2 GPUs assigned to each,
>>>>> and see how much slowdown there is.  You might be able to benefit from
>>>>> various replica techniques more than just hitting a single job with more
>>>>> power.
>>>>>
>>>>> Still, the overall conclusion from what you've got seems to be that it
>>>>> makes more sense to go with more of those CPUs rather than putting GPUs
>>>>> in there.
>>>>>
>>>>> ~Aron
>>>>>
>>>>> On Thu, Jul 12, 2012 at 4:58 PM, Gianluca Interlandi
>>>>> <gianluca_at_u.washington.edu> wrote:
>>>>>
>>>>>       What are your simulation parameters:
>>>>>
>>>>>       timestep (and also any multistepping values)
>>>>>
>>>>> 2 fs, SHAKE, no multistepping
>>>>>
>>>>>       cutoff (and also the pairlist and PME grid spacing)
>>>>>
>>>>> 8-10-12  PME grid spacing ~ 1 A
>>>>>
>>>>>       Have you tried giving it just 1 or 2 GPUs alone (using the
>>>>>       +devices)?
>>>>>
>>>>>
>>>>> Yes, this is the benchmark time:
>>>>>
>>>>> np 1:  0.48615 s/step
>>>>> np 2:  0.26105 s/step
>>>>> np 4:  0.14542 s/step
>>>>> np 6:  0.10167 s/step
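
For anyone who wants to relate these s/step numbers to the ns/day figures
quoted elsewhere in the thread, the following Python sketch does the
conversion and also derives speedup and parallel efficiency relative to np 1.
It assumes the 2 fs timestep stated above; the function names are invented
for this example.

```python
TIMESTEP_FS = 2.0          # from the thread: 2 fs with SHAKE
SECONDS_PER_DAY = 86400.0

# Benchmark times in s/step, copied from the table above (key = np).
benchmarks = {1: 0.48615, 2: 0.26105, 4: 0.14542, 6: 0.10167}


def ns_per_day(sec_per_step, timestep_fs=TIMESTEP_FS):
    """Convert wall-clock seconds per step into simulated ns per day."""
    steps_per_day = SECONDS_PER_DAY / sec_per_step
    return steps_per_day * timestep_fs * 1e-6   # fs -> ns


def scaling_table(bench):
    """Return (np, ns/day, speedup vs np 1, efficiency) for each entry."""
    base = bench[1]
    rows = []
    for n in sorted(bench):
        speedup = base / bench[n]
        rows.append((n, ns_per_day(bench[n]), speedup, speedup / n))
    return rows


for n, nsday, speedup, eff in scaling_table(benchmarks):
    print("np %d: %.2f ns/day, speedup %.2fx, efficiency %.0f%%"
          % (n, nsday, speedup, eff * 100))
```

For np 6 this works out to roughly 1.7 ns/day and a speedup of about 4.8
over a single process.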
>>>>>
>>>>> I post here also part of the log running on 6 devices (in case it is
>>>>> helpful to localize the problem):
>>>>>
>>>>> Pe 4 has 57 local and 64 remote patches and 1066 local and 473 remote computes.
>>>>> Pe 1 has 57 local and 65 remote patches and 1057 local and 482 remote computes.
>>>>> Pe 5 has 57 local and 56 remote patches and 1150 local and 389 remote computes.
>>>>> Pe 2 has 57 local and 57 remote patches and 1052 local and 487 remote computes.
>>>>> Pe 3 has 58 local and 57 remote patches and 1079 local and 487 remote computes.
>>>>> Pe 0 has 57 local and 57 remote patches and 1144 local and 395 remote computes.
>>>>>
>>>>> Gianluca
>>>>>
>>>>>             Gianluca
>>>>>
>>>>>             On Thu, 12 Jul 2012, Aron Broom wrote:
>>>>>
>>>>>                   have you tried the multicore build?  I wonder if the
>>>>>                   prebuilt smp one is just not working for you.
>>>>>
>>>>>                   On Thu, Jul 12, 2012 at 3:21 PM, Gianluca Interlandi
>>>>>                   <gianluca_at_u.washington.edu> wrote:
>>>>>
>>>>>                               are other people also using those GPUs?
>>>>>
>>>>>                   I don't think so since I reserved the entire node.
>>>>>
>>>>>                         What are the benchmark timings that you are given
>>>>>                         after ~1000 steps?
>>>>>
>>>>>                   The benchmark time with 6 processes is 101 sec for 1000
>>>>>                   steps. This is only slightly faster than Trestles where
>>>>>                   I get 109 sec for 1000 steps running on 16 CPUs. So, yes
>>>>>                   6 GPUs on Forge are much faster than 6 cores on
>>>>>                   Trestles, but in terms of SUs it makes no difference,
>>>>>                   since on Forge I still have to reserve the entire node
>>>>>                   (16 cores).
>>>>>
>>>>>                   Gianluca
>>>>>
>>>>>                         is some setup time.
>>>>>
>>>>>                         I often run a system of ~100,000 atoms, and I
>>>>>                         generally see an order of magnitude improvement in
>>>>>                         speed compared to the same number of cores without
>>>>>                         the GPUs.  I would test the non-CUDA precompiled
>>>>>                         code on your Forge system and see how that
>>>>>                         compares; it might be the fault of something other
>>>>>                         than CUDA.
>>>>>
>>>>>                         ~Aron
>>>>>
>>>>>                         On Thu, Jul 12, 2012 at 2:41 PM, Gianluca Interlandi
>>>>>                         <gianluca_at_u.washington.edu> wrote:
>>>>>                               Hi Aron,
>>>>>
>>>>>                               Thanks for the explanations. I don't know
>>>>>                               whether I'm doing everything right. I don't
>>>>>                               see any speed advantage running on the CUDA
>>>>>                               cluster (Forge) versus running on a non-CUDA
>>>>>                               cluster.
>>>>>
>>>>>                               I did the following benchmarks on Forge (the
>>>>>                               system has 127,000 atoms and ran for 1000
>>>>>                               steps):
>>>>>
>>>>>                               np 1:  506 sec
>>>>>                               np 2:  281 sec
>>>>>                               np 4:  163 sec
>>>>>                               np 6:  136 sec
>>>>>                               np 12: 218 sec
>>>>>
>>>>>                               On the other hand, running the same system on
>>>>>                               16 cores of Trestles (AMD Magny Cours) takes
>>>>>                               129 sec. It seems that I'm not really making
>>>>>                               good use of SUs by running on the CUDA
>>>>>                               cluster. Or, maybe I'm doing something wrong?
>>>>>                               I'm using the ibverbs-smp-CUDA pre-compiled
>>>>>                               version of NAMD 2.9.
>>>>>
>>>>>                               Thanks,
>>>>>
>>>>>                                    Gianluca
>>>>>
>>>>>                               On Tue, 10 Jul 2012, Aron Broom wrote:
>>>>>
>>>>>                                     if it is truly just one node, you can use
>>>>>                                     the multicore-CUDA version and avoid the
>>>>>                                     MPI charmrun stuff.  Still, it boils down
>>>>>                                     to much the same thing I think.  If you do
>>>>>                                     what you've done below, you are running
>>>>>                                     one job with 12 CPU cores and all GPUs.
>>>>>                                     If you don't specify the +devices, NAMD
>>>>>                                     will automatically find the available
>>>>>                                     GPUs, so I think the main benefit of
>>>>>                                     specifying them is when you are running
>>>>>                                     more than one job and don't want the jobs
>>>>>                                     sharing GPUs.
>>>>>
>>>>>                                     I'm not sure you'll see great scaling
>>>>>                                     across 6 GPUs for a single job, but that
>>>>>                                     would be great if you did.
>>>>>
>>>>>                                     ~Aron
>>>>>
>>>>>                                     On Tue, Jul 10, 2012 at 1:14 PM, Gianluca Interlandi
>>>>>                                     <gianluca_at_u.washington.edu> wrote:
>>>>>                                           Hi,
>>>>>
>>>>>                                           I have a question concerning running
>>>>>                                           NAMD on a CUDA cluster.
>>>>>
>>>>>                                           NCSA Forge has for example 6 CUDA
>>>>>                                           devices and 16 CPU cores per node. If I
>>>>>                                           want to use all 6 CUDA devices in a
>>>>>                                           node, how many processes is it
>>>>>                                           recommended to spawn? Do I need to
>>>>>                                           specify "+devices"?
>>>>>
>>>>>                                           So, if for example I want to spawn 12
>>>>>                                           processes, do I need to specify:
>>>>>
>>>>>                                           charmrun +p12 -machinefile $PBS_NODEFILE +devices 0,1,2,3,4,5 namd2 +idlepoll
>>>>>
>>>>>                                           Thanks,
>>>>>
>>>>>                                                Gianluca
>>>>>
>>>>>
>>>>>                                           -----------------------------------------------------
>>>>>                                           Gianluca Interlandi, PhD gianluca_at_u.washington.edu
>>>>>                                                               +1 (206) 685 4435
>>>>>                                                               http://artemide.bioeng.washington.edu/
>>>>>
>>>>>                                           Research Scientist at the Department of Bioengineering
>>>>>                                           at the University of Washington, Seattle WA U.S.A.
>>>>>                                           -----------------------------------------------------
>>>>>
>>>>>                                     --
>>>>>                                     Aron Broom M.Sc
>>>>>                                     PhD Student
>>>>>                                     Department of Chemistry
>>>>>                                     University of Waterloo
-----------------------------------------------------
Gianluca Interlandi, PhD gianluca_at_u.washington.edu
                     +1 (206) 685 4435
                     http://artemide.bioeng.washington.edu/
Research Scientist at the Department of Bioengineering
at the University of Washington, Seattle WA U.S.A.
-----------------------------------------------------
This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:21:47 CST