Re: Running NAMD on Forge (CUDA)

From: Gianluca Interlandi (gianluca_at_u.washington.edu)
Date: Thu Jul 12 2012 - 17:25:52 CDT

Yes, I was totally surprised, too. I also ran a non-CUDA job on Forge
using 16 CPUs. I got 0.122076 s/step, which is about 20% slower than using
the 6 GPUs (0.10167 s/step) and a bit slower than the 0.10932 s/step that I
get on Trestles using 16 cores. The difference from Trestles might just be
statistical fluctuation (or configuration setup), though, since Forge and
Trestles have exactly the same CPUs, i.e., eight-core 2.4 GHz Magny-Cours.
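
(For reference, converting these timings with the 2 fs timestep used here:

     86400 s/day / 0.10167 s/step x 2 fs/step   ~ 1.7 ns/day  (6 GPUs)
     86400 s/day / 0.122076 s/step x 2 fs/step  ~ 1.4 ns/day  (16 CPUs)

which matches the ~1.7 ns/day you quote below.)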

Yes, Forge also uses NVIDIA M2070.

I keep thinking of a guy here in Seattle who works for NVIDIA downtown;
a few years ago he asked me: "How come you don't use CUDA?" Maybe the
code still needs some optimization, and CPU manufacturers have been doing
everything they can to catch up.

Gianluca

On Thu, 12 Jul 2012, Aron Broom wrote:

> So your speed for 1 or 2 GPUs (based on what you sent) is about 1.7 ns/day, which
> seems decent given the system size.  I was getting 2.0 and 2.6 ns/day for a 100k atom
> system with roughly those same parameters (and also 6 CPU cores), so given a scaling
> of ~n log n, I would expect to see ~1.5 to 2.0 ns/day for you.  So in my mind, the
> speed you are getting with the GPUs isn't so surprising; it's that you get such a
> good speed with only the CPUs that shocks me.  In my case I didn't see speeds
> matching my 1 GPU until 48 CPU cores alone.  Seems like those Magny-Cours are pretty
> awesome.
>
> Which GPUs are you using?  I was using mainly the M2070s.
>
> Also, one thing that might be useful, if you are able to get roughly the same speed
> with 6 cores and 2 GPUs as you get with 16 cores alone, is to test running 3 jobs at
> once, with 5 cores and 2 GPUs assigned to each, and see how much slowdown there is.
> You might be able to benefit from various replica techniques more than just hitting a
> single job with more power.
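
For what it's worth, a rough sketch of what that three-job split could look
like on one Forge node, assuming the multicore-CUDA build mentioned further
down in the thread (so no charmrun or machinefile) and made-up config file
names; each job gets its own pair of devices so nothing is shared:

     namd2 +p5 +idlepoll +devices 0,1 job1.conf > job1.log &
     namd2 +p5 +idlepoll +devices 2,3 job2.conf > job2.log &
     namd2 +p5 +idlepoll +devices 4,5 job3.conf > job3.log &
     wait

That would use 15 of the 16 cores and all 6 GPUs on the node.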
>
> Still, the overall conclusion from what you've got seems to be that it makes more
> sense to go with more of those CPUs rather than putting GPUs in there.
>
> ~Aron
>
> On Thu, Jul 12, 2012 at 4:58 PM, Gianluca Interlandi <gianluca_at_u.washington.edu>
> wrote:
>       What are your simulation parameters:
>
>       timestep (and also any multistepping values)
>
> 2 fs, SHAKE, no multistepping
>
>       cutoff (and also the pairlist and PME grid spacing)
>
> 8-10-12, PME grid spacing ~ 1 A
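
Spelled out as NAMD config keywords (assuming the usual reading of 8-10-12
as switchdist/cutoff/pairlistdist), that setup would look roughly like
this; the grid-spacing line is just one way of asking for ~1 A:

     timestep            2.0    ;# 2 fs
     rigidBonds          all    ;# SHAKE/RATTLE on bonds to hydrogen
     nonbondedFreq       1      ;# no multiple time stepping
     fullElectFrequency  1
     switching           on
     switchdist          8.0
     cutoff              10.0
     pairlistdist        12.0
     PME                 yes
     PMEGridSpacing      1.0    ;# ~1 A grid spacing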
>
>       Have you tried giving it just 1 or 2 GPUs alone (using the
>       +devices)?
>
>
> Yes, this is the benchmark time:
>
> np 1:  0.48615 s/step
> np 2:  0.26105 s/step
> np 4:  0.14542 s/step
> np 6:  0.10167 s/step
>
> I also post part of the log from the run on 6 devices here (in case it helps
> pinpoint the problem):
>
> Pe 4 has 57 local and 64 remote patches and 1066 local and 473 remote computes.
> Pe 1 has 57 local and 65 remote patches and 1057 local and 482 remote computes.
> Pe 5 has 57 local and 56 remote patches and 1150 local and 389 remote computes.
> Pe 2 has 57 local and 57 remote patches and 1052 local and 487 remote computes.
> Pe 3 has 58 local and 57 remote patches and 1079 local and 487 remote computes.
> Pe 0 has 57 local and 57 remote patches and 1144 local and 395 remote computes.
>
> Gianluca
>
>       Gianluca
>
>       On Thu, 12 Jul 2012, Aron Broom wrote:
>
>             have you tried the multicore build?  I wonder if the
>             prebuilt smp one is just not working for you.
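
For reference, the single-node multicore-CUDA launch would be something
along these lines (config file name made up; no charmrun or machinefile
involved):

     namd2 +p6 +idlepoll +devices 0,1,2,3,4,5 mysystem.conf > mysystem.log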
>
>             On Thu, Jul 12, 2012 at 3:21 PM, Gianluca Interlandi
>             <gianluca_at_u.washington.edu> wrote:
>
>                   are other people also using those GPUs?
>
>             I don't think so since I reserved the entire node.
>
>                   What are the benchmark timings that you are given
>                   after ~1000 steps?
>
>             The benchmark time with 6 processes is 101 sec for 1000
>             steps. This is only slightly faster than Trestles, where I
>             get 109 sec for 1000 steps running on 16 CPUs. So, yes,
>             6 GPUs on Forge are much faster than 6 cores on Trestles,
>             but in terms of SUs it makes no difference, since on Forge
>             I still have to reserve the entire node (16 cores).
>
>             Gianluca
>
>                   is some setup time.
>
>                   I often run a system of ~100,000 atoms, and I
>                   generally see an order of magnitude improvement in
>                   speed compared to the same number of cores without
>                   the GPUs.  I would test the non-CUDA precompiled
>                   code on your Forge system and see how that compares;
>                   it might be the fault of something other than CUDA.
>
>                   ~Aron
>
>                   On Thu, Jul 12, 2012 at 2:41 PM, Gianluca Interlandi
>                   <gianluca_at_u.washington.edu> wrote:
>
>                         Hi Aron,
>
>                         Thanks for the explanations. I don't know
>                         whether I'm doing everything right. I don't
>                         see any speed advantage running on the CUDA
>                         cluster (Forge) versus running on a non-CUDA
>                         cluster.
>
>                         I did the following benchmarks on Forge (the
>                         system has 127,000 atoms and ran for 1000
>                         steps):
>
>                         np 1:  506 sec
>                         np 2:  281 sec
>                         np 4:  163 sec
>                         np 6:  136 sec
>                         np 12: 218 sec
>
>                         On the other hand, running the same system on
>                         16 cores of Trestles (AMD Magny-Cours) takes
>                         129 sec. It seems that I'm not really making
>                         good use of SUs by running on the CUDA cluster.
>                         Or, maybe I'm doing something wrong? I'm using
>                         the ibverbs-smp-CUDA pre-compiled version of
>                         NAMD 2.9.
>
>                         Thanks,
>
>                              Gianluca
>
>                         On Tue, 10 Jul 2012, Aron Broom wrote:
>
>                               if it is truly just one node, you can use
>                               the multicore-CUDA version and avoid the
>                               MPI charmrun stuff.  Still, it boils down
>                               to much the same thing I think.  If you
>                               do what you've done below, you are running
>                               one job with 12 CPU cores and all GPUs.
>                               If you don't specify the +devices, NAMD
>                               will automatically find the available
>                               GPUs, so I think the main benefit of
>                               specifying them is when you are running
>                               more than one job and don't want the jobs
>                               sharing GPUs.
>
>                               I'm not sure you'll see great scaling
>                               across 6 GPUs for a single job, but that
>                               would be great if you did.
>
>                               ~Aron
>
>                               On Tue, Jul 10, 2012 at 1:14 PM, Gianluca
>                               Interlandi <gianluca_at_u.washington.edu>
>                               wrote:
>
>                                     Hi,
>
>                                     I have a question concerning running
>                                     NAMD on a CUDA cluster.
>
>                                     NCSA Forge has for example 6 CUDA
>                                     devices and 16 CPU cores per node.
>                                     If I want to use all 6 CUDA devices
>                                     in a node, how many processes is it
>                                     recommended to spawn? Do I need to
>                                     specify "+devices"?
>
>                                     So, if for example I want to spawn
>                                     12 processes, do I need to specify:
>
>                                     charmrun +p12 -machinefile $PBS_NODEFILE +devices 0,1,2,3,4,5 namd2 +idlepoll
>
>                                     Thanks,
>
>                                          Gianluca
>
> --
> Aron Broom M.Sc
> PhD Student
> Department of Chemistry
> University of Waterloo
>
>
>

-----------------------------------------------------
Gianluca Interlandi, PhD gianluca_at_u.washington.edu
                     +1 (206) 685 4435
                     http://artemide.bioeng.washington.edu/

Research Scientist at the Department of Bioengineering
at the University of Washington, Seattle WA U.S.A.
-----------------------------------------------------
