From: Aron Broom (broomsday_at_gmail.com)
Date: Thu Jul 12 2012 - 16:12:39 CDT
So your speed for 1 or 2 GPUs (based on what you sent) is about 1.7
ns/day, which seems decent given the system size. I was getting 2.0 and
2.6 ns/day for a 100k-atom system with roughly those same parameters (and
also 6 CPU cores), so given a scaling of ~N log N, I would expect to see ~1.5
to 2.0 ns/day for you. So in my mind, the speed you are getting with the
GPUs isn't so surprising; it's that you get such good speed with only the
CPUs that shocks me. In my case I didn't see speeds matching my single GPU
until I used 48 CPU cores alone. Seems like those Magny Cours are pretty awesome.
Which GPUs are you using? I was using mainly the M2070s.
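(For reference, the conversion I'm using here: at a 2 fs timestep,
ns/day = (86400 s/day / (s/step)) * 2e-6 ns/step = 0.1728 / (s/step).
So, for example, the ~0.102 s/step benchmark you sent works out to about
1.7 ns/day, and 16 Trestles cores at ~0.109 s/step come to about 1.6 ns/day.)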
Also, one thing that might be useful, if you are able to get roughly the
same speed with 6 cores and 2 GPUs as you get with 16 cores alone, is to
test running 3 jobs at once, with 5 cores and 2 GPUs assigned to each, and
see how much slowdown there is. You might be able to benefit from various
replica techniques more than from just hitting a single job with more power.
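A minimal launch sketch for that kind of test, assuming the multicore-CUDA
build and with hypothetical config/log file names (run1.conf, run1.log, etc.):

  namd2 +p5 +devices 0,1 +idlepoll run1.conf > run1.log &   # job 1 on GPUs 0 and 1
  namd2 +p5 +devices 2,3 +idlepoll run2.conf > run2.log &   # job 2 on GPUs 2 and 3
  namd2 +p5 +devices 4,5 +idlepoll run3.conf > run3.log &   # job 3 on GPUs 4 and 5
  wait

Pinning each job to its own pair of GPUs with +devices keeps the three runs
from sharing devices.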
Still, the overall conclusion from what you've got seems to be that it
makes more sense to go with more of those CPUs rather than putting GPUs in
there.
~Aron
On Thu, Jul 12, 2012 at 4:58 PM, Gianluca Interlandi <gianluca_at_u.washington.edu> wrote:
>> What are your simulation parameters:
>>
>> timestep (and also any multistepping values)
>>
> 2 fs, SHAKE, no multistepping
>
>
>> cutoff (and also the pairlist and PME grid spacing)
>>
> 8-10-12; PME grid spacing ~1 A
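Just to make sure I'm reading that right, in NAMD config terms (assuming the
8-10-12 shorthand maps to switchdist/cutoff/pairlistdist, which is my guess;
the values are yours) that would be roughly:

  timestep            2.0
  rigidBonds          all     # SHAKE on bonds involving hydrogen
  switching           on
  switchdist          8.0
  cutoff              10.0
  pairlistdist        12.0
  PME                 yes
  PMEGridSpacing      1.0
  nonbondedFreq       1       # no multistepping
  fullElectFrequency  1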
>
>
>> Have you tried giving it just 1 or 2 GPUs alone (using the +devices)?
>>
>
> Yes, this is the benchmark time:
>
> np 1: 0.48615 s/step
> np 2: 0.26105 s/step
> np 4: 0.14542 s/step
> np 6: 0.10167 s/step
>
> I post here also part of the log running on 6 devices (in case it is
> helpful to localize the problem):
>
> Pe 4 has 57 local and 64 remote patches and 1066 local and 473 remote
> computes.
> Pe 1 has 57 local and 65 remote patches and 1057 local and 482 remote
> computes.
> Pe 5 has 57 local and 56 remote patches and 1150 local and 389 remote
> computes.
> Pe 2 has 57 local and 57 remote patches and 1052 local and 487 remote
> computes.
> Pe 3 has 58 local and 57 remote patches and 1079 local and 487 remote
> computes.
> Pe 0 has 57 local and 57 remote patches and 1144 local and 395 remote
> computes.
>
> Gianluca
>
>
> Gianluca
>>
>> On Thu, 12 Jul 2012, Aron Broom wrote:
>>
>> have you tried the multicore build? I wonder if the prebuilt smp one is just not working for you.
>>
>> On Thu, Jul 12, 2012 at 3:21 PM, Gianluca Interlandi <gianluca_at_u.washington.edu> wrote:
>> are other people also using those GPUs?
>>
>>
>> I don't think so since I reserved the entire node.
>>
>> What are the benchmark timings that you are given after ~1000 steps?
>>
>>
>> The benchmark time with 6 processes is 101 sec for 1000 steps. This is
>> only slightly faster than Trestles where I get 109 sec for 1000 steps
>> running on 16 CPUs. So, yes 6 GPUs on Forge are much faster than 6 cores
>> on Trestles, but in terms of SUs it makes no difference, since on Forge I
>> still have to reserve the entire node (16 cores).
>>
>> Gianluca
>>
>> is some setup time.
>>
>> I often run a system of ~100,000 atoms, and I generally see an order of
>> magnitude improvement in speed compared to the same number of cores
>> without the GPUs. I would test the non-CUDA precompiled code on your
>> Forge system and see how that compares; it might be the fault of
>> something other than CUDA.
>>
>> ~Aron
>>
>> On Thu, Jul 12, 2012 at 2:41 PM, Gianluca Interlandi <gianluca_at_u.washington.edu> wrote:
>> Hi Aron,
>>
>> Thanks for the explanations. I don't know whether I'm doing everything
>> right. I don't see any speed advantage running on the CUDA cluster
>> (Forge) versus running on a non-CUDA cluster.
>>
>> I did the following benchmarks on Forge (the system has 127,000 atoms
>> and ran for 1000 steps):
>>
>> np 1: 506 sec
>> np 2: 281 sec
>> np 4: 163 sec
>> np 6: 136 sec
>> np 12: 218 sec
>>
>> On the other hand, running the same system on 16 cores of Trestles (AMD
>> Magny Cours) takes 129 sec. It seems that I'm not really making good use
>> of SUs by running on the CUDA cluster. Or maybe I'm doing something
>> wrong? I'm using the ibverbs-smp-CUDA pre-compiled version of NAMD 2.9.
>>
>> Thanks,
>>
>> Gianluca
>>
>> On Tue, 10 Jul 2012, Aron Broom wrote:
>>
>> if it is truly just one node, you can use the multicore-CUDA version and
>> avoid the MPI charmrun stuff. Still, it boils down to much the same thing
>> I think. If you do what you've done below, you are running one job with
>> 12 CPU cores and all GPUs. If you don't specify the +devices, NAMD will
>> automatically find the available GPUs, so I think the main benefit of
>> specifying them is when you are running more than one job and don't want
>> the jobs sharing GPUs.
>>
>> I'm not sure you'll see great scaling across 6 GPUs for a single job, but
>> that would be great if you did.
>>
>> ~Aron
>>
>> On Tue, Jul 10, 2012 at 1:14 PM, Gianluca Interlandi <gianluca_at_u.washington.edu> wrote:
>> Hi,
>>
>> I have a question concerning running NAMD on a CUDA cluster.
>>
>> NCSA Forge has for example 6 CUDA devices and 16 CPU cores per node. If I
>> want to use all 6 CUDA devices in a node, how many processes is it
>> recommended to spawn? Do I need to specify "+devices"?
>>
>> So, if for example I want to spawn 12 processes, do I need to specify:
>>
>> charmrun +p12 -machinefile $PBS_NODEFILE +devices 0,1,2,3,4,5 namd2 +idlepoll
>>
>> Thanks,
>>
>> Gianluca
>>
> -----------------------------------------------------
> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> +1 (206) 685 4435
> http://artemide.bioeng.washington.edu/
>
> Research Scientist at the Department of Bioengineering
> at the University of Washington, Seattle WA U.S.A.
> -----------------------------------------------------
>
--
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo