From: Gianluca Interlandi (gianluca_at_u.washington.edu)
Date: Fri Jul 13 2012 - 01:35:49 CDT
Hi Norman,
> What value do you use for fullElectFrequency?
The default. I haven't set it.
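
(For reference, a minimal sketch of the multistepping-related NAMD keywords; the values are only illustrative since I leave them at their defaults, and as far as I know the default is to evaluate full electrostatics every step when no multistepping is set:)

   timestep            2.0   ;# fs, as in my runs
   # left at their defaults in my runs:
   # nonbondedFreq       1   ;# short-range nonbonded every step
   # fullElectFrequency  1   ;# PME every step (default follows nonbondedFreq)
   # stepspercycle       20
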
> How many GPUs are there per node in this cluster?
6
> What kind of interconnect?
Infiniband.
Here are all specs:
http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/DellNVIDIACluster/TechSummary/index.html
Thanks,
Gianluca
> Norman Geist.
>
>> -----Original Message-----
>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
>> On Behalf Of Gianluca Interlandi
>> Sent: Friday, July 13, 2012 00:26
>> To: Aron Broom
>> Cc: NAMD list
>> Subject: Re: namd-l: Running NAMD on Forge (CUDA)
>>
>> Yes, I was totally surprised, too. I also ran a non-CUDA job on Forge
>> using 16 CPUs. I got 0.122076 s/step, which is 16% slower than using the
>> 6 GPUs (0.1 s/step) and a bit slower than the 0.10932 s/step that I get
>> on Trestles using 16 cores. This difference might be statistical
>> fluctuation though (or configuration setup), since Forge and Trestles
>> have the exact same CPU, i.e., the eight-core 2.4 GHz Magny-Cours.
>>
>> Yes, Forge also uses NVIDIA M2070.
>>
>> I keep thinking of this guy here in Seattle who works for NVIDIA
>> downtown. A few years ago he asked me: "How come you don't use CUDA?"
>> Maybe the code still needs some optimization, and CPU manufacturers have
>> been doing everything to catch up.
>>
>> Gianluca
>>
>> On Thu, 12 Jul 2012, Aron Broom wrote:
>>
>>> So your speed for 1 or 2 GPUs (based on what you sent) is about 1.7
>>> ns/day, which seems decent given the system size. I was getting 2.0 and
>>> 2.6 ns/day for a 100k-atom system with roughly those same parameters
>>> (and also 6 CPU cores), so given a scaling of ~n log n, I would expect
>>> to see ~1.5 to 2.0 ns/day for you. So in my mind, the speed you are
>>> getting with the GPUs isn't so surprising; it's that you get such a
>>> good speed with only the CPUs that shocks me. In my case I didn't see
>>> speeds matching my 1 GPU until 48 CPU cores alone. Seems like those
>>> Magny-Cours are pretty awesome.
>>>
>>> Which GPUs are you using? I was using mainly the M2070s.
>>>
>>> Also, one thing that might be useful, if you are able to get roughly
>>> the same speed with 6 cores and 2 GPUs as you get with 16 cores alone,
>>> is to test running 3 jobs at once, with 5 cores and 2 GPUs assigned to
>>> each, and see how much slowdown there is. You might be able to benefit
>>> from various replica techniques more than from just hitting a single
>>> job with more power.
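>>>
>>> For instance, something along these lines (just a sketch, assuming the
>>> multicore-CUDA build; device numbers and file names are placeholders):
>>>
>>>   # three independent NAMD runs on one node, each pinned to 2 of the 6 GPUs
>>>   namd2 +p5 +devices 0,1 +idlepoll job1.conf > job1.log &
>>>   namd2 +p5 +devices 2,3 +idlepoll job2.conf > job2.log &
>>>   namd2 +p5 +devices 4,5 +idlepoll job3.conf > job3.log &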
>>>
>>> Still, the overall conclusion from what you've got seems to be that it
>>> makes more sense to go with more of those CPUs rather than putting GPUs
>>> in there.
>>>
>>> ~Aron
>>>
>>> On Thu, Jul 12, 2012 at 4:58 PM, Gianluca Interlandi
>>> <gianluca_at_u.washington.edu> wrote:
>>> What are your simulation parameters:
>>>
>>> timestep (and also any multistepping values)
>>>
>>> 2 fs, SHAKE, no multistepping
>>>
>>> cutoff (and also the pairlist and PME grid spacing)
>>>
>>> 8-10-12, PME grid spacing ~1 A
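>>>
>>> (Roughly, in NAMD config terms, as a sketch only, assuming "8-10-12"
>>> means switchdist / cutoff / pairlistdist:)
>>>
>>>    timestep        2.0   ;# fs
>>>    rigidBonds      all   ;# SHAKE/SETTLE on bonds involving hydrogen
>>>    switching       on
>>>    switchdist      8.0
>>>    cutoff          10.0
>>>    pairlistdist    12.0
>>>    PME             yes
>>>    PMEGridSpacing  1.0   ;# ~1 A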
>>>
>>> Have you tried giving it just 1 or 2 GPUs alone (using the
>>> +devices)?
>>>
>>>
>>> Yes, this is the benchmark time:
>>>
>>> np 1: 0.48615 s/step
>>> np 2: 0.26105 s/step
>>> np 4: 0.14542 s/step
>>> np 6: 0.10167 s/step
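>>>
>>> (At a 2 fs timestep, 0.10167 s/step corresponds to 86400 / 0.10167 ~
>>> 850,000 steps/day, i.e. about 1.7 ns/day on the 6 GPUs.)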
>>>
>>> I post here also part of the log running on 6 devices (in case it is
>>> helpful to localize the problem):
>>>
>>> Pe 4 has 57 local and 64 remote patches and 1066 local and 473 remote computes.
>>> Pe 1 has 57 local and 65 remote patches and 1057 local and 482 remote computes.
>>> Pe 5 has 57 local and 56 remote patches and 1150 local and 389 remote computes.
>>> Pe 2 has 57 local and 57 remote patches and 1052 local and 487 remote computes.
>>> Pe 3 has 58 local and 57 remote patches and 1079 local and 487 remote computes.
>>> Pe 0 has 57 local and 57 remote patches and 1144 local and 395 remote computes.
>>>
>>> Gianluca
>>>
>>> On Thu, 12 Jul 2012, Aron Broom wrote:
>>>
>>> have you tried the multicore build? I wonder if the prebuilt smp one
>>> is just not working for you.
>>>
>>> On Thu, Jul 12, 2012 at 3:21 PM, Gianluca Interlandi
>>> <gianluca_at_u.washington.edu> wrote:
>>> are other people also using those GPUs?
>>>
>>>
>>> I don't think so since I reserved the entire node.
>>>
>>> What are the benchmark timings that you are given after ~1000 steps?
>>>
>>>
>>> The benchmark time with 6 processes is 101 sec for 1000 steps. This is
>>> only slightly faster than Trestles, where I get 109 sec for 1000 steps
>>> running on 16 CPUs. So, yes, 6 GPUs on Forge are much faster than 6
>>> cores on Trestles, but in terms of SUs it makes no difference, since on
>>> Forge I still have to reserve the entire node (16 cores).
>>>
>>> Gianluca
>>>
>>> [...] is some setup time.
>>>
>>> I often run a system of ~100,000 atoms, and I generally see an order
>>> of magnitude improvement in speed compared to the same number of cores
>>> without the GPUs. I would test the non-CUDA precompiled code on your
>>> Forge system and see how that compares; it might be the fault of
>>> something other than CUDA.
>>>
>>> ~Aron
>>>
>>> On Thu, Jul 12, 2012 at 2:41 PM, Gianluca Interlandi
>>> <gianluca_at_u.washington.edu> wrote:
>>> Hi Aron,
>>>
>>> Thanks for the explanations. I don't know whether I'm doing everything
>>> right. I don't see any speed advantage running on the CUDA cluster
>>> (Forge) versus running on a non-CUDA cluster.
>>>
>>> I did the following benchmarks on Forge (the system has 127,000 atoms
>>> and ran for 1000 steps):
>>>
>>> np 1:  506 sec
>>> np 2:  281 sec
>>> np 4:  163 sec
>>> np 6:  136 sec
>>> np 12: 218 sec
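>>>
>>> (Going from np 1 to np 6 is a 506/136 ~ 3.7x speedup, i.e. roughly 62%
>>> parallel efficiency over the 6 GPUs, and np 12 is actually slower than
>>> np 6.)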
>>>
>>> On the other hand, running the same system on 16 cores of Trestles
>>> (AMD Magny-Cours) takes 129 sec. It seems that I'm not really making
>>> good use of SUs by running on the CUDA cluster. Or maybe I'm doing
>>> something wrong? I'm using the ibverbs-smp-CUDA pre-compiled version
>>> of NAMD 2.9.
>>>
>>> Thanks,
>>>
>>> Gianluca
>>>
>>> On Tue, 10 Jul 2012, Aron Broom wrote:
>>>
>>> If it is truly just one node, you can use the multicore-CUDA version
>>> and avoid the MPI charmrun stuff. Still, it boils down to much the same
>>> thing, I think. If you do what you've done below, you are running one
>>> job with 12 CPU cores and all GPUs. If you don't specify +devices, NAMD
>>> will automatically find the available GPUs, so I think the main benefit
>>> of specifying them is when you are running more than one job and don't
>>> want the jobs sharing GPUs.
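>>>
>>> For example, roughly (a sketch; the config file name is a placeholder):
>>>
>>>   # multicore-CUDA build on a single node; NAMD finds the GPUs automatically
>>>   namd2 +p12 +idlepoll run.conf > run.log
>>>
>>>   # or restrict the job to specific GPUs with +devices
>>>   namd2 +p12 +devices 0,1,2,3,4,5 +idlepoll run.conf > run.log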
>>>
>>> I'm not sure you'll see great scaling across 6 GPUs for a single job,
>>> but that would be great if you did.
>>>
>>> ~Aron
>>>
>>> On Tue, Jul 10, 2012 at 1:14 PM, Gianluca Interlandi
>>> <gianluca_at_u.washington.edu> wrote:
>>> Hi,
>>>
>>> I have a question concerning running NAMD on a CUDA cluster.
>>>
>>> NCSA Forge has for example 6 CUDA devices and 16 CPU cores per node. If
>>> I want to use all 6 CUDA devices in a node, how many processes is it
>>> recommended to spawn? Do I need to specify "+devices"?
>>>
>>> So, if for example I want to spawn 12 processes, do I need to specify:
>>>
>>>   charmrun +p12 -machinefile $PBS_NODEFILE +devices 0,1,2,3,4,5 namd2 +idlepoll
>>>
>>> Thanks,
>>>
>>> Gianluca
>>>
>
>
-----------------------------------------------------
Gianluca Interlandi, PhD gianluca_at_u.washington.edu
+1 (206) 685 4435
http://artemide.bioeng.washington.edu/
Research Scientist at the Department of Bioengineering
at the University of Washington, Seattle WA U.S.A.
-----------------------------------------------------