Re: which portions of NAMD are CUDA accelerated?

From: Wenyu Zhong (wenyuzhong_at_gmail.com)
Date: Wed Dec 09 2009 - 00:29:25 CST

Hi everyone,

I'd like to share my benchmark results with NAMD 2.7b2 CUDA. I
overclocked and downclocked the CPU/GPU to try to find the best
combination.

Here is the hardware setup:

Athlon X2 240 @ 2.8 GHz // 1 GB DDR2 // AMD 770 // 1x GTX 260 (216 SP) @ 655/1404/1050 MHz

The standard apoa1 benchmark is used, except that outputEnergies is changed
to 500.

First, the CPU frequency is set to 2.8, 3.08, and 3.36 GHz, using one or
two processes. As shown below, adding a second core gives about a 20%
performance gain, while a 10% increase in CPU speed gives only about a 5%
gain.

Freq (GHz) | 1 core s/step | speedup | 2 cores s/step | speedup
-----------+---------------+---------+----------------+--------
   2.80    |    0.1894     |  1.000  |     0.1597     |  1.000
   3.08    |    0.1806     |  1.049  |     0.1539     |  1.038
   3.36    |    0.1731     |  1.094  |     0.1493     |  1.090

Second, the GPU core/memory clocks are downgraded to 95%, 90%, and 85% of
the original settings (NVIDIA does not provide a way to tune the shader
clock directly). Two processes and a CPU speed of 3.36 GHz are used. The
performance loss roughly tracks the clock reduction.

Core (MHz) | Mem (MHz) | clock ratio | s/step | speedup
-----------+-----------+-------------+--------+--------
    655    |   1050    |    1.00     | 0.1474 |  1.000
    621    |    988    |    0.95     | 0.1529 |  0.964
    589    |    945    |    0.90     | 0.1629 |  0.905
    558    |    895    |    0.85     | 0.1695 |  0.870

Clearly, raising the GPU clocks gives a corresponding performance gain, and
a budget AMD CPU seems to be enough for this single-GTX 260 system.

P.S. The apoa1 benchmark on this Athlon @ 2.8 GHz system with just one core,
using NAMD 2.7b1-x64-TCP (CPU only), runs at 1.57 s/step, about 10% of the
speed of the 2-core + 1-GPU system.
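
(For anyone repeating this: the s/step figures are taken from NAMD's
benchmark timing output. A minimal Python parsing sketch, assuming the log
contains the usual "Info: Benchmark time: ... s/step ..." lines:)

    # extract the benchmark s/step from a NAMD log file
    import re
    import sys

    pattern = re.compile(r"Benchmark time:\s+\d+\s+CPUs\s+([\d.]+)\s+s/step")

    with open(sys.argv[1]) as log:           # e.g. "python bench.py apoa1.log"
        times = [float(m.group(1))
                 for line in log
                 if (m := pattern.search(line))]

    if times:
        # NAMD prints the benchmark line more than once; report the average
        print(f"{sum(times) / len(times):.4f} s/step")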

Hope this is helpful.

Wenyu

2009/12/5 Paul Rigor (uci-ics) <prigor_at_ics.uci.edu>

> Hmm... I'm not sure how to answer that, but I ran the latest version of NAMD
> (2.7b2), CUDA and non-CUDA, with the following (non-beefy) hardware
> configuration and simulation system. The two processes sharing a single
> GPU device had over a 2X speedup over the CPU-only processes. I'm not sure
> if you can extrapolate the amount of time spent on non-bonded forces, but it
> sure does cut the compute time in half. For kicks, I also ran the
> same simulation on a beefy Sun X4150 server. It outperforms the dinky
> desktop by about a third, though it uses 8 cores to do so.
>
>
> For a larger simulation system (NAMD 2.7b2-TCP), I'm actually using six of
> these machines for a system with 5X the number of atoms. I'm not liking the
> amount of time spent passing messages. For minimization of 40K atoms
> (5000 steps) on a single Sun server, I get
>
> WallClock: 1469.357788s CPUTime: 589.791321s
>
> but on 6 nodes running 8 processes each, I get
>
> WallClock: 1135.021240s CPUTime: 199.707642s
>
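> (Rough back-of-the-envelope check, computed from the WallClock figures
> above: that is roughly a 1.3x speedup on 6x the hardware, i.e. only about
> 20% parallel efficiency, so message passing really does eat most of the
> gain. A minimal Python sketch of the arithmetic:)
>
>     # parallel-efficiency estimate from the minimization WallClock times above
>     t_1_node = 1469.357788           # s, 1 Sun server (8 processes)
>     t_6_nodes = 1135.021240          # s, 6 nodes x 8 processes each
>     speedup = t_1_node / t_6_nodes   # ~1.29x
>     efficiency = speedup / 6         # ~0.22, i.e. ~22% parallel efficiency
>     print(f"speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")
>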
> I don't know what the right number of nodes (and cores/node) vs. the number
> of atoms is to achieve optimum performance. In any case, I'm looking forward
> to setting up a GPU cluster with Nehalem chipsets and 2x GPU devices!
>
> Sorry for the digression!
>
> Cheers,
> Paul
>
> ===STATS BELOW!===
>
> *===Beefy hardware==*
> Sun X4150 (1333Mhz FSB)
> Gentoo Linux
> 2x Quad-core Intel Xeon CPU E5450 @ 3.00GHz
> 16GB DDR2 (667Mhz)
> (The cluster nodes are connected via 1 GbE through a switch with a 10 GbE
> backplane... sorry, no InfiniBand interconnect!)
>
>
> *===Not-so-beefy hardware, but CUDA-equipped===*
> Dell Vostro 220 mini
> Fedora Core 11
> Intel® G45 Express Chipset
> Intel Core2 Duo CPU E7400 @ 2.80GHz
> 4GB DDR2 RAM (800Mhz)
> NVidia GeForce GTX 260 (192 PE, 896MB DDR3 RAM)
>
>
> *===MD (5 million steps; 1 ns duration; already minimized; water box 5A,
> NVT)===*
> Info: SUMMARY OF PARAMETERS:
> Info: 307 BONDS
> Info: 769 ANGLES
> Info: 1254 DIHEDRAL
> Info: 81 IMPROPER
> Info: 6 CROSSTERM
> Info: 190 VDW
> Info: 0 VDW_PAIRS
> Info: TIME FOR READING PSF FILE: 0.0717082
> Info: TIME FOR READING PDB FILE: 0.0138652
> Info:
> Info: Reading from binary file xxx.restart.coor
> Info: ****************************
> Info: STRUCTURE SUMMARY:
> Info: 8389 ATOMS
> Info: 5964 BONDS
> Info: 4429 ANGLES
> Info: 2913 DIHEDRALS
> Info: 176 IMPROPERS
> Info: 66 CROSSTERMS
> Info: 0 EXCLUSIONS
> Info: 7845 RIGID BONDS
> Info: 17322 DEGREES OF FREEDOM
> Info: 2976 HYDROGEN GROUPS
> Info: TOTAL MASS = 51577.3 amu
> Info: TOTAL CHARGE = 1.02818e-06 e
> Info: MASS DENSITY = 0.914731 g/cm^3
> Info: ATOM DENSITY = 0.0895954 atoms/A^3
>
> *===Results of MD===*
> Desktop, CUDA-enabled, +p2 ++local
> WallClock: 15890.967773 CPUTime: 15880.965820 Memory: 14.010666 MB
>
> Desktop, CPU-only, +p2 ++local
> WallClock: 36608.851562 CPUTime: 36044.246094 Memory: 20.907639 MB
>
> Sun X4150 server, CPU-only, +p8 ++local
> WallClock: 10322.159180 CPUTime: 9856.663086 Memory: 12.163300 MB
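>
> (Quick check on the numbers: 36608.9 / 15891.0 ≈ 2.30, so the CUDA run is
> about 2.3x faster than CPU-only on the same desktop, and 15891.0 / 10322.2
> ≈ 1.54, so the 8-core Sun server trims roughly a third off the desktop's
> CUDA wall-clock time.)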
>
>
>
> On Thu, Dec 3, 2009 at 6:29 PM, Biff Forbush <biff.forbush_at_yale.edu>wrote:
>
>> Is there an estimate of how much of the total calculation time is taken by
>> the real-space part of the non-bonded forces with the CPU alone? Recognizing
>> that this will be machine- and problem-size dependent, is the answer
>> available for the benchmark examples?
>> regards,
>> biff
>>
>>
>> Axel Kohlmeyer wrote:
>>
>>> On Wed, 2009-12-02 at 18:00 -0800, Paul Rigor (uci) wrote:
>>>
>>>
>>>> Hi,
>>>>
>>>>
>>>> Was wondering if there's a breakdown of the portions of NAMD that are
>>>> currently CUDA accelerated?
>>>>
>>>>
>>>
>>> very simple: the calculation of the real space part of the non-bonded
>>> forces.
>>>
>>> cheers,
>>> axel.
>>>
>>>
>>>
>>>> Thanks!
>>>> Paul
>>>>
>>>> --
>>>> Paul Rigor Pre-doctoral BIT Fellow and Graduate Student Institute for
>>>> Genomics and Bioinformatics Donald Bren School of Information and Computer
>>>> Sciences University of California, Irvine
>>>> http://www.ics.uci.edu/~prigor
>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> Paul Rigor
> Pre-doctoral BIT Fellow and Graduate Student
> Institute for Genomics and Bioinformatics
> Donald Bren School of Information and Computer Sciences
> University of California, Irvine
> http://www.ics.uci.edu/~prigor
>
