RE: which portions of NAMD are CUDA accelerated?

From: Tian, Pu (NIH/NIDDK) [C] (tianpu_at_niddk.nih.gov)
Date: Wed Dec 09 2009 - 08:08:28 CST

Hi Wenyu,

Do you have any data regarding the difference between 2CPU+GPU and 2CPU only? If the first set of data is CPU-only, then the GPU did not speed up your simulation at all.

Thanks,

Pu
________________________________________
From: owner-namd-l_at_ks.uiuc.edu [owner-namd-l_at_ks.uiuc.edu] On Behalf Of Wenyu Zhong [wenyuzhong_at_gmail.com]
Sent: Wednesday, December 09, 2009 1:29 AM
To: namd-l_at_ks.uiuc.edu
Subject: Re: namd-l: which portions of NAMD are CUDA accelerated?

Hi everyone,

I'd like to share my benchmark results using NAMD 2.7b2 CUDA. I have
overclocked and downclocked the CPU/GPU to try to find the best
combination.

Here is the hardware setup:

Athlon X2 240 @ 2.8 GHz // 1 GB DDR2 // AMD 770 // 1x GTX260 216sp @ 655/1404/1050

The standard apoa1 benchmark is used, except that outputEnergies is changed to 500.

First, the CPU frequency is set to 2.8, 3.08, and 3.36 GHz, using one
or two processes. As shown below, about 20% performance is gained by
adding a second core, but only about 5% by increasing the CPU clock
by 10%.

Freq (GHz) | 1 core s/step | speedup | 2 cores s/step | speedup
2.8        | 0.1894        | 1       | 0.1597         | 1
3.08       | 0.1806        | 1.049   | 0.1539         | 1.038
3.36       | 0.1731        | 1.094   | 0.1493         | 1.090
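The speedup column can be reproduced directly from the s/step values; a quick check in Python, using the 1-core numbers from the table:

```python
# Reproduce the 1-core speedup column: the 2.8 GHz run is the baseline.
baseline = 0.1894  # s/step at 2.8 GHz, 1 core

for freq, s_per_step in [(3.08, 0.1806), (3.36, 0.1731)]:
    speedup = baseline / s_per_step
    print(f"{freq} GHz: {speedup:.3f}x")  # prints 1.049x and 1.094x
```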

Second, the GPU core/memory clocks are lowered to 95%, 90%, and 85%
of stock (NVIDIA does not provide a way to tune the shader clock
directly). Two processes and a CPU clock of 3.36 GHz are used. The
performance loss roughly tracks the clock reduction:

Core | Mem  | ratio | s/step | relative speed
655  | 1050 | 1.00  | 0.1474 | 1
621  | 988  | 0.95  | 0.1529 | 0.964
589  | 945  | 0.90  | 0.1629 | 0.905
558  | 895  | 0.85  | 0.1695 | 0.870

Obviously, raising the GPU clocks gives a corresponding performance
gain, and an AMD CPU may be enough for this single-GTX260 system.
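The relative-speed column follows the same pattern; checking it in Python with the s/step values from the table:

```python
# Relative performance as the GPU clocks are lowered; the stock
# 655/1050 run is the baseline.
baseline = 0.1474  # s/step at stock clocks

for clock_ratio, s_per_step in [(0.95, 0.1529), (0.90, 0.1629), (0.85, 0.1695)]:
    print(f"clocks at {clock_ratio:.0%}: relative speed {baseline / s_per_step:.3f}")
```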

PS: the apoa1 benchmark on this Athlon @ 2.8 GHz system with just one
core, using NAMD 2.7b1-x64-tcp, runs at 1.57 s/step, about 10% of the
speed of the 2-core + 1-GPU system.

Hope this is helpful.

Wenyu

2009/12/5 Paul Rigor (uci-ics) <prigor_at_ics.uci.edu<mailto:prigor_at_ics.uci.edu>>
Hmm... I'm not sure how to answer that, but I ran the latest version of NAMD (2.7b2), CUDA and non-CUDA, with the following (non-beefy) hardware configuration and simulation system. The two processes sharing a single GPU device had over a 2X speedup over the CPU-only processes. I'm not sure if you can extrapolate the amount of time spent on non-bonded forces, but it sure does cut the compute time in half. For kicks, I also ran the same simulation on a beefy Sun X4150 server. It outperforms the dinky desktop by about a third, though it uses 8 cores to do so.

For a larger simulation system (NAMD 2.7b2-TCP), I'm actually using six of these machines for a system with 5X the number of atoms. I'm not liking the amount of time spent passing messages. So, for minimization of 40K atoms (5000 steps) on a single Sun server, I get

WallClock: 1469.357788s CPUTime: 589.791321s

but on 6 nodes running 8 processes each, I get

WallClock: 1135.021240s CPUTime: 199.707642s

I don't know the right number of nodes (and cores/node) vs. the number of atoms to achieve optimum performance. In any case, I'm looking forward to setting up a GPU cluster with Nehalem chipsets and 2x GPU devices!
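For what it's worth, those two WallClock figures translate into a rough parallel speedup and efficiency (a back-of-the-envelope check, treating the single server as the baseline):

```python
# Back-of-the-envelope scaling check for the 40K-atom minimization.
t_single = 1469.357788   # WallClock, one Sun server (8 processes)
t_cluster = 1135.021240  # WallClock, 6 nodes x 8 processes

speedup = t_single / t_cluster  # ~1.29x for 6x the hardware
efficiency = speedup / 6        # ~22% parallel efficiency
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

A 1.29x speedup from six times the hardware is consistent with message passing dominating the run.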

Sorry for the digression!

Cheers,
Paul

===STATS BELOW!===

===Beefy hardware==
Sun X4150 (1333Mhz FSB)
Gentoo Linux
2x Quad-core Intel Xeon CPU E5450 @ 3.00GHz
16GB DDR2 (667Mhz)
(The clusters are connected via 1GbE through a switch with a 10GbE backplane... sorry, no Infiniband interconnect!)

===Not-so-beefy hardware, but CUDA-equipped===
Dell Vostro 220 mini
Fedora Core 11
Intel(R) G45 Express Chipset
Intel Core2 Duo CPU E7400 @ 2.80GHz
4GB DDR2 RAM (800Mhz)
NVidia GeForce GTX 260 (192 PE, 896MB DDR3 RAM)

===MD (5 million steps; 1 ns duration; already minimized; water box 5A, NVT)===
Info: SUMMARY OF PARAMETERS:
Info: 307 BONDS
Info: 769 ANGLES
Info: 1254 DIHEDRAL
Info: 81 IMPROPER
Info: 6 CROSSTERM
Info: 190 VDW
Info: 0 VDW_PAIRS
Info: TIME FOR READING PSF FILE: 0.0717082
Info: TIME FOR READING PDB FILE: 0.0138652
Info:
Info: Reading from binary file xxx.restart.coor
Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 8389 ATOMS
Info: 5964 BONDS
Info: 4429 ANGLES
Info: 2913 DIHEDRALS
Info: 176 IMPROPERS
Info: 66 CROSSTERMS
Info: 0 EXCLUSIONS
Info: 7845 RIGID BONDS
Info: 17322 DEGREES OF FREEDOM
Info: 2976 HYDROGEN GROUPS
Info: TOTAL MASS = 51577.3 amu
Info: TOTAL CHARGE = 1.02818e-06 e
Info: MASS DENSITY = 0.914731 g/cm^3
Info: ATOM DENSITY = 0.0895954 atoms/A^3

===Results of MD===
Desktop, CUDA-enabled, +p2 ++local
WallClock: 15890.967773 CPUTime: 15880.965820 Memory: 14.010666 MB

Desktop, CPU-only, +p2 ++local
WallClock: 36608.851562 CPUTime: 36044.246094 Memory: 20.907639 MB

Sun X4150 server, CPU-only, +p8 ++local
WallClock: 10322.159180 CPUTime: 9856.663086 Memory: 12.163300 MB
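The "over 2X" and "by a third" figures can be checked directly from the WallClock numbers above:

```python
# Speedups implied by the WallClock times above.
t_cuda_desktop = 15890.967773  # desktop, CUDA, 2 processes
t_cpu_desktop = 36608.851562   # desktop, CPU-only, 2 processes
t_cpu_server = 10322.159180    # Sun X4150, CPU-only, 8 processes

print(f"CUDA vs CPU-only desktop: {t_cpu_desktop / t_cuda_desktop:.2f}x")          # ~2.30x
print(f"server vs CUDA desktop: {1 - t_cpu_server / t_cuda_desktop:.0%} faster")   # ~35%, about a third
```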

On Thu, Dec 3, 2009 at 6:29 PM, Biff Forbush <biff.forbush_at_yale.edu<mailto:biff.forbush_at_yale.edu>> wrote:
Is there an estimate of how much of the total calculation time is taken by the real-space part of the non-bonded forces with the CPU alone? Recognizing that this will be machine- and problem-size dependent, is the answer available for the benchmark examples?
regards,
biff

Axel Kohlmeyer wrote:
On Wed, 2009-12-02 at 18:00 -0800, Paul Rigor (uci) wrote:

Hi,

Was wondering if there's a breakdown of the portions of NAMD that
are currently CUDA accelerated?

very simple: the calculation of the real space part of the non-bonded forces.

cheers,
  axel.

Thanks!
Paul

--
Paul Rigor
Pre-doctoral BIT Fellow and Graduate Student
Institute for Genomics and Bioinformatics
Donald Bren School of Information and Computer Sciences
University of California, Irvine
http://www.ics.uci.edu/~prigor

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:53:34 CST