Re: which portions of NAMD are CUDA accelerated?

From: Wenyu Zhong (wenyuzhong_at_gmail.com)
Date: Wed Dec 09 2009 - 23:45:54 CST

Hi Pu,

I'm sorry I didn't mention it clearly. Both sets of data were obtained
with CPU + GPU. With just one core of the Athlon X2 at 2.8 GHz, and
without the GPU, the ApoA1 benchmark runs at 1.57 s/step.

CPU Freq (GHz) | 1 CPU core + 1 GPU  | 2 CPU cores + 1 GPU
               | s/step | speedup    | s/step | speedup
2.8            | 0.1894 | 1          | 0.1597 | 1
3.08           | 0.1806 | 1.049      | 0.1539 | 1.038
3.36           | 0.1731 | 1.094      | 0.1493 | 1.090
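
(The speedup column is the 2.8 GHz step time divided by the step time at
the given frequency, e.g. 0.1894 / 0.1806 ≈ 1.049 for one core at
3.08 GHz.)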

Wenyu

2009/12/9, Tian, Pu (NIH/NIDDK) [C] <tianpu_at_niddk.nih.gov>:
> Hi Wenyu,
>
> Do you have any data regarding the difference between 2 CPU + GPU and
> 2 CPU only? If the first set of data is CPU-only, then the GPU did not
> speed up your simulation at all.
>
> Thanks,
>
> Pu
> ________________________________________
> From: owner-namd-l_at_ks.uiuc.edu [owner-namd-l_at_ks.uiuc.edu] On Behalf Of Wenyu
> Zhong [wenyuzhong_at_gmail.com]
> Sent: Wednesday, December 09, 2009 1:29 AM
> To: namd-l_at_ks.uiuc.edu
> Subject: Re: namd-l: which portions of NAMD are CUDA accelerated?
>
> Hi everyone,
>
> I'd like to share my benchmark results using NAMD 2.7b2 CUDA. I have
> overclocked and underclocked the CPU and GPU to try to find the best
> combination.
>
> Here is the hardware setup:
>
> Athlon X2 240 @ 2.8 GHz // 1 GB DDR2 // AMD 770 // 1x GTX260 (216 SP)
> @ 655/1404/1050 (core/shader/memory)
>
> The standard ApoA1 benchmark is used, except that outputEnergies is
> changed to 500 (see the excerpt below).
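>
> For reference, assuming the standard apoa1.namd input from the NAMD
> benchmark distribution, the change is a single config line:
>
>   outputEnergies   500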
>
> First, the CPU frequency is set to 2.8, 3.08, and 3.36 GHz, with one or
> two processes. As shown below, adding a second CPU core gains about 20%
> in performance, while a 10% increase in CPU clock speed gains only
> about 5%.
>
> Freq (GHz) | 1 core             | 2 cores
>            | s/step | speedup   | s/step | speedup
> 2.8        | 0.1894 | 1         | 0.1597 | 1
> 3.08       | 0.1806 | 1.049     | 0.1539 | 1.038
> 3.36       | 0.1731 | 1.094     | 0.1493 | 1.090
>
> Second, the core/memory clocks of the GPU are reduced to 95%, 90%, and
> 85% of the original settings (NVIDIA does not provide a way to tune the
> shader clock directly). Two processes and a CPU speed of 3.36 GHz are
> used. The performance loss tracks the clock reduction closely, as shown
> below.
>
> Core (MHz) | Mem (MHz) | Ratio | s/step | speedup
> 655        | 1050      | 1.00  | 0.1474 | 1
> 621        | 988       | 0.95  | 0.1529 | 0.964
> 589        | 945       | 0.90  | 0.1629 | 0.905
> 558        | 895       | 0.85  | 0.1695 | 0.870
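>
> (Speedup here is the stock-clock step time divided by the downclocked
> step time, e.g. 0.1474 / 0.1629 ≈ 0.905 at the 0.90 ratio; the nearly
> 1:1 tracking between clock ratio and speedup suggests the run is
> GPU-bound at this CPU speed.)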
>
> Clearly, raising the GPU clocks yields a corresponding performance
> gain, and an AMD CPU may be sufficient for this single-GTX260 system.
>
> P.S. The ApoA1 benchmark on this Athlon @ 2.8 GHz system with just one
> core, using NAMD 2.7b1-x64-TCP, runs at 1.57 s/step, i.e. at about 10%
> of the speed of the 2-CPU-core + 1-GPU system.
>
> Hope this is helpful.
>
> Wenyu
>
> 2009/12/5 Paul Rigor (uci-ics) <prigor_at_ics.uci.edu>
> Hmm... I'm not sure how to answer that, but I ran the latest version of
> NAMD (2.7b2), CUDA and non-CUDA, with the following (non-beefy) hardware
> configuration and simulation system. Two processes sharing a single GPU
> device got over a 2x speedup over the CPU-only processes. I'm not sure
> whether you can extrapolate the amount of time spent on non-bonded
> forces from this, but it certainly cuts the compute time in half. For
> kicks, I also ran the same simulation on a beefy Sun X4150 server; it
> outperforms the dinky desktop by about a third, though it uses 8 cores
> to do so.
>
>
> For a larger simulation (NAMD 2.7b2-TCP), I'm actually using six of
> these machines for a system with 5x the number of atoms, and I'm not
> happy with the amount of time spent passing messages. For minimization
> of 40K atoms (5000 steps) on a single Sun server, I get
>
> WallClock: 1469.357788s CPUTime: 589.791321s
>
> but on 6 nodes running 8 processes each, I get
>
> WallClock: 1135.021240s CPUTime: 199.707642s
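>
> (Going from 8 cores on one node to 48 cores across six nodes cuts the
> wall clock by only 1469.36 / 1135.02 ≈ 1.29x, even though the reported
> CPU time drops by roughly 3x; the difference is consistent with a large
> share of wall time going into message passing over the gigabit links.)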
>
> I don't know the right number of nodes (and cores/node) vs. the number
> of atoms to achieve optimum performance. In any case, I'm looking
> forward to setting up a GPU cluster with Nehalem CPUs and 2x GPU
> devices!
>
> Sorry for the digression!
>
> Cheers,
> Paul
>
> ===STATS BELOW!===
>
> ===Beefy hardware==
> Sun X4150 (1333 MHz FSB)
> Gentoo Linux
> 2x Quad-core Intel Xeon CPU E5450 @ 3.00GHz
> 16GB DDR2 (667 MHz)
> (The clusters are connected via 1GbE through a switch with a 10GbE
> backplane... sorry, no InfiniBand interconnect!)
>
>
> ===Not-so-beefy hardware, but CUDA-equipped===
> Dell Vostro 220 mini
> Fedora Core 11
> Intel G45 Express Chipset
> Intel Core2 Duo CPU E7400 @ 2.80GHz
> 4GB DDR2 RAM (800 MHz)
> NVidia GeForce GTX 260 (192 PE, 896MB DDR3 RAM)
>
>
> ===MD (5 million steps; 1 ns duration; already minimized; water box 5A,
> NVT)===
> Info: SUMMARY OF PARAMETERS:
> Info: 307 BONDS
> Info: 769 ANGLES
> Info: 1254 DIHEDRAL
> Info: 81 IMPROPER
> Info: 6 CROSSTERM
> Info: 190 VDW
> Info: 0 VDW_PAIRS
> Info: TIME FOR READING PSF FILE: 0.0717082
> Info: TIME FOR READING PDB FILE: 0.0138652
> Info:
> Info: Reading from binary file xxx.restart.coor
> Info: ****************************
> Info: STRUCTURE SUMMARY:
> Info: 8389 ATOMS
> Info: 5964 BONDS
> Info: 4429 ANGLES
> Info: 2913 DIHEDRALS
> Info: 176 IMPROPERS
> Info: 66 CROSSTERMS
> Info: 0 EXCLUSIONS
> Info: 7845 RIGID BONDS
> Info: 17322 DEGREES OF FREEDOM
> Info: 2976 HYDROGEN GROUPS
> Info: TOTAL MASS = 51577.3 amu
> Info: TOTAL CHARGE = 1.02818e-06 e
> Info: MASS DENSITY = 0.914731 g/cm^3
> Info: ATOM DENSITY = 0.0895954 atoms/A^3
>
> ===Results of MD===
> Desktop, CUDA-enabled, +p2 ++local
> WallClock: 15890.967773 CPUTime: 15880.965820 Memory: 14.010666 MB
>
> Desktop, CPU-only, +p2 ++local
> WallClock: 36608.851562 CPUTime: 36044.246094 Memory: 20.907639 MB
>
> Sun X4150 server, CPU-only, +p8 ++local
> WallClock: 10322.159180 CPUTime: 9856.663086 Memory: 12.163300 MB
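>
> (From these numbers, CUDA gives the desktop a 36608.85 / 15890.97 ≈
> 2.3x speedup over its own CPU-only run, and the 8-core server needs
> 10322.16 / 15890.97 ≈ 65% of the CUDA desktop's wall clock, matching
> the "over 2X" and "by a third" figures above.)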
>
>
>
> On Thu, Dec 3, 2009 at 6:29 PM, Biff Forbush <biff.forbush_at_yale.edu>
> wrote:
> Is there an estimate of how much of the total calculation time is taken
> by the real-space part of the non-bonded forces on the CPU alone?
> Recognizing that this will be machine- and problem-size-dependent, is
> the answer available for the benchmark examples?
> regards,
> biff
>
>
> Axel Kohlmeyer wrote:
> On Wed, 2009-12-02 at 18:00 -0800, Paul Rigor (uci) wrote:
>
> Hi,
>
>
> Was wondering if there's a breakdown of the portions of NAMD that are
> currently CUDA accelerated?
>
>
> very simple: the calculation of the real space part of the non-bonded
> forces.
>
> cheers,
> axel.
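>
> To make this concrete, here is a minimal illustrative CUDA kernel for
> the real-space part of the non-bonded forces: an all-pairs
> Lennard-Jones plus Coulomb loop with a distance cutoff. It is a sketch
> only; the kernel and parameter names are invented for illustration,
> and NAMD's real kernels add pair lists, exclusions, the PME erfc()
> correction, and shared-memory tiling, none of which appear here.
>
>   // Illustrative sketch only; not NAMD source code.
>   // One thread accumulates the short-range force on one atom.
>   __global__ void nonbondedRealSpace(int n,
>                                      const float4 *posq,  // x, y, z, charge
>                                      float3 *force,
>                                      float cutoff2,       // cutoff distance^2
>                                      float eps,           // LJ epsilon
>                                      float sigma2)        // LJ sigma^2
>   {
>       int i = blockIdx.x * blockDim.x + threadIdx.x;
>       if (i >= n) return;
>       float4 pi = posq[i];
>       float3 f = make_float3(0.f, 0.f, 0.f);
>       for (int j = 0; j < n; ++j) {       // production code tiles this loop
>           if (j == i) continue;
>           float4 pj = posq[j];
>           float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
>           float r2 = dx*dx + dy*dy + dz*dz;
>           if (r2 > cutoff2) continue;     // the real-space cutoff
>           float invR2 = 1.f / r2;
>           float sr2 = sigma2 * invR2;     // (sigma/r)^2
>           float sr6 = sr2 * sr2 * sr2;
>           // LJ: F/r = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2
>           float fLJ = 24.f * eps * (2.f * sr6 * sr6 - sr6) * invR2;
>           // Coulomb (no PME correction): F/r = q_i*q_j / r^3
>           float fC = pi.w * pj.w * invR2 * rsqrtf(r2);
>           float fr = fLJ + fC;
>           f.x += fr * dx; f.y += fr * dy; f.z += fr * dz;
>       }
>       force[i] = f;
>   }
>
> This pairwise short-range computation is the work that moves to the
> GPU; per Axel's answer, everything else (bonded terms, the PME
> reciprocal sum, integration) stays on the CPU in this NAMD version.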
>
>
> Thanks!
> Paul
>
> --
> Paul Rigor
> Pre-doctoral BIT Fellow and Graduate Student
> Institute for Genomics and Bioinformatics
> Donald Bren School of Information and Computer Sciences
> University of California, Irvine
> http://www.ics.uci.edu/~prigor
>
> --
> Paul Rigor
> Pre-doctoral BIT Fellow and Graduate Student
> Institute for Genomics and Bioinformatics
> Donald Bren School of Information and Computer Sciences
> University of California, Irvine
>
> http://www.ics.uci.edu/~prigor
>
>
