Re: Cuda benchmarks namd2.7b1 - effects of configuration parameters and hardware.

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Sun Aug 30 2009 - 15:33:08 CDT

On Sun, 2009-08-30 at 12:11 -0700, Gianluca Interlandi wrote:
> Thank you Mike for sharing your information.
>
> According to your benchmarks, it looks like the less expensive GTX285
> (around $300) performs better than a Tesla C1060 which is 4 times more
> expensive.

gianluca,

this is a correct observation. the GTX 285 is currently
the fastest nvidia GPU available.

> However, as Axel pointed out, the hardware configuration might play a
> major role. Maybe, there is a configuration where the much more expensive
> Tesla would have a clear advantage over the GTX280/285.

not really. the advantages of the teslas lie elsewhere. first of all,
the C1060s come with 4GB of memory, compared to the (typical) 1GB of
the GTX 285 (the S1070 is basically just a case with a power supply,
two PCIe bridges, connectors, and 4 C1060s inside). for codes that run
entirely on the GPU and/or have large data sets, this matters a lot.

classical MD, and particularly the way NAMD (currently) uses the
GPU, is not a good case for that. one important application of GPUs is
fast processing of MRI or similar data; there every byte counts and the
4GB of the tesla is essential. ...and in general, the largest benefits
from GPUs are achieved when processing large data sets with high
computational complexity. with classical MD running in parallel you
don't have this, which is one of the reasons why it is so difficult to
get high speedups.

the teslas are also more thoroughly tested and certified, and the
components used should be less sensitive to (minor) failures.
for graphics output it doesn't matter much if a pixel here or
there is computed (a bit) wrong; for GPU computing it does.
the teslas are also clocked a bit lower than the high-end consumer
models, which explains the faster benchmarks with the GTX 285.
 
as for how much hardware matters, i only have data for
the HOOMD-blue code. this code currently supports only
a few potential types and interactions, like LJ and the
coarse-grain potential developed in our group :), but runs
entirely on the GPU and thus achieves much higher speedups
(over 60x under favorable circumstances).
please compare to the results published on the official website at:
http://codeblue.umich.edu/hoomd-blue/benchmarks.html

i have been using HOOMD-blue 0.8.2 (= head of hoomd-0.8 branch)
with CUDA-2.2
Host Processor             Bus       GPU          #   Polymer TPS   LJ liquid TPS
Intel Woodcrest 2.66GHz    16xPCIe   GTX 285      1   335.18        341.93
Intel Harpertown 2.50GHz   16xPCIe   Tesla C1060  1   286.75        292.94
Intel Harpertown 2.50GHz   16xPCIe   Tesla C1060  2   326.16        334.35
Intel Harpertown 2.50GHz   16xPCIe   Tesla C1060  3   316.02        336.07
Intel Harpertown 2.66GHz   16xPCIe   GTX 295      1   285.89        289.59
Intel Harpertown 2.66GHz   16xPCIe   GTX 295      2   352.16        366.47
Intel Harpertown 2.66GHz   8xPCIe    Tesla S1070  1   276.72        281.02
Intel Harpertown 2.66GHz   8xPCIe    Tesla S1070  2   226.43        231.34
Intel Nehalem 2.66GHz      8xPCIe    Tesla S1070  1   288.39        292.86
Intel Nehalem 2.66GHz      8xPCIe    Tesla S1070  2   343.13        341.01
Intel Nehalem 2.80GHz      16xPCIe   Tesla M1060  1   299.72        303.58
Intel Nehalem 2.80GHz      16xPCIe   Tesla M1060  2   417.04        416.45

cheers,
   axel.

>
> Gianluca
>
>
> On Sun, 30 Aug 2009, Mike Kuiper wrote:
>
> > Hello Namd users,
> >
> > Following recent discussion on namd2.7b1 cuda performance I thought I'd add some of my own numbers for those who are interested.
> > I have been running extensive namd2.7b1 benchmarks on the apoa1 example on a range of GPU-enabled hardware rigs and input configurations
> > that show some interesting performance characteristics.
> >
> > We have used 4 hardware configurations with each box based around an
> > Asus p5Q Pro Motherboard equipped with an Intel Quadcore Q8200 Processor with 4GB Kingston DDR2 RAM
> >
> > Box 1) 1 x GTX280 1GB
> > Box 2) 2 x GTX280 2x1GB
> > Box 3) 1 x GTX285 1GB
> > Box 4) 1 x Tesla C1060 4GB
> >
> > I also used 8cpu on our regular cluster for comparison. ( 2 x 2.4GHz AMD Shanghai Quadcores, 32 GB, Infiniband)
> >
> > The configuration files were similar to the apoa1 benchmark configuration, except that the cutoff value was varied
> > from 7 to 24 A in 0.5 A increments, with the switching distance set to (cutoff - 2) and the pairlist distance to (cutoff + 1.5).
> > Additionally, every on/off combination of the twoAwayX, twoAwayY and twoAwayZ flags was used.
> > PME was enabled in every configuration file, and all jobs were run using namd2.7b1.
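[the sweep described above (switching distance = cutoff - 2, pairlist distance = cutoff + 1.5, every on/off combination of the three twoAway flags) can be sketched as a small generator script. the dictionary keys mirror the NAMD parameter names; the script itself is illustrative, not the one actually used for these runs:]

```python
import itertools

def gen_configs():
    # cutoffs from 7 to 24 A in 0.5 A increments, as in the sweep above
    cutoffs = [7.0 + 0.5 * i for i in range(35)]  # 7.0, 7.5, ..., 24.0
    configs = []
    for cutoff in cutoffs:
        # every on/off combination of the three twoAway decomposition flags
        for x, y, z in itertools.product(("off", "on"), repeat=3):
            configs.append({
                "cutoff": cutoff,
                "switchdist": cutoff - 2.0,    # switching distance = cutoff - 2
                "pairlistdist": cutoff + 1.5,  # pairlist distance = cutoff + 1.5
                "twoAwayX": x,
                "twoAwayY": y,
                "twoAwayZ": z,
            })
    return configs

configs = gen_configs()
print(len(configs))  # 35 cutoff values x 8 flag combinations = 280 runs
```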
> >
> > As there is a lot of raw data, I've made some plots of the results:
> > Figure 1 can be found at:
> > http://staff.vpac.org/~mike/GPU_results/Fig1_Apo1_GTX280Benchmarks.jpg
> >
> > Figure 2 can be found at:
> > http://staff.vpac.org/~mike/GPU_results/Fig2_Apo1_GPUBenchmarks.jpg
> >
> >
> > The first figure shows the effect of varying these parameters in the configuration file, running on box 1
> > on one cpu. (aaa represents no twoAway flags enabled, aaZ represents twoAwayZ on, aYZ represents twoAwayY on/twoAwayZ on, etc.)
> >
> > The second figure shows the performance of one particular configuration (twoAwayX on/twoAwayY on, i.e. XYa) on the various
> > hardware, as a plot of performance (seconds/step) vs the cutoff value in the configuration file
> > (as well as an unoptimized job on the GTX 280 with all twoAway flags turned off).
> > Datapoints of configuration files that failed to launch are omitted.
> > All GPU-enabled jobs were run with a 1:1 cpu:GPU ratio.
> >
> > Interesting to note is the stepwise nature of the GPU performance vs the increasing cutoff in the configuration files,
> > compared to the smoothly decreasing performance of an 8cpu job on a regular cluster, possibly due to how the patches are assigned
> > and offloaded to the GPUs (though I'm no expert on that!).
> >
> > Some other observations:
> > From the graphs we can see that the GTX280 performs very similarly to the Tesla, and that the GTX285 is slightly faster.
> > The fastest hardware configuration we had was the box containing the two 280 cards, where the job was launched on 2 cpu with each cpu bound to a gpu.
> > Enabling various combinations of the twoAway flags can have a dramatic effect on the performance, especially at larger cutoff values.
> > GPU performance is not smoothly proportional to the cutoff parameters; it seems well worth optimizing your configuration file!
> > Some configurations that fail on a single cpu/GPU seem to work fine when the 2cpu/2GPU hardware configuration is used.
> >
> > I am also running benchmarks on a larger system (199,501 atoms) plus a smaller system (36,664 atoms) and hope to post those results next week.
> >
> > I'd appreciate any comments or suggestions you may have!
> >
> > Best regards,
> > Mike
> >
> > --
> > Michael Kuiper, PhD
> > Molecular modelling Scientist
> > Victorian Partnership for Advanced Computing
> > 110 Victoria Street, PO Box 201
> > Carlton South, Melbourne, Victoria, Australia. 3053
> > www.vpac.org
> > Ph : (613) 9925 4905
> > Fax: (613) 9925 4647
> >
> >
>
> -----------------------------------------------------
> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> +1 (206) 685 4435
> +1 (206) 714 4303
> http://artemide.bioeng.washington.edu/
>
> Postdoc at the Department of Bioengineering
> at the University of Washington, Seattle WA U.S.A.
> -----------------------------------------------------
>

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com 
Research Associate Professor
Institute for Computational Molecular Science
College of Science and Technology
Temple University, Philadelphia PA, USA.
