Re: Cuda benchmarks namd2.7b1 - effects of configuration parameters and hardware.

From: Gianluca Interlandi (gianluca_at_u.washington.edu)
Date: Sun Aug 30 2009 - 23:55:57 CDT

I might actually be wrong about the Nehalem. The Nehalem E5530 2.4 GHz
might be faster than the Harpertown E5430 2.66 GHz, and it's roughly the
same price. So, Nehalem is probably the way to go to build a fast box for
NAMD. And you can always throw in one or two GTX 285 cards if the motherboard
has 16x PCIe 2.0 slots.

Gianluca

On Sun, 30 Aug 2009, Gianluca Interlandi wrote:

> Thanks Axel,
>
> I have also compared the cost of a GPU-equipped PC versus a dual quad-core PC.
> You could build a PC with the Asus P5Q Pro motherboard, an Intel quad-core Q8200,
> and two GTX 285 cards for under $900 (plus the cost of the usual PC
> components). On the other hand, for $1500 you could get two Xeon quad-core
> E5430 2.66 GHz CPUs plus a motherboard, which is faster than the GPU-equipped PC. (I think
> that Nehalems are still too expensive.)
>
> So, in summary: if you want a fast box, you'd rather go with conventional CPUs;
> if you want a good performance/cost ratio, you go with the GPU. Also, with the GPU
> PC you could run two jobs at the same time, each only slightly slower
> than on the dual quad-core PC. On the other hand, the dual quad-core PC can
> also be used for other tasks, like implicit-solvent MD or APBS, when you are not
> using NAMD. So, it all depends on your needs.
>
> Gianluca
>
> On Sun, 30 Aug 2009, Axel Kohlmeyer wrote:
>
>> On Sun, 2009-08-30 at 12:11 -0700, Gianluca Interlandi wrote:
>> > Thank you Mike for sharing your information.
>> >
>> > According to your benchmarks, it looks like the less expensive GTX 285
>> > (around $300) performs better than a Tesla C1060, which is 4 times more
>> > expensive.
>>
>> gianluca,
>>
>> this is a correct observation. the GTX 285 is currently
>> the fastest nvidia GPU available.
>>
>> > However, as Axel pointed out, the hardware configuration might play a
>> > major role. Maybe, there is a configuration where the much more
>> > expensive
>> > Tesla would have a clear advantage over the GTX280/285.
>>
>> not really. the advantages of the teslas lie elsewhere. first of all,
>> the C1060s (and the S1070, which is basically just a case with a power supply,
>> two PCIe bridges, connectors, and 4 C1060s) come with 4GB of memory
>> compared to the (typical) 1GB of the GTX 285. for codes that run
>> entirely on the GPU and/or have large data sets, this matters a lot.
>>
>> classical MD, and particularly the way NAMD (currently) uses the
>> GPU, is not a good case for that. one important application of GPUs is
>> fast processing of MRI or similar data. there every byte counts and the
>> 4GB of the tesla are essential. and in general, the largest benefits
>> from GPUs are achieved when processing large data sets with high
>> computational complexity; with classical MD running in parallel you
>> don't have this, which is one of the reasons why it is so difficult to
>> get high speedups.
>>
>> the teslas are also more thoroughly tested and certified, and the
>> components used should be less sensitive to (minor) failures.
>> in graphics output it doesn't matter much if a pixel here or
>> there is computed (a bit) wrong, but for GPU computing it does.
>> the teslas are also clocked a bit lower than the high-end consumer models,
>> which explains the faster benchmarks with the GTX 285.
>>
>> as for how much the host hardware matters, i only have data for
>> the HOOMD-blue code. this code currently only supports
>> a few potential types and interactions, like LJ and the
>> coarse-grained potential developed in our group :), but runs
>> entirely on the GPU and thus achieves much higher speedups
>> (over 60x under favorable circumstances).
>> please compare with the results published on the official website at:
>> http://codeblue.umich.edu/hoomd-blue/benchmarks.html
>>
>> i have been using HOOMD-blue 0.8.2 (= head of hoomd-0.8 branch)
>> with CUDA-2.2
>> Host Processor            Bus      GPU          #GPUs  Polymer TPS  LJ liquid TPS
>> Intel Woodcrest 2.66GHz   16xPCIe  GTX 285        1      335.18       341.93
>> Intel Harpertown 2.50GHz  16xPCIe  Tesla C1060    1      286.75       292.94
>> Intel Harpertown 2.50GHz  16xPCIe  Tesla C1060    2      326.16       334.35
>> Intel Harpertown 2.50GHz  16xPCIe  Tesla C1060    3      316.02       336.07
>> Intel Harpertown 2.66GHz  16xPCIe  GTX 295        1      285.89       289.59
>> Intel Harpertown 2.66GHz  16xPCIe  GTX 295        2      352.16       366.47
>> Intel Harpertown 2.66GHz   8xPCIe  Tesla S1070    1      276.72       281.02
>> Intel Harpertown 2.66GHz   8xPCIe  Tesla S1070    2      226.43       231.34
>> Intel Nehalem 2.66GHz      8xPCIe  Tesla S1070    1      288.39       292.86
>> Intel Nehalem 2.66GHz      8xPCIe  Tesla S1070    2      343.13       341.01
>> Intel Nehalem 2.80GHz     16xPCIe  Tesla M1060    1      299.72       303.58
>> Intel Nehalem 2.80GHz     16xPCIe  Tesla M1060    2      417.04       416.45
>>
>> cheers,
>> axel.
>>
>>
>>
>> >
>> > Gianluca
>> >
>> >
>> > On Sun, 30 Aug 2009, Mike Kuiper wrote:
>> >
>> > > Hello Namd users,
>> > >
>> > > Following the recent discussion on namd2.7b1 CUDA performance, I thought
>> > > I'd add some of my own numbers for those who are interested.
>> > > I have been running extensive namd2.7b1 benchmarks on the apoa1
>> > > example on a range of GPU-enabled hardware rigs and input configurations
>> > > that show some interesting performance characteristics.
>> > >
>> > > We used 4 hardware configurations, with each box based around an
>> > > Asus P5Q Pro motherboard equipped with an Intel quad-core Q8200
>> > > processor and 4GB of Kingston DDR2 RAM:
>> > >
>> > > Box 1) 1 x GTX280 1GB
>> > > Box 2) 2 x GTX280 2x1GB
>> > > Box 3) 1 x GTX285 1GB
>> > > Box 4) 1 x Tesla C1060 4GB
>> > >
>> > > I also used 8 CPUs on our regular cluster for comparison (2 x 2.4GHz
>> > > AMD Shanghai quad-cores, 32 GB RAM, InfiniBand).
>> > >
>> > > The configuration files were similar to the apoa1 benchmark
>> > > configuration, except that the cutoff was varied
>> > > from 7 to 24 A in 0.5 A increments, with the switching distance set to
>> > > (cutoff - 2) and the pairlist distance to (cutoff + 1.5).
>> > > Additionally, every on/off combination of the twoAwayX, twoAwayY and
>> > > twoAwayZ flags was used.
>> > > PME was enabled in every configuration file, and all jobs were run
>> > > using namd2.7b1. (An example sweep point is sketched below.)
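>> > >
>> > > For illustration only, one point of that sweep (cutoff 12 A with only
>> > > twoAwayX enabled) would look roughly like this in the NAMD config file;
>> > > the values and the flag combination here are just an example, not a
>> > > copy of the actual files used:
>> > >
>> > >   # cutoff 12 A  ->  switchdist = cutoff - 2, pairlistdist = cutoff + 1.5
>> > >   cutoff         12.0
>> > >   switching      on
>> > >   switchdist     10.0
>> > >   pairlistdist   13.5
>> > >   twoAwayX       yes   ;# the "Xaa" combination; twoAwayY/twoAwayZ left off
>> > >   PME            yes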
>> > >
>> > > As there is a lot of raw data, I've made some plots of the results:
>> > > Figure 1 can be found at:
>> > > http://staff.vpac.org/~mike/GPU_results/Fig1_Apo1_GTX280Benchmarks.jpg
>> > >
>> > > Figure 2 can be found at:
>> > > http://staff.vpac.org/~mike/GPU_results/Fig2_Apo1_GPUBenchmarks.jpg
>> > >
>> > >
>> > > The first figure shows the effect of varying these parameters in the
>> > > configuration file when running on box 1
>> > > on one CPU. (aaa represents no twoAway flags enabled, aaZ represents
>> > > twoAwayZ on, aYZ represents twoAwayY on / twoAwayZ on, etc.)
>> > >
>> > > The second figure shows the performance of one particular
>> > > configuration (twoAwayX on / twoAwayY on, i.e. XYa) on the various
>> > > hardware, as a plot of performance (seconds/step) vs. the cutoff value
>> > > in the configuration file
>> > > (as well as an unoptimized job on the GTX 280 with all twoAway flags
>> > > turned off).
>> > > Data points for configuration files that failed to launch are omitted.
>> > > All GPU-enabled jobs were run with a 1:1 CPU:GPU ratio.
>> > >
>> > > Interesting to note is the stepwise nature of performance on the GPUs
>> > > vs. increasing cutoff in the configuration files,
>> > > compared to the smoothly decreasing performance of an 8-CPU job on a
>> > > regular cluster, possibly due to how the patches are assigned
>> > > and offloaded to the GPUs (though I'm no expert on that!).
>> > >
>> > > Some other observations:
>> > > From the graphs we can see that the GTX 280 performs very similarly to
>> > > the Tesla, and that the GTX 285 is slightly faster.
>> > > The fastest hardware configuration we had was the box containing the
>> > > two GTX 280 cards, where the job was launched on 2 CPUs with each CPU
>> > > bound to a GPU.
>> > > Enabling various combinations of the twoAway flags can have a dramatic
>> > > effect on performance, especially at larger cutoff values.
>> > > GPU performance is not smoothly proportional to the cutoff parameters;
>> > > it seems well worth optimizing your configuration file!
>> > > Some configurations that fail on a single CPU/GPU seem to work fine
>> > > when the 2 CPU / 2 GPU hardware configuration is used (example launch
>> > > lines are sketched below).
>> > >
>> > > I am also running benchmarks on a larger system (199,501 atoms) and a
>> > > smaller system (36,664 atoms) and hope to post those results next week.
>> > >
>> > > I'd appreciate any comments or suggestions you may have!
>> > >
>> > > Best regards,
>> > > Mike
>> > >
>> > > --
>> > > Michael Kuiper, PhD
>> > > Molecular modelling Scientist
>> > > Victorian Partnership for Advanced Computing
>> > > 110 Victoria Street, PO Box 201
>> > > Carlton South, Melbourne, Victoria, Australia. 3053
>> > > www.vpac.org
>> > > Ph : (613) 9925 4905
>> > > Fax: (613) 9925 4647
>> > >
>> > >
>> >
>> >
>> --
>> Dr. Axel Kohlmeyer akohlmey_at_gmail.com
>> Research Associate Professor
>> Institute for Computational Molecular Science
>> College of Science and Technology
>> Temple University, Philadelphia PA, USA.
>>
>>
>
>
>
>

-----------------------------------------------------
Gianluca Interlandi, PhD gianluca_at_u.washington.edu
                     +1 (206) 685 4435
                     +1 (206) 714 4303
                     http://artemide.bioeng.washington.edu/

Postdoc at the Department of Bioengineering
at the University of Washington, Seattle WA U.S.A.
-----------------------------------------------------
