From: Roman Petrenko (rpetrenko_at_gmail.com)
Date: Sun Aug 30 2009 - 16:06:26 CDT
please update this post when you have benchmarks for larger (>200k)
and smaller (<35k atoms) systems.
On Sun, Aug 30, 2009 at 4:54 AM, Mike Kuiper<mike_at_vpac.org> wrote:
> Hello Namd users,
> Following recent discussion on namd2.7b1 cuda performance I thought I'd add some of my own numbers for those who are interested.
> I have been running extensive namd2.7b1 benchmarks on the apoa1 example across a range of GPU-enabled hardware rigs and input configurations,
> which show some interesting performance characteristics.
> We used four hardware configurations, each box based around an
> Asus P5Q Pro motherboard with an Intel quad-core Q8200 processor and 4 GB of Kingston DDR2 RAM:
> Box 1) 1 x GTX280 1GB
> Box 2) 2 x GTX280 2x1GB
> Box 3) 1 x GTX285 1GB
> Box 4) 1 x Tesla C1060 4GB
> I also used 8 CPUs on our regular cluster for comparison (2 x 2.4 GHz AMD Shanghai quad-cores, 32 GB RAM, Infiniband).
> The configuration files were similar to the apoa1 benchmark configurations, except that the cutoff value was varied
> from 7 to 24 A in 0.5 A increments, with the switching distance set to (cutoff - 2) and the pairlist distance to (cutoff + 1.5).
> Additionally, every on/off combination of the twoAwayX, twoAwayY and twoAwayZ flags was used.
> PME was enabled in every configuration file, and all jobs were run with namd2.7b1.
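For readers who want to set up a similar sweep, here is a minimal sketch (not the script actually used in these benchmarks; the formatting and helper names are assumptions) that enumerates the same parameter grid: cutoff from 7 to 24 A in 0.5 A steps, switching distance = cutoff - 2, pairlist distance = cutoff + 1.5, and every on/off combination of the twoAway flags. The keywords shown (cutoff, switchdist, pairlistdist, PME, twoAwayX/Y/Z) are standard NAMD configuration keywords; the rest of the apoa1 benchmark configuration is assumed unchanged.

```python
from itertools import product

# Cutoff sweep: 7 to 24 A in 0.5 A steps (35 values), as described above.
cutoffs = [7.0 + 0.5 * i for i in range(35)]

# The three patch-splitting flags; every on/off combination gives 8 variants.
FLAGS = ("twoAwayX", "twoAwayY", "twoAwayZ")

def config_fragment(cutoff, enabled):
    """Return the NAMD keyword lines varied in one benchmark run.

    Only the swept keywords are emitted; everything else in the
    apoa1 benchmark configuration is left as-is.
    """
    lines = [
        f"cutoff        {cutoff}",
        f"switchdist    {cutoff - 2.0}",   # switching distance = cutoff - 2
        f"pairlistdist  {cutoff + 1.5}",   # pairlist distance  = cutoff + 1.5
        "PME           on",                # PME enabled in every run
    ]
    lines += [f"{flag}      yes" for flag in enabled]
    return "\n".join(lines)

# Enumerate all 35 x 8 = 280 parameter combinations.
fragments = [
    config_fragment(c, [f for f, on in zip(FLAGS, mask) if on])
    for c in cutoffs
    for mask in product((False, True), repeat=len(FLAGS))
]
```

Each fragment would be appended to a copy of the base apoa1 configuration to produce one run; the "aaa"/"aaZ"/"XYa" labels in the figures correspond to which of the three flags are enabled.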
> As there is a lot of raw data, I've made some plots of the results:
> Figure 1 can be found at:
> Figure 2 can be found at:
> The first figure shows the effect of varying these parameters in the configuration file when running on box 1
> with one CPU. (aaa represents no twoAway flags enabled, aaZ represents twoAwayZ on, aYZ represents twoAwayY and twoAwayZ on, etc.)
> The second figure shows the performance of one particular configuration (twoAwayX on/twoAwayY on, i.e. XYa) on the various
> hardware, plotted as performance (seconds/step) vs. the cutoff value in the configuration file
> (as well as an unoptimized job on the GTX280 with all twoAway flags turned off).
> Datapoints of configuration files that failed to launch are omitted.
> All GPU-enabled jobs were run with a 1:1 cpu:GPU ratio.
> It is interesting to note the stepwise nature of GPU performance as the cutoff in the configuration files increases,
> compared to the smoothly decreasing performance of the 8-CPU job on a regular cluster; this is possibly due to how the patches are assigned
> and offloaded to the GPUs (though I'm no expert on that!).
> Some other observations:
> From the graphs we can see that the GTX280 performs very similarly to the Tesla, and that the GTX285 is slightly faster.
> The fastest hardware configuration was the box containing the two GTX280 cards, with the job launched on two CPUs, each bound to a GPU.
> Enabling various combinations of the twoAway flags can have a dramatic effect on the performance, especially at larger cutoff values.
> GPU performance is not smoothly proportional to the cutoff parameters - it seems well worth optimizing your configuration file!
> Some configurations that fail on a single CPU/GPU seem to work fine when the 2 CPU/2 GPU hardware configuration is used.
> I am also running benchmarks on a larger system (199,501 atoms) and a smaller system (36,664 atoms), and I hope to post those results next week.
> I'd appreciate any comments or suggestions you may have!
> Best regards,
> Michael Kuiper, PhD
> Molecular modelling Scientist
> Victorian Partnership for Advanced Computing
> 110 Victoria Street, PO Box 201
> Carlton South, Melbourne, Victoria, Australia. 3053
> Ph : (613) 9925 4905
> Fax: (613) 9925 4647
--
Roman Petrenko
Physics Department
University of Cincinnati
This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:53:14 CST