Cuda benchmarks namd2.7b1 - effects of configuration parameters and hardware.

From: Mike Kuiper (
Date: Sun Aug 30 2009 - 03:54:57 CDT

Hello Namd users,

Following the recent discussion on namd2.7b1 CUDA performance, I thought I'd add some of my own numbers for those who are interested.
I have been running extensive namd2.7b1 benchmarks of the apoa1 example across a range of GPU-enabled hardware rigs and input configurations,
and the results show some interesting performance characteristics.

We used 4 hardware configurations, each box based around an
Asus P5Q Pro motherboard with an Intel quad-core Q8200 processor and 4 GB of Kingston DDR2 RAM:

Box 1) 1 x GTX280 1GB
Box 2) 2 x GTX280 2x1GB
Box 3) 1 x GTX285 1GB
Box 4) 1 x Tesla C1060 4GB

For comparison, I also ran on 8 CPU cores of our regular cluster (2 x 2.4GHz AMD Shanghai quad-cores, 32 GB, Infiniband).

The configuration files were based on the apoa1 benchmark configuration, except that the cutoff was varied
from 7 to 24 A in 0.5 A increments, with the switching distance set to (cutoff - 2) and the pairlist distance set to (cutoff + 1.5).
Additionally, every on/off combination of the twoAwayX, twoAwayY and twoAwayZ flags was used.
PME was enabled in every configuration file and all jobs were run using namd2.7b1.
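
For anyone wanting to set up a similar sweep, here is a rough Python sketch of how the parameter combinations described above could be enumerated. This is not the script used for these benchmarks. The function names sweep() and render() are hypothetical, and only the keywords that were varied are written out (cutoff, switchdist, pairlistdist, twoAwayX/Y/Z, PME); the rest of the apoa1 configuration is assumed to be supplied separately.

```python
from itertools import product

def sweep():
    """Yield one settings dict per cutoff / twoAway combination."""
    # Cutoffs 7.0, 7.5, ..., 24.0 A (35 values).
    cutoffs = [7.0 + 0.5 * i for i in range(35)]
    # Every on/off combination of the three twoAway flags (8 combinations).
    flag_sets = product((False, True), repeat=3)
    for cutoff, flags in product(cutoffs, flag_sets):
        settings = {
            "cutoff": cutoff,
            "switchdist": cutoff - 2.0,     # switching distance = cutoff - 2
            "pairlistdist": cutoff + 1.5,   # pairlist distance = cutoff + 1.5
            "PME": "yes",
        }
        for axis, on in zip("XYZ", flags):
            settings["twoAway" + axis] = "yes" if on else "no"
        yield settings

def render(settings):
    """Format the varied keywords as 'keyword value' lines."""
    return "\n".join(f"{key:<14}{value}" for key, value in settings.items())

runs = list(sweep())
print(len(runs))        # number of generated parameter sets
print(render(runs[0]))  # first parameter set (cutoff 7.0, all flags off)
```

The rendered lines would still need to be merged with the fixed part of the benchmark configuration before being passed to namd2.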

As there is a lot of raw data, I've made some plots of the results:
Figure 1 can be found at:

Figure 2 can be found at:

The first figure shows the effect of varying these parameters in the configuration file for jobs running on box 1
on one CPU. (aaa means no twoAway flags enabled, aaZ means twoAwayZ on, aYZ means twoAwayY and twoAwayZ on, etc.)
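
To make the three-letter labels concrete, here is a small Python sketch that decodes a label into the corresponding flag settings. The helper name decode_label() is hypothetical and exists only for illustration. It assumes that each position holds either its axis letter (flag on) or "a" (flag off), as in the examples above.

```python
def decode_label(label):
    """Map a label such as 'aYZ' to the on/off state of each twoAway flag."""
    if len(label) != 3:
        raise ValueError("label must have exactly three characters")
    flags = {}
    for axis, char in zip("XYZ", label):
        if char not in (axis, "a"):
            raise ValueError(f"unexpected character {char!r} for axis {axis}")
        # The axis letter means the flag is on; 'a' means it is off.
        flags["twoAway" + axis] = (char == axis)
    return flags

print(decode_label("aYZ"))
# {'twoAwayX': False, 'twoAwayY': True, 'twoAwayZ': True}
```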

The second figure shows the performance of one particular configuration (twoAwayX on/twoAwayY on, i.e. XYa) on the various
hardware, plotted as performance (seconds/step) vs the cutoff value in the configuration file.
It also includes an unoptimized job on the GTX280 with all twoAway flags turned off.
Data points for configuration files that failed to launch are omitted.
All GPU-enabled jobs were run with a 1:1 CPU:GPU ratio.

It is interesting to note the stepwise nature of GPU performance as the cutoff in the configuration file increases,
compared to the smooth decrease in performance of the 8 CPU job on the regular cluster. This is possibly due to how the patches are assigned
and offloaded to the GPUs (though I'm no expert on that!).

Some other observations:
From the graphs we can see that the GTX280 performs very similarly to the Tesla, and that the GTX285 is slightly faster.
The fastest hardware configuration we had was the box containing the two GTX280 cards, where the job was launched on 2 CPUs with each CPU bound to a GPU.
Enabling various combinations of the twoAway flags can have a dramatic effect on performance, especially at larger cutoff values.
GPU performance is not smoothly proportional to the cutoff; it seems well worth optimizing your configuration file!
Some configurations that fail on a single CPU/GPU seem to work fine when the 2 CPU/2 GPU hardware configuration is used.

I am also running benchmarks on a larger system (199,501 atoms) plus a smaller system (36,664 atoms) and hope to post those results next week.

I'd appreciate any comments or suggestions you may have!

Best regards,

Michael Kuiper, PhD
Molecular modelling Scientist
Victorian Partnership for Advanced Computing
110 Victoria Street,  PO Box 201
Carlton South, Melbourne, Victoria, Australia. 3053
Ph : (613) 9925 4905
Fax: (613) 9925 4647 
