NAMD-CUDA benchmarks, dual GTX295, dual Xeon Nehalem system

From: Biff Forbush (biff.forbush_at_yale.edu)
Date: Sat Jan 09 2010 - 22:16:41 CST

Hi All,

I have carried out benchmark tests of a dual Xeon Nehalem system with
two GTX295 cards. There is quite a bit of data, so I have put together
graphs, available here:

http://sites.google.com/site/namdcuda/namd-benchmarks

Again, the objective was to put together the fastest reasonable single-node
system, with maximum performance in CPU-alone mode in case CUDA is
too “beta”. The current system has a Tyan S7025 motherboard (onboard VGA
for the console) with dual Xeon Nehalem CPUs (3.33 GHz, Scythe coolers), dual
X58s (northbridges for the PCIe x16 slots), two GeForce GTX295 cards (BFG), and 12 GB
of DDR3-1333 memory.

Note that each GTX295 has two GPUs (each comparable to a GTX260/280), so we have 8
CPU cores and 4 GPU cores in all.

I performed benchmarks with DHFR (23,558 atoms), er-gre (36,753 atoms), ApoA1
(92,224 atoms), F1ATPase (327,506 atoms), and STMV (1,066,628 atoms) (see
below for details). When all is said and done, the best times were as
follows:

Problem    DHFR     er-gre   ApoA1    F1ATPase  STMV
Atoms      23,558   36,753   92,224   327,506   1,066,628
s/step     0.0158   0.0187   0.0501   0.126     0.425
day/ns     0.1835   0.217    0.5802   1.43      4.92
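
(The day/ns row is just a unit conversion of the s/step row: assuming the 1 fs
timestep used in these benchmark configs, day/ns = s/step x 10^6 steps/ns / 86,400 s/day,
e.g. 0.425 x 10^6 / 86,400 = 4.92 day/ns for STMV.)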

There are a number of interesting observations / interpretations:

1) CPU-alone performance is proportional to the number of cores from 1 to 8,
and, as I noted in an earlier post, it is also proportional to CPU clock speed
between 2.0 GHz and 3.33 GHz (except for small systems) (figs. 1, 4).

2) The GPU speedup is 5-6x with 4 fast CPU cores (one CPU core per GPU
core). Overall, the GPU speedup is between 2x and 9x, depending on
problem size and on how slow the CPU-only configuration is to start with (figs. 5, 6).

3) Scaling with multiple GPU cores is sub-linear: the speed with 4 GPU
cores is only 0.7-0.8x of four times the single-GPU-core speed. For large systems,
the deficit can be completely erased by 2-fold oversubscribing the GPUs, i.e. using 8
CPU cores for 4 GPU cores (figs. 2, 3); a sample command line for such a run follows
the end of this list.

4) As expected, when using GPUs, CPU speed (2.0 vs. 3.33 GHz) is less
important. Nonetheless, with 4 CPU and 4 GPU cores, a 1.66x increase in
CPU clock rate gives a 1.3x increase in speed for a large system (STMV)
(fig. 4). Again, most of this difference can be made up by two-fold
oversubscribing (8 CPU cores, 4 GPU cores), i.e. a dual 2.0 GHz Xeon system
performs similarly to a single 3.33 GHz one (figs. 2, 3).

5) Following on from (3) and (4), a single mid-range Nehalem (4 cores, e.g.
2.66 GHz) is just sufficient to get most of the performance out of the
2x GTX295 configuration. I ran a few tests on a Core i7 2.66 GHz
(Lynnfield) machine with 4 GB of RAM, with the following best results (at
+p4): DHFR 0.018 s/step, ApoA1 0.055 s/step, F1ATPase 0.163 s/step (STMV
did not fit in memory); these are 10-28% slower than the 3.33 GHz dual-Nehalem system.

6) With sufficient CPU power, the effectiveness of the NAMD-CUDA system
increases somewhat with increasing problem size. The DHFR benchmark does
considerably worse with NAMD-CUDA, presumably because of differences in the
Amber setup (rigid water?) (fig. 5).

7) 3x and 4x oversubscription of the GPUs using CPU hyperthreading is a
waste of energy, no surprise here (it is just a good way to heat the
CPUs up another 10 °C). Only with STMV is there a very small (3-8%)
increase in speed at 4x versus 1x. Without the GPUs, hyperthreading
STMV with +p16 gives a 17% speedup (figs. 7, 8).

8) Memory. GPU memory was limiting only when attempting STMV with +p1
(it errored out with a malloc error). CPU memory utilization maxed out at 6-7
GB with STMV.

9) Power. This is not as bad as one might expect from the system specs.
A power meter at the wall registers 660 W running STMV with +p8 (590 W
with +p4, 750 W with +p16), over a baseline idle of 300 W (CPU-only at
+p8 is 335 W). Consistent with the 50% fan duty-cycle on the GTX295s,
this suggests that the GTX295s are operating at well under their rated
power consumption. Both GPUs and CPUs run at 62-67 °C in this
system at +p8 (+p16 takes the CPUs to 77 °C, which is not good and not very
useful). [The X58 northbridge temperature problem referred to in a previous
post was easy to solve with a small fan mounted nearby.]
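
For anyone repeating the oversubscribed runs from points (3) and (7), the
invocation is just the usual one with more processes than GPU cores. A sketch of
what I mean (file names, log names, and the +devices list here are only
illustrative; as I read the release notes, the CUDA build assigns processes to
all visible GPUs round-robin by default, so +devices is just being explicit):

    ./namd2 +p8 +setcpuaffinity +devices 0,1,2,3 stmv.namd > stmv_p8.log
    ./namd2 +p16 +setcpuaffinity +devices 0,1,2,3 stmv.namd > stmv_p16.log

The +p16 line is the 4x-oversubscription (hyperthreading) case from point (7).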

Other comments and digressions:
1) All of these benchmarks were run from the console. However, contrary
to warnings in the release notes and on this site, I detected little or
no effect (within ±1% on speed) on any of the GPU benchmarks (from +p1
to +p16) when I ran them from a terminal under GNOME (e.g. fig. 9, which
also shows performance with a single GTX295). Perhaps this is because the
Metacity/GNOME window manager (under Arch Linux) does not use the
GPUs for window rendering and I did not have any other fancy doodads
running. (MS Windows users, you may find this Arch Linux windows business
to be a real kludge.) (OK, you ask, why would I use Arch? Well, I
switched to it after a day of fun fighting with Fedora 12, Nouveau, and
NVIDIA trying to get CUDA working; the Fedora folks don’t like the
NVIDIA drivers.)

2) The results in fig. 3 suggest that good performance should be
obtained with 4x GTX295 in a 3.33 GHz Nehalem system. This Tyan
motherboard supports four double-width PCIe x16 slots (using dual X58s to
handle them), but there are simple physical problems. Doing anything other than
putting cards in slots #1 and #3 results in units sitting right next to
one another; this is not an option if the fan inlet is on the side of
the unit, as it is on my BFG GTX295s (slot #4 also has other physical
conflicts). This could be solved with GPU units that have the fan inlet on the
edge of the board, with water-cooling of the GPUs, or with other
physical modifications and PCIe risers.

3) Benchmarks. Wouldn’t it be nice if the benchmarks on the UIUC site
could run on NAMD-CUDA right out of the box, especially ApoA1, which is
billed as “the standard NAMD cross-platform benchmark for years.” Note
that commenting out the NBFIX statements in the xplor parameter file would not
change ApoA1 a bit, and probably would not affect F1ATPase for benchmark
purposes (see below). Alternatively, perhaps a note could be added to
the site (or to the release notes) about how to fix this. (The problem
is not helped by the “NFBIX” typo in the error message from NAMD-CUDA.)

4) It looks as though, within the GTX200 series, the GTX295 ends up as the
fastest NAMD-CUDA unit: although the 295 has 15-20% lower clock speeds
than the 285, it has twice the number of CUDA cores (240 per GPU) and
apparently sufficient memory for each. This conclusion is supported by
comparing the ApoA1 results here with the GTX285 and 2x GTX280 results
reported in an earlier post by Mike Kuiper. A downside might be that
the dual-GPU architecture requires more CPU cores, but that is not
likely to be a limiting factor in most systems (2x GTX295 with a Core
i7, 4x GTX295 with a dual Xeon).

5) Looking ahead to Fermi, it is not obvious how much initial speedup there will
be over the GTX295: the first parts to be released will have
512 CUDA cores (or fewer, if reduced as a result of rumored production
problems). The big improvements are in double precision and internal
architecture; do we know whether these will be useful before NAMD is
substantially revised? In the long run, these improvements should
greatly facilitate moving virtually all of the NAMD workload to the GPU. Can
Jim Phillips give us any perspective on that?

Benchmark details: er-gre (36,753 atoms), ApoA1 (92,224 atoms), F1ATPase
(327,506 atoms), and STMV (1,066,628 atoms) are from the NAMD site, with
minor changes: 1) outputEnergies is set to 600 to take it out of the 500-step
benchmark window (as suggested in the release notes); 2) outputTiming
is set to 100; 3) the NBFIX statements in the ApoA1 and F1ATPase xplor parameter files are
commented out. [This makes no difference to ApoA1, since there are no
Na, K, or Mg atoms, but it may have a small effect on the F1ATPase simulation,
which has Na and Mg.] DHFR (23,558 atoms) is from the Amber benchmark
page, modified as above with respect to outputEnergies and outputTiming (DHFR was
also run with the cutoff and step parameters from the ApoA1 “cut12” setup). All
runs used +setcpuaffinity and 500 steps.
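
For anyone reproducing this setup, the edits amount to something like the
following (a sketch only; file names are placeholders for your local copies of
the benchmark configs, and the commented NBFIX line just stands in for whatever
pair entries your xplor parameter file contains). In each .namd config file:

    # keep energy output out of the 500-step timing window (per the release notes)
    outputEnergies 600
    # report timings every 100 steps
    outputTiming 100

In the xplor parameter files, prefix each NBFIX line with the “!” comment
character, e.g.:

    ! NBFIX  <atom-type pair parameters>

A typical run then looks like:

    ./namd2 +p8 +setcpuaffinity apoa1.namd > apoa1_p8.log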

Regards,
Biff Forbush
