From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Fri Aug 28 2009 - 10:20:21 CDT
On Thu, 2009-08-27 at 17:03 -0700, Gianluca Interlandi wrote:
> Thank you Ron for posting this.
since nobody else seems to have the courage to step up,
a few comments from my personal tests.
running the GPU version of NAMD requires some special care.
it is not like you just plug a GPU into your machine and
you outrun any supercomputer. in NAMD currently only the
non-bonded force calculation is offloaded to the GPU. also,
you currently cannot mix and match GPU and non-GPU non-bonded
kernels, but you can oversubscribe GPUs unless the nvidia
driver is configured for compute-exclusive mode.
with one NAMD task and one high-end GPU (Tesla C1060/S1070,
GTX 285/275) i have seen anything between no speedup and about 5x.
using a second task on the same GPU or a second GPU has never
worked reliably for me, but that may be due to reasons outside
of NAMD; i have never had the time to track that down. from
intermediate timings, i can see that oversubscribing gives
a smaller, but still measurable, additional speedup.
as for hardware choice, there are two key elements:
the memory bandwidth and the PCIe speed. AMD opteron-like
or intel nehalem-like (i7-9x0, xeon X55x0) processors
have 2-3x more memory bandwidth than intel core 2 and
earlier. for a single GPU this does not make a massive
difference, but it is still noticeable. when using multiple GPUs
it is essential. processor and memory affinity also need
to be exploited for optimal performance. so intel nehalem
would currently be the best choice in terms of memory bandwidth.
for the PCIe bus, a generation 2.0 slot with 16 lanes is
required to get decent performance. the intel i/o hub offers
38 PCIe lanes, and you have to check how they are wired.
ideally you want two 16x slots and one 4x slot. there are also
dual CPU boards with two i/o hubs and four 16x PCIe slots,
which makes 8 GPUs (4x GTX 295 or 2x Tesla S1070) the maximum
configuration, but keep in mind that in this setup each pair
of GPUs shares one PCIe slot behind a PCIe bridge.
nevertheless, this seems to be the configuration
providing the most GPU performance in a single node. however,
i know of some tests where people tried that, and it turned
out that the mainboards had additional constraints on how the
PCIe lanes are used, reducing one of the nominally 16x slots
to 4x speed if a 4x speed device is used in the 4x slot.
there are also mainboards that use 16x connectors but have
them wired as 4x. looking at the fine print in the specs is
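to put the 16x vs. 4x difference in numbers, here is a minimal sketch of the theoretical PCIe 2.0 bandwidth per slot width. it only uses the published signalling rate (5 GT/s per lane) and the 8b/10b encoding overhead; the function name is made up for illustration, and real host-to-GPU transfers will come in below these peak figures.

```python
# Back-of-the-envelope PCIe 2.0 bandwidth per slot width.
# PCIe 2.0 signals at 5 GT/s per lane with 8b/10b encoding, so 8 of every
# 10 transferred bits carry payload. These are theoretical peak numbers,
# per direction; measured transfer rates are lower.

GT_PER_S = 5.0e9             # raw transfers per second per lane (PCIe 2.0)
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b line encoding


def pcie2_bandwidth_gb_per_s(lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s for a PCIe 2.0 link."""
    payload_bits_per_s = GT_PER_S * ENCODING_EFFICIENCY * lanes
    return payload_bits_per_s / 8 / 1e9


for lanes in (16, 8, 4):
    print(f"x{lanes:<2d}: {pcie2_bandwidth_gb_per_s(lanes):.1f} GB/s per direction")
```

a slot that is wired as 4x therefore has a quarter of the peak bandwidth of a true 16x slot, regardless of the connector size.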
now comparing a single core with and without GPU is not really
a realistic comparison as it doesn't factor in the communication
overhead when running NAMD with multiple tasks. so here are some
numbers i recently was able to generate on a supermicro dual
nehalem 1U node with two tesla M1060 cards. this is with gcc/g++
compiled binaries, the equivalent intel icc/icpc binaries should
be a bit faster in CPU mode. this is a test system with about
12,000 atoms with PBC, full electrostatics etc... that gave
fairly good speedup with NAMD-gpu.
using one cpu (core):
WallClock: 762.491333 CPUTime: 760.654358
Benchmark time: 1 CPUs 0.144077 s/step 0.833778 days/ns 44.4903 MB
using one cpu and one gpu:
WallClock: 195.666748 CPUTime: 195.567276
Benchmark time: 1 CPUs 0.0341225 s/step 0.197468 days/ns 30.2373 MB
using four cpu (cores):
WallClock: 243.656143 CPUTime: 233.590485
Benchmark time: 4 CPUs 0.0386488 s/step 0.223662 days/ns 19.7549 MB
using eight cpu (cores):
WallClock: 134.973953 CPUTime: 130.455170
Benchmark time: 8 CPUs 0.0210909 s/step 0.122054 days/ns 15.5645 MB
so 1 GPU / 1Core beats using 4 cores, but not using 8 cores
of 2.8GHz nehalems. there were indications that using both GPUs
with two cores (if i had gotten it to work), would be a tiny bit
faster than using all 8 cores and 2x oversubscription would push
it a little bit further.
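the comparison above can be reproduced from the s/step values in the benchmark lines. the sketch below computes the speedup of each run relative to the single-core run, which comes out at roughly 4.2x for 1 core + 1 GPU, 3.7x for 4 cores and 6.8x for 8 cores. it also back-calculates the number of steps per ns from the days/ns figure; the result of about 500,000 steps/ns would correspond to a 2 fs timestep, which is an inference from the numbers and not something stated in the post.

```python
# s/step values copied from the "Benchmark time" lines above.
S_PER_STEP = {
    "1 core": 0.144077,
    "1 core + 1 GPU": 0.0341225,
    "4 cores": 0.0386488,
    "8 cores": 0.0210909,
}
DAYS_PER_NS_1CORE = 0.833778  # from the single-core benchmark line
SECONDS_PER_DAY = 86400.0


def speedup(label, baseline="1 core"):
    """Speedup of a run relative to the baseline run (ratio of s/step)."""
    return S_PER_STEP[baseline] / S_PER_STEP[label]


def steps_per_ns(s_per_step, days_per_ns):
    """Number of MD steps per simulated ns implied by the two timings."""
    return days_per_ns * SECONDS_PER_DAY / s_per_step


for label in S_PER_STEP:
    print(f"{label:>15s}: {speedup(label):.2f}x vs. one core")

print(f"steps per ns: {steps_per_ns(S_PER_STEP['1 core'], DAYS_PER_NS_1CORE):.0f}")
```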
other, larger inputs i was testing had a little bit less speedup.
so on average it _currently_ looks like, for NAMD, the best improvement
from adding a GPU can be had on an older single/dual core CPU
with a full 16x PCIe bus and a cost-efficient GTX 275/285 card.
if you _do_ want to go for a multi-GPU environment (and don't
mind the cost), i would try to get a 1:1 ratio of cpu cores and
GPUs; this is the best way to get really dense compute power.
those machines also seem to work very well for multiple 1core/1GPU
jobs, where the degradation of memory and PCIe bus performance due
to multiple accesses is less of an issue, as the jobs will more-or-less
automatically balance themselves to interleave optimally. i got only
about 10% penalty from running 4 concurrent 1core/1gpu jobs on a dual
harpertown node with four gpus (a full tesla S1070) connected via 2x
8-lane PCIe on a mainboard with confirmed bad PCIe/memory bus performance.
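one possible way to set up such concurrent 1core/1GPU jobs is sketched below. it only builds the command lines and does not launch anything. the binary name "namd2", the config file names (job0.namd, job1.namd, ...) and the simple "core i uses GPU i" mapping are assumptions for illustration and not taken from the tests described above. it relies on the linux taskset tool for pinning a process to a core and on the +devices option of CUDA-enabled NAMD builds for selecting the GPU.

```python
# Sketch: build command lines for N independent 1-core/1-GPU NAMD jobs.
# Assumptions (hypothetical, not from the post): the binary is called
# "namd2", the inputs are named job0.namd, job1.namd, ..., and job i is
# pinned to CPU core i and uses CUDA device i.


def build_job_commands(n_jobs, namd_binary="namd2"):
    """Return one argument list per job, pinned to its own core and GPU."""
    commands = []
    for i in range(n_jobs):
        commands.append([
            "taskset", "-c", str(i),           # pin the process to core i
            namd_binary, "+devices", str(i),   # use only CUDA device i
            f"job{i}.namd",                    # hypothetical input file
        ])
    return commands


for cmd in build_job_commands(4):
    print(" ".join(cmd))
```

on a dual-socket node the naive core-to-GPU mapping may not be the best one, since it ignores which socket is closest to which PCIe slot; that would have to be checked against the actual mainboard layout.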
this may change drastically with future improvements of the GPU
hardware. i don't know any details, but you can search the web
and see what the rumor mills have produced and if you take away
a bit from those speculations, you probably have a good low
estimate on what the next generation GPU hardware can produce.
> I am also interested in the same question. Also, I would like to know more
> precisely what brand and model of motherboard and what brand of CPU do
> people have experience with. Does it make sense to purchase for example a
please don't ask (me) for advice about a specific
brand or model of hardware. you have to run benchmarks yourself.
lots of details matter. this is a bit of the frontier of HPC and
it is impossible to keep track of everything.
one final remark. if you want to boost the performance of NAMD on
a desktop by using the GPU, please note that the X server and
NAMD would compete for the GPU (if there is only one), and thus
it might be better to have one GPU dedicated to X and another to
GPU computing. some applications, sometimes even little gadgets that
sit in the corner of your desktop and don't consume much "real estate"
may force a lot of screen updates and can significantly lower the
performance of the GPU in compute mode. i routinely switch my desktop
to textmode for any GPU benchmarks. again, it depends on the individual
setup and thus you'd have to test for yourself.
if somebody else would pitch in and provide additional numbers
and/or additional comments, it would help a lot to get a more
balanced view, since i only tested a few hardware configurations
and in a specific way that is important to our local projects.
> Lenovo D20 workstation and equip it with one or two Tesla C1060 cards?
> It would be great to hear other people's experiences and tips.
> On Thu, 27 Aug 2009, Ron Stubbs wrote:
> > Hi All,
> > I have a researcher interested in buying a dual quad core system with a
> > single C1060 Tesla card and would like to get some input from current
> > Tesla/NAMD users.
> > Does anyone have any experience and/or performance data using NAMD 2.7
> > on a single dual quad core system with a single Tesla card?
> > What I specifically want to know, is whether I can use all eight cores
> > with the GPU or would I need to scale back the number of cores to get
> > maximum throughput or possible add a second Tesla card. Is there any
> > rule of thumb for cores per GPU?
> > Thanks,
> > Ron
> > --
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Ron Stubbs MS CS
> > Senior Systems Programmer
> > Research Computing
> > Pratt School of Engineering
> > 1454A Fitzpatrick Center Box 90271
> > Duke University, Durham, N.C. 27708-0271
> > office: (919)660-5339 cell:(919)641-5689
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Gianluca Interlandi, PhD gianluca_at_u.washington.edu
> +1 (206) 685 4435
> +1 (206) 714 4303
> Postdoc at the Department of Bioengineering
> at the University of Washington, Seattle WA U.S.A.
--
Dr. Axel Kohlmeyer akohlmey_at_gmail.com
Research Associate Professor
Institute for Computational Molecular Science
College of Science and Technology
Temple University, Philadelphia PA, USA.