Re: Tesla Utilization -- NAMD 2.7b1

From: Gianluca Interlandi
Date: Fri Aug 28 2009 - 11:13:13 CDT

Thank you Axel for taking the time and providing a detailed description of
your benchmarks.


On Fri, 28 Aug 2009, Axel Kohlmeyer wrote:

> On Thu, 2009-08-27 at 17:03 -0700, Gianluca Interlandi wrote:
>> Thank you Ron for posting this.
> ron, gianluca,
> since nobody else seems to have the courage to step up,
> a few comments from my personal tests.
> running the GPU version of NAMD requires some special care.
> it is not like you just plug a GPU into your machine and
> you outrun any supercomputer. in NAMD currently only the
> non-bonded force calculation is offloaded to the GPU. also,
> you currently cannot mix and match GPU and non-GPU non-bonded
> kernels, but you can oversubscribe GPUs unless the nvidia
> driver is configured for compute-exclusive mode.
> with one NAMD task and one high-end GPU (Tesla C1060/S1070,
> GTX 285/275) i have seen between no and about 5x speedup.
> using a second task on the same GPU or a second GPU has never
> worked reliably for me, but that may be due to reasons outside
> of NAMD, i have never had the time to track that down. from
> intermediate timings, i can see that oversubscribing gives
> a smaller, but still some additional speedup.
> as for hardware choice. there are two key elements:
> the memory bandwidth and the PCIe speed. AMD opteron-like
> or intel nehalem-like (i7-9x0, xeon X55x0) processors
> have 2-3x more memory bandwidth than intel core 2 and
> earlier. for a single GPU this is not making a massive
> difference, but still noticeable. when using multiple GPUs
> it is essential. also processor and memory affinity need
> to be exploited for optimal performance. so intel nehalem
> would currently be the best choice in terms of memory bandwidth.
> for the PCIe bus, a generation 2.0 slot with 16 lanes is
> required to get decent performance. the intel i/o hub offers
> 36 PCIe lanes, and now you'd have to check how they are wired.
> ideally you want two 16x slots and one 4x slot. there are also
> dual CPU boards with two i/o hubs and four 16x PCIe slots,
> which makes 8 GPUs (4x GTX 295 or 2x Tesla S1070) the maximum
> configuration, but here you have to keep in mind that in this
> setup each pair of GPUs would share a PCIe slot behind a PCIe
> bridge. nevertheless, this seems to be the configuration
> providing the most GPU performance in a single node. however,
> i know of some tests where people tried that, and it turned
> out that the mainboards had additional constraints on how the
> PCIe lanes are used, reducing one of the nominally 16x slots
> to 4x speed if a 4x speed device is used in the 4x slot.
> also there are mainboards that use 16x connectors, but have
> them wired as 4x. looking at the fine print in the specs is
> highly recommended.
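Besides reading the board specs, the negotiated link width can be checked on a running Linux machine. The following is a minimal sketch, assuming the output of `lspci -vv` is available as text and that it reports `LnkCap:` (what the device supports) and `LnkSta:` (what was actually negotiated) with a `Width xN` field. The sample text and the helper names `link_widths` and `narrowed` are made up for illustration and are not taken from the thread.

```python
import re

# Made-up sample in the style of `lspci -vv` output (device names are placeholders).
SAMPLE = """\
02:00.0 VGA compatible controller: hypothetical GPU A
        LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1
        LnkSta: Speed 5GT/s, Width x16
03:00.0 3D controller: hypothetical GPU B
        LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1
        LnkSta: Speed 2.5GT/s, Width x4
"""

# A device header line starts in column 0 with a bus address such as 02:00.0
HEADER = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+(.*)$")
WIDTH = re.compile(r"Width x(\d+)")


def link_widths(text):
    """Return one dict per device with its supported and negotiated link width."""
    devices = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = {"slot": header.group(1), "name": header.group(2),
                       "cap": None, "sta": None}
            devices.append(current)
            continue
        if current is None:
            continue
        stripped = line.strip()
        for key, tag in (("cap", "LnkCap:"), ("sta", "LnkSta:")):
            if stripped.startswith(tag):
                width = WIDTH.search(stripped)
                if width:
                    current[key] = int(width.group(1))
    return devices


def narrowed(devices):
    """Devices whose negotiated width is below what the device supports."""
    return [d for d in devices if d["cap"] and d["sta"] and d["sta"] < d["cap"]]


if __name__ == "__main__":
    for dev in narrowed(link_widths(SAMPLE)):
        print(f"{dev['slot']}: x{dev['sta']} negotiated, x{dev['cap']} supported")
```

On a real machine the text would come from running `lspci -vv` (usually as root, otherwise the link fields may be hidden). A card that supports x16 but reports x4 in `LnkSta` is a hint that the slot is wired or configured narrower than its connector suggests.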
> now comparing a single core with and without GPU is not really
> a realistic comparison as it doesn't factor in the communication
> overhead when running NAMD with multiple tasks. so here are some
> numbers i recently was able to generate on a supermicro dual
> nehalem 1U node with two tesla M1060 cards. this is with gcc/g++
> compiled binaries, the equivalent intel icc/icpc binaries should
> be a bit faster in CPU mode. this is a test system with about
> 12,000 atoms with PBC, full electrostatics etc... that gave
> fairly good speedup with NAMD-gpu.
> using one cpu (core):
> WallClock: 762.491333 CPUTime: 760.654358
> Benchmark time: 1 CPUs 0.144077 s/step 0.833778 days/ns 44.4903 MB
> using one cpu and one gpu:
> WallClock: 195.666748 CPUTime: 195.567276
> Benchmark time: 1 CPUs 0.0341225 s/step 0.197468 days/ns 30.2373 MB
> using four cpu (cores):
> WallClock: 243.656143 CPUTime: 233.590485
> Benchmark time: 4 CPUs 0.0386488 s/step 0.223662 days/ns 19.7549 MB
> using eight cpu (cores):
> WallClock: 134.973953 CPUTime: 130.455170
> Benchmark time: 8 CPUs 0.0210909 s/step 0.122054 days/ns 15.5645 MB
> so 1 GPU / 1Core beats using 4 cores, but not using 8 cores
> of 2.8GHz nehalems. there were indications that using both GPUs
> with two cores (if i had gotten it to work), would be a tiny bit
> faster than using all 8 cores and 2x oversubscription would push
> it a little bit further.
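The speedups behind this comparison can be worked out directly from the "Benchmark time" lines quoted above. This is a small sketch, not part of NAMD: the labels and the function names `parse_benchmark`, `speedups` and `steps_per_ns` are chosen here for illustration, and the regular expression only assumes the line layout shown in this thread.

```python
import re

# Benchmark lines copied from the quoted NAMD output; labels added for clarity.
BENCH = {
    "1 core": "Benchmark time: 1 CPUs 0.144077 s/step 0.833778 days/ns 44.4903 MB",
    "1 core + 1 GPU": "Benchmark time: 1 CPUs 0.0341225 s/step 0.197468 days/ns 30.2373 MB",
    "4 cores": "Benchmark time: 4 CPUs 0.0386488 s/step 0.223662 days/ns 19.7549 MB",
    "8 cores": "Benchmark time: 8 CPUs 0.0210909 s/step 0.122054 days/ns 15.5645 MB",
}

PATTERN = re.compile(
    r"Benchmark time:\s+(\d+)\s+CPUs\s+([\d.]+)\s+s/step\s+([\d.]+)\s+days/ns"
)


def parse_benchmark(line):
    """Return (cpus, seconds_per_step, days_per_ns) from one benchmark line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a benchmark line: {line!r}")
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


def speedups(bench, baseline="1 core"):
    """Speedup of every run relative to the baseline, based on s/step."""
    base = parse_benchmark(bench[baseline])[1]
    return {label: base / parse_benchmark(line)[1] for label, line in bench.items()}


def steps_per_ns(line):
    """Number of MD steps per simulated nanosecond implied by one line."""
    _, s_per_step, days_per_ns = parse_benchmark(line)
    return days_per_ns * 86400.0 / s_per_step


if __name__ == "__main__":
    for label, factor in speedups(BENCH).items():
        print(f"{label:>15}: {factor:.2f}x")
    print(f"steps per ns: {steps_per_ns(BENCH['1 core']):.0f}")
```

Relative to one core this gives roughly 4.2x for one core plus one GPU, 3.7x for four cores and 6.8x for eight cores. The s/step and days/ns columns are consistent with about 500,000 steps per nanosecond, which would correspond to a 2 fs timestep; that timestep is inferred from the numbers and is not stated in the thread.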
> other, larger inputs i was testing had a little bit less speedup.
> so on average it _currently_ looks like for NAMD, the best improvement
> from adding a GPU can be had on an older single/dual core CPU
> with a full 16x PCIe bus and a cost efficient GTX 275/285 card.
> if you _do_ want to go for a multi-GPU environment (and don't
> mind the cost), i would try to get a 1:1 ratio of cpu cores and
> GPUs, this is the best way to have really dense compute power.
> and those machines seem to work very well for multiple 1core/1GPU
> jobs where the degradation of memory and PCIe bus due to multiple
> accesses is less of an issue as the jobs will more-or-less automatically
> balance themselves to optimally interleave. i got only about 10%
> penalty from running 4 concurrent 1core/1gpu jobs on a dual harpertown
> node with four gpus (a full tesla S1070) connected via 2x 8-lane PCIe
> on a mainboard with confirmed bad PCIe/memory bus performance.
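To see why several independent 1core/1GPU jobs can be attractive, the aggregate throughput can be estimated with a few lines of arithmetic. This is only a sketch under stated assumptions: the "10% penalty" is read here as each job losing 10% of its throughput when all jobs run at once, and the function names are made up for this example.

```python
def ns_per_day(days_per_ns):
    """Convert NAMD's days/ns figure into ns/day."""
    if days_per_ns <= 0:
        raise ValueError("days_per_ns must be positive")
    return 1.0 / days_per_ns


def aggregate_ns_per_day(days_per_ns_single, n_jobs, penalty):
    """Total ns/day over n_jobs independent jobs.

    penalty is the assumed fractional throughput loss per job
    (0.10 for 10%) caused by running the jobs concurrently.
    """
    if not 0.0 <= penalty < 1.0:
        raise ValueError("penalty must be in [0, 1)")
    return n_jobs * ns_per_day(days_per_ns_single) * (1.0 - penalty)


if __name__ == "__main__":
    # Illustration only: the single-job figure (0.197468 days/ns) comes from the
    # nehalem/M1060 benchmark, the 10% penalty from the harpertown/S1070 node.
    total = aggregate_ns_per_day(0.197468, n_jobs=4, penalty=0.10)
    print(f"4 concurrent 1core/1GPU jobs: {total:.2f} ns/day in total")
    print(f"one 8-core CPU job:           {ns_per_day(0.122054):.2f} ns/day")
```

Because the two input numbers were measured on different machines, the result illustrates the calculation and is not a measurement. It also counts total simulated time over four separate trajectories, which is not the same as making a single trajectory run faster.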
> this may change drastically with future improvements of the GPU
> hardware. i don't know any details, but you can search the web
> and see what the rumor mills have produced and if you take away
> a bit from those speculations, you probably have a good low
> estimate on what the next generation GPU hardware can produce.
>> I am also interested in the same question. Also, I would like to know more
>> precisely what brand and model of motherboard and what brand of CPU do
>> people have experience with. Does it make sense to purchase for example a
> please don't ask (me) about any specific advice about a specific
> brand or model of hardware. you have to make benchmarks for yourself.
> lots of details matter. this is a bit of the frontier of HPC and
> it is impossible to keep track of everything.
> one final remark. if you want to boost the performance of NAMD on
> a desktop by using the GPU, please note that the X server and
> NAMD would compete for the GPU (if there is only one) and thus
> it might be better to have a GPU dedicated for X and another for
> GPU computing. some applications, sometimes even little gadgets that
> sit in the corner of your desktop and don't consume much "real estate"
> may force a lot of screen updates and can significantly lower the
> performance of the GPU in compute mode. i routinely switch my desktop
> to textmode for any GPU benchmarks. again, it depends on the individual
> setup and thus you'd have to test for yourself.
> if somebody else would pitch in and provide additional numbers
> and/or additional comments it would help a lot to have a more
> balanced view, since i only tested a few hardware configurations
> and in a specific way that is important to our local projects.
> cheers,
> axel.
>> Lenovo D20 workstation and equip it with one or two Tesla C1060 cards?
>> It would be great to hear other people's experiences and tips.
>> Thanks,
>> Gianluca
>> On Thu, 27 Aug 2009, Ron Stubbs wrote:
>>> Hi All,
>>> I have a researcher interested in buying a dual quad core system with a
>>> single C1060 Tesla card and would like to get some input from current
>>> Tesla/NAMD users.
>>> Does anyone have any experience and/or performance data using NAMD 2.7
>>> on a single dual quad core system with a single Tesla card?
>>> What I specifically want to know, is whether I can use all eight cores
>>> with the GPU or would I need to scale back the number of cores to get
>>> maximum throughput or possibly add a second Tesla card. Is there any
>>> rule of thumb for cores per GPU?
>>> Thanks,
>>> Ron
>>> --
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> Ron Stubbs MS CS
>>> Senior Systems Programmer
>>> Research Computing
>>> Pratt School of Engineering
>>> 1454A Fitzpatrick Center Box 90271
>>> Duke University, Durham, N.C. 27708-0271
>>> office: (919)660-5339 cell:(919)641-5689
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> -----------------------------------------------------
>> Gianluca Interlandi, PhD
>> +1 (206) 685 4435
>> +1 (206) 714 4303
>> Postdoc at the Department of Bioengineering
>> at the University of Washington, Seattle WA U.S.A.
>> -----------------------------------------------------
> --
> Dr. Axel Kohlmeyer
> Research Associate Professor
> Institute for Computational Molecular Science
> College of Science and Technology
> Temple University, Philadelphia PA, USA.

Gianluca Interlandi, PhD
                     +1 (206) 685 4435
                     +1 (206) 714 4303

Postdoc at the Department of Bioengineering
at the University of Washington, Seattle WA U.S.A.
