Re: Tesla Utilization -- NAMD 2.7b1

From: Roman Petrenko
Date: Fri Aug 28 2009 - 11:49:18 CDT

Axel, thanks for the explanations.
I should add that with a 9800 GTX GPU I also came to the conclusion that NAMD on
1 GPU + 1 core performs similarly to a quad-core CPU.

On Fri, Aug 28, 2009 at 12:13 PM, Gianluca Interlandi wrote:
> Thank you Axel for taking the time and providing a detailed description of
> your benchmarks.
> Gianluca
> On Fri, 28 Aug 2009, Axel Kohlmeyer wrote:
>> On Thu, 2009-08-27 at 17:03 -0700, Gianluca Interlandi wrote:
>>> Thank you Ron for posting this.
>> ron, gianluca,
>> since nobody else seems to have the courage to step up,
>> a few comments from my personal tests.
>> running the GPU version of NAMD requires some special care.
>> it is not like you just plug a GPU into your machine and
>> you outrun any supercomputer. in NAMD currently only the
>> non-bonded force calculation is offloaded to the GPU. also,
>> you currently cannot mix and match GPU and non-GPU non-bonded
>> kernels, but you can oversubscribe GPUs unless the nvidia
>> driver is configured for compute-exclusive mode.
>> with one NAMD task and one high-end GPU (Tesla C1060/S1070,
>> GTX 285/275) i have seen between no and about 5x speedup.
>> using a second task on the same GPU or a second GPU has never
>> worked reliably for me, but that may be due to reasons outside
>> of NAMD, i have never had the time to track that down. from
>> intermediate timings, i can see that oversubscribing gives
>> a smaller, but still some additional speedup.
>> as for hardware choice. there are two key elements:
>> the memory bandwidth and the PCIe speed. AMD opteron-like
>> or intel nehalem-like (i7-9x0, xeon X55x0) processors
>> have 2-3x more memory bandwidth than intel core 2 and
>> earlier. for a single GPU this is not making a massive
>> difference, but still noticeable. when using multiple GPUs
>> it is essential. also processor and memory affinity need
>> to be exploited for optimal performance. so intel nehalem
>> would currently be the best choice in terms of memory bandwidth.
>> for the PCIe bus a generation 2.0 slot with 16 lanes is
>> required to get decent performance. the intel i/o hub offers
>> 36 PCIe lanes, and now you'd have to check how they are wired.
>> ideally you want two 16x slots and one 4x slot. there are also
>> dual CPU boards with two i/o hubs and four 16x PCIe slots,
>> making adding 8 GPUs (4x GTX 295 or 2x Tesla S1070) the maximum
>> configuration, but here you have to keep in mind that in this
>> setup two GPUs would each share a PCIe slot behind a PCIe
>> bridge. nevertheless, this seems to be the configuration
>> providing the most GPU performance in a single node. however,
>> i know of some tests where people tried that, and it turned
>> out that mainboards had additional constraints on how the
>> PCIe lanes are used, reducing one of the nominally 16x slots
>> to 4x speed if a 4x speed device is used in the 4x slot.
>> also there are mainboards that use 16x connectors, but have
>> them wired as 4x. looking at the fine print in the specs is
>> highly recommended.
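
One way to check how a slot is actually wired, beyond reading the spec sheet, is to compare the negotiated PCIe link width with the width the card supports. The following is a minimal sketch, assuming a Linux kernel recent enough to expose the `current_link_width` and `max_link_width` sysfs attributes; the function names are made up for illustration and nothing here comes from NAMD or from the thread.

```python
from pathlib import Path


def read_int(path):
    """Return the integer stored in a sysfs attribute file, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def find_narrow_links(sysfs_root="/sys/bus/pci/devices"):
    """List (device, current_width, max_width) for links running below the
    width the device itself supports.

    Assumption: each device directory exposes 'current_link_width' and
    'max_link_width'; devices without them are skipped.
    """
    narrow = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return narrow
    for dev in sorted(root.iterdir()):
        current = read_int(dev / "current_link_width")
        maximum = read_int(dev / "max_link_width")
        if current is None or maximum is None:
            continue  # not a PCIe device, or attribute not exposed
        if 0 < current < maximum:
            narrow.append((dev.name, current, maximum))
    return narrow
```

Calling `find_narrow_links()` on the machine in question returns an empty list if every device runs at its full width. A 16x card reported at width 4 would be consistent with a slot that has a 16x connector but is wired as 4x, though this is only a hint and not a substitute for a bandwidth benchmark.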
>> now comparing a single core with and without GPU is not really
>> a realistic comparison as it doesn't factor in the communication
>> overhead when running NAMD with multiple tasks. so here are some
>> numbers i recently was able to generate on a supermicro dual
>> nehalem 1U node with two tesla M1060 cards. this is with gcc/g++
>> compiled binaries, the equivalent intel icc/icpc binaries should
>> be a bit faster in CPU mode. this is a test system with about
>> 12,000 atoms with PBC, full electrostatics etc... that gave
>> fairly good speedup with NAMD-gpu.
>> using one cpu (core):
>> WallClock: 762.491333  CPUTime: 760.654358
>> Benchmark time: 1 CPUs 0.144077 s/step 0.833778 days/ns 44.4903 MB
>> using one cpu and one gpu:
>> WallClock: 195.666748  CPUTime: 195.567276
>> Benchmark time: 1 CPUs 0.0341225 s/step 0.197468 days/ns 30.2373 MB
>> using four cpu (cores):
>> WallClock: 243.656143  CPUTime: 233.590485
>> Benchmark time: 4 CPUs 0.0386488 s/step 0.223662 days/ns 19.7549 MB
>> using eight cpu (cores):
>> WallClock: 134.973953  CPUTime: 130.455170
>> Benchmark time: 8 CPUs 0.0210909 s/step 0.122054 days/ns 15.5645 MB
>> so 1 GPU / 1Core beats using 4 cores, but not using 8 cores
>> of 2.8GHz nehalems. there were indications that using both GPUs
>> with two cores (if i had gotten it to work), would be a tiny bit
>> faster than using all 8 cores and 2x oversubscription would push
>> it a little bit further.
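
The comparison above can be reproduced from the s/step values in the quoted "Benchmark time" lines. The sketch below computes speedup relative to the single-core run and the parallel efficiency of the CPU-only runs. It also recomputes days/ns under the assumption of a 2 fs timestep; the timestep is not stated in the thread, it is only the value that happens to reproduce the reported days/ns figures.

```python
# s/step values copied from the "Benchmark time" lines quoted above.
SECONDS_PER_STEP = {
    "1 core": 0.144077,
    "1 core + 1 GPU": 0.0341225,
    "4 cores": 0.0386488,
    "8 cores": 0.0210909,
}
CORES = {"1 core": 1, "1 core + 1 GPU": 1, "4 cores": 4, "8 cores": 8}

ASSUMED_TIMESTEP_FS = 2.0  # assumption: not stated in the thread


def speedup(baseline_s, measured_s):
    """How many times faster a run is than the baseline."""
    return baseline_s / measured_s


def days_per_ns(seconds_per_step, timestep_fs):
    """Convert s/step into wall-clock days per simulated nanosecond."""
    steps_per_ns = 1.0e6 / timestep_fs  # 1 ns = 1e6 fs
    return seconds_per_step * steps_per_ns / 86400.0


baseline = SECONDS_PER_STEP["1 core"]
for label, s_per_step in SECONDS_PER_STEP.items():
    s = speedup(baseline, s_per_step)
    line = f"{label:15s} speedup {s:5.2f}x"
    if "GPU" not in label:
        # Parallel efficiency only makes sense for the CPU-only runs.
        line += f"  efficiency {100.0 * s / CORES[label]:5.1f}%"
    line += f"  {days_per_ns(s_per_step, ASSUMED_TIMESTEP_FS):.6f} days/ns"
    print(line)
```

This gives roughly 4.2x for one core plus one GPU, 3.7x for four cores and 6.8x for eight cores, which matches the ordering described above.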
>> other, larger inputs i was testing, had a little bit less speedup.
>> so on average it _currently_ looks like for NAMD, the best improvement
>> from adding a GPU can be had on an older single/dual core CPU
>> with a full 16x PCIe bus and a cost efficient GTX 275/285 card.
>> if you _do_ want to go for a multi-GPU environment (and don't
>> mind the cost), i would try to get a 1:1 ratio of cpu cores and
>> GPUs, this is the best way to have really dense compute power.
>> and those machines seem to work excellently for multiple 1core/1GPU
>> jobs where the degradation of memory and PCIe bus due to multiple
>> accesses is less of an issue as the jobs will more-or-less automatically
>> balance themselves to optimally interleave. i got only about 10%
>> penalty from running 4 concurrent 1core/1gpu jobs on a dual harpertown
>> node with four gpus (a full tesla S1070) connected via 2x 8-lane PCIe
>> on a mainboard with a confirmed bad PCIe/memory bus performance.
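
To put the roughly 10% penalty into perspective, here is a back-of-the-envelope sketch of aggregate throughput for concurrent independent jobs. It assumes "penalty" means each job takes about 10% longer than it would alone on the node; the thread does not define the term more precisely, and the function name is hypothetical.

```python
def aggregate_throughput(n_jobs, penalty):
    """Throughput of n_jobs concurrent jobs, in units of one unpenalized job.

    Assumption: each concurrent job runs (1 + penalty) times longer than
    it would if it had the node to itself.
    """
    return n_jobs / (1.0 + penalty)


# Four concurrent 1core/1gpu jobs with a ~10% slowdown each.
print(f"{aggregate_throughput(4, 0.10):.2f}x the throughput of a single job")
```

Under that reading, four concurrent jobs deliver a bit more than 3.6 times the throughput of one job run alone.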
>> this may change drastically with future improvements of the GPU
>> hardware. i don't know any details, but you can search the web
>> and see what the rumor mills have produced and if you take away
>> a bit from those speculations, you probably have a good low
>> estimate on what the next generation GPU hardware can produce.
>>> I am also interested in the same question. Also, I would like to know
>>> more precisely what brand and model of motherboard and what brand of CPU do
>>> people have experience with. Does it make sense to purchase for example a
>> please don't ask (me) about any specific advice about a specific
>> brand or model of hardware. you have to make benchmarks for yourself.
>> lots of details matter. this is a bit of the frontier of HPC and
>> it is impossible to keep track of everything.
>> one final remark. if you want to boost the performance of NAMD on
>> a desktop by using the GPU, please note that the X server and
>> NAMD would compete for the GPU (if there is only one) and thus
>> it might be better to have a GPU dedicated for X and another for
>> GPU computing. some applications, sometimes even little gadgets that
>> sit in the corner of your desktop and don't consume much "real estate"
>> may force a lot of screen updates and can significantly lower the
>> performance of the GPU in compute mode. i routinely switch my desktop
>> to textmode for any GPU benchmarks. again, it depends on the individual
>> setup and thus you'd have to test for yourself.
>> if somebody else would pitch in and provide additional numbers
>> and/or additional comments it would help a lot to have a more
>> balanced view, since i only tested a few hardware configurations
>> and in a specific way that is important to our local projects.
>> cheers,
>>   axel.
>>> Lenovo D20 workstation and equip it with one or two Tesla C1060 cards?
>>> It would be great to hear other people's experiences and tips.
>>> Thanks,
>>>      Gianluca
>>> On Thu, 27 Aug 2009, Ron Stubbs wrote:
>>>> Hi All,
>>>> I have a researcher interested in buying a dual quad core system with a
>>>> single C1060 Tesla card and would like to get some input from current
>>>> Tesla/NAMD users.
>>>> Does anyone have any experience and/or performance data using NAMD 2.7
>>>> on a single dual quad core system with a single Tesla card?
>>>> What I specifically want to know, is whether I can use all eight cores
>>>> with the GPU or would I need to scale back the number of cores to get
>>>> maximum throughput or possible add a second Tesla card. Is there any
>>>> rule of thumb for cores per GPU?
>>>> Thanks,
>>>> Ron
>>>> --
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>> Ron Stubbs MS CS
>>>> Senior Systems Programmer
>>>> Research Computing
>>>> Pratt School of Engineering
>>>> 1454A Fitzpatrick Center         Box 90271
>>>> Duke University,        Durham, N.C. 27708-0271
>>>> office: (919)660-5339   cell:(919)641-5689
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> -----------------------------------------------------
>>> Gianluca Interlandi, PhD
>>>                     +1 (206) 685 4435
>>>                     +1 (206) 714 4303
>>> Postdoc at the Department of Bioengineering
>>> at the University of Washington, Seattle WA U.S.A.
>>> -----------------------------------------------------
>> --
>> Dr. Axel Kohlmeyer
>> Research Associate Professor
>> Institute for Computational Molecular Science
>> College of Science and Technology
>> Temple University, Philadelphia PA, USA.

Roman Petrenko
Physics Department
University of Cincinnati
