Re: GTX-660 Ti benchmark

From: Aron Broom (broomsday_at_gmail.com)
Date: Tue Sep 18 2012 - 11:41:47 CDT

Norman: that's true, I had forgotten entirely about the Quadros

Guanglei:

I had at one point used a 2x6 CPU-core machine with 4x M2070s. In terms of
overall speed, running one simulation with all 12 cores and all 4 GPUs gave
no improvement over 6 cores with 2 GPUs, which itself was barely faster
than 6 cores with 1 GPU. I suspect a lot of this was down to bandwidth:
I think all 4 of those cards had to share the same PCIe 2.0 connection.

I think in general, if you can run multiple simulations (i.e. you are doing
REMD or something similar), it's best to just use 1 GPU for each
simulation. The combined throughput of two simulations, each with 3 cores
and 1 GPU, was substantially faster than a single simulation with 6 cores
and 2 GPUs. I suspect this is because data is sent across the PCIe
connection at different times, so you actually get more out of it.
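
For example, launching two independent runs pinned to separate GPUs might
look like this (just a sketch: the input and log filenames are
hypothetical, and the +p and +devices values would need matching to your
own hardware):

./namd2 +idlepoll +p3 +devices 0 sim1.namd > sim1.log &
./namd2 +idlepoll +p3 +devices 1 sim2.namd > sim2.log &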

I imagine many of these issues will simply be irrelevant when/if a
completely GPU-only code is released, and then we might see some tremendous
things from the Kepler cards.

~Aron

On Tue, Sep 18, 2012 at 10:43 AM, Michael Galloway <gallowaymd_at_ornl.gov> wrote:

> interesting discussion, i too have a new single-gpu node similar to
> yours; i'd be interested in the details of benchmarking on this node as
> well.
>
> thanks for the interesting thread :-)
>
> --- michael
>
>
> On 09/18/2012 10:38 AM, Guanglei Cui wrote:
>
>> Hi Aron and Norman,
>>
>> Thanks for the additional insights. I guess this explains why I saw
>> slightly better performance on my Quadro 4000 than on the M2090.
>>
>> I guess for a small-scale operation (as opposed to a larger supercomputing
>> center), spending money on two M2090 cards doesn't make much sense. One
>> additional question ... for two M2090 cards in a single node (12 cores),
>> what's the best way of using them? In my experience, using the two
>> simultaneously doesn't seem to improve NAMD 2.9 (CUDA, multicore)
>> performance very much.
>>
>> Regards,
>> Guanglei
>>
>> On Tue, Sep 18, 2012 at 2:36 AM, Norman Geist
>> <norman.geist_at_uni-greifswald.de> wrote:
>>
>>> Hello,
>>>
>>>
>>>
>>> Just some comments:
>>>
>>>
>>>
>>> Nvidia's workstation series is called Quadro, so it's just wrong to call
>>> the professional HPC Tesla series workstation cards, and also to confuse
>>> them with consumer hardware. The workstation cards are consumer hardware
>>> too; the Tesla cards are non-consumer hardware.
>>>
>>>
>>>
>>> So:
>>>
>>>
>>>
>>> GTX - consumer - gaming
>>>
>>> Quadro - consumer - workstation
>>>
>>> Tesla - professional - HPC
>>>
>>>
>>>
>>> But I agree with the other points you mentioned. Of course the gaming
>>> cards have higher clocks and therefore better performance, as they are
>>> meant for gamers, who don't care about power consumption and heat
>>> emission. Also, ECC slows the Tesla down a little. But a professional
>>> computing centre can't use these overclocked gaming cards, given the
>>> heavy cooling they need and their lack of administration features. Of
>>> course for a few nodes only, or a workstation, it's OK to stay with
>>> consumer hardware; in a professional setting they are not the best
>>> choice, IMHO.
>>>
>>>
>>>
>>> Regards
>>>
>>> Norman Geist.
>>>
>>>
>>>
>>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
>>> On behalf of Aron Broom
>>> Sent: Tuesday, September 18, 2012 03:57
>>> To: Guanglei Cui
>>> Cc: namd-l_at_ks.uiuc.edu
>>> Subject: Re: namd-l: GTX-660 Ti benchmark
>>>
>>>
>>>
>>> Guanglei,
>>>
>>> Just a quick point about cards: keep in mind that the very expensive
>>> workstation cards aren't actually any faster than their consumer
>>> counterparts. For instance, comparing a GTX 580 with an M2090: the 580
>>> has the same number of cores and actually faster clock and memory
>>> speeds. The M2090 has more memory, and that memory has error-correcting
>>> code, hence the extra bucks. For the Kepler series (I'm not sure the
>>> workstation cards are out yet?) the consumer cards will also be faster
>>> than the workstation ones, at least in terms of single precision, but I
>>> think it's supposed to be the reverse for double precision.
>>>
>>> ~Aron
>>>
>>> On Mon, Sep 17, 2012 at 4:35 PM, Guanglei Cui
>>> <amber.mail.archive_at_gmail.com> wrote:
>>>
>>> Hi Jason and Thomas,
>>>
>>> Thanks very much for your input. This is very useful, as I was
>>> struggling to gauge my expectations for the GPU workstation we have,
>>> since I had no point of comparison. It seems Jason may have a similar
>>> hardware setup. The OS installed here is CentOS 5.8; I'm not sure if
>>> that matters.
>>>
>>> Thomas, if your timing was from 1 GPU and 1 CPU, I'd be thoroughly
>>> upset, because that is almost twice as fast as I could get on a much
>>> more expensive card. Would you be able to share additional information
>>> on your OS and any configuration settings that matter?
>>>
>>> Regards,
>>> Guanglei
>>>
>>>
>>> On Sun, Sep 16, 2012 at 6:08 PM, Roberts, Jason
>>> <Jason.Roberts_at_mh.org.au> wrote:
>>>
>>>> Hi Guanglei,
>>>>
>>>> We are running a 2U rack node (2x Xeon E5645, 4x M2090), and although I
>>>> don't have the same setup, I ran the ApoA1 benchmark allocating 6 cores
>>>> and 1 M2090 (./namd2 +idlepoll +p6 +devices 0 apoa1.namd > apoa1_6.out).
>>>> The default benchmark gave 0.049 s/step. I changed the outputEnergies
>>>> and outputTiming values to 1000 and extended the run to 10,000 steps
>>>> (see the sketch below) and got 0.038 s/step.
>>>>
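>>>> The config changes were presumably along these lines (a sketch using
>>>> standard NAMD keywords, not Jason's exact file):
>>>>
>>>> # print energies/timings less often, so output I/O doesn't skew the benchmark
>>>> outputEnergies 1000
>>>> outputTiming 1000
>>>> # run longer, so the per-step average settles
>>>> numsteps 10000
>>>>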
>>>> If I run the last simulation with 1 core and 1 GPU (./namd2 +idlepoll
>>>> +p1
>>>> +devices 0 apoa1.namd > apoa1_1.out) I get 0.122 s/step.
>>>>
>>>> Hope this helps.
>>>>
>>>> PS, if anyone is interested, I ran multiple simultaneous runs with
>>>> different combinations of CPU and GPU allocations and obtained the
>>>> following
>>>> results:
>>>>
>>>> ApoA1 (10,000 steps, timestep = 1, output every 1000 steps)
>>>> 1 run (12x threads, 4x M2090) = 0.015 s/step
>>>> 1 run (24x threads, 4x M2090) = 0.016 s/step
>>>> 2 runs (6x threads, 2x M2090 each) = 0.027 s/step
>>>> 2 runs (12x threads, 4x M2090 shared) = 0.026 s/step
>>>> 4 runs (3x threads, 1x M2090 each) = 0.051 s/step
>>>> 4 runs (6x threads, 4x M2090 shared) = 0.046 s/step
>>>> 8 runs (3x threads, 4x M2090 shared) = 0.088 s/step
>>>>
>>>> (Hyperthreading is ON)
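>>>>
>>>> (For the shared-GPU runs, the launch was presumably something like:
>>>> ./namd2 +idlepoll +p12 +devices 0,1,2,3 apoa1.namd. The exact invocation
>>>> is assumed; NAMD's +devices flag takes a comma-separated list of GPU
>>>> IDs, which the worker threads then share.)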
>>>>
>>>> Cheers,
>>>>
>>>> Jason A. Roberts
>>>> Senior Medical Scientist
>>>> National Enterovirus Reference Laboratory
>>>> WHO Poliomyelitis Regional Reference Laboratory
>>>> VIDRL, 10 Wreckyn Street,
>>>> North Melbourne, Australia, 3051
>>>> Phone: +613 9342 2607
>>>> Fax: +613 9342 2665
>>>> email: polio_at_mh.org.au (lab enquiries)
>>>> web site: www.vidrl.org.au
>>>>
>>>> Date: Fri, 14 Sep 2012 09:50:41 -0400
>>>> From: Guanglei Cui <amber.mail.archive_at_gmail.com>
>>>> Subject: Re: namd-l: GTX-660 Ti benchmark
>>>>
>>>> Hi,
>>>>
>>>> I'm curious what kind of performance I should expect from an M2090 card
>>>> (Intel Xeon X5670, CentOS 5.8). With 1 CPU and 1 GPU, I get 0.11 s/step
>>>> on ApoA1 (2000 steps, timestep 1) using the NAMD 2.9 multicore CUDA
>>>> binary from the NAMD website. I suspect this is a reasonable speed. I
>>>> wonder if someone would kindly point out what a reasonable expectation
>>>> is for this type of setup, and how to achieve it. Thanks very much.
>>>>
>>>> Guanglei
>>>>
>>>> On Thu, Sep 13, 2012 at 11:10 PM, Wenyu Zhong <wenyuzhong_at_gmail.com>
>>>> wrote:
>>>>
>>>>> Sorry, a correction.
>>>>>
>>>>> The power consumption with the i5 @ 3.7 GHz + 660 Ti running ApoA1 is
>>>>> about 200 W, and with the i5 @ 3.7 GHz + 2x GTX 460 it is about 260 W.
>>>>>
>>>>> Wenyu
>>>>>
>>>>
>>>>
>>>> --
>>>> Guanglei Cui
>>>>
>>>>
>>>>
>>> --
>>> Guanglei Cui
>>>
>>>
>>>
>>>
>>> --
>>> Aron Broom M.Sc
>>> PhD Student
>>> Department of Chemistry
>>> University of Waterloo
>>>
>>
>>
>>
>

-- 
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo
