Re: GTX-660 Ti benchmark

From: Guanglei Cui (amber.mail.archive_at_gmail.com)
Date: Tue Sep 18 2012 - 09:38:39 CDT

Hi Aron and Norman,

Thanks for the additional insights. I guess this explains why I saw
slightly better performance on my Quadro 4000 than on the M2090.

I guess for a small-scale operation (as opposed to a larger
supercomputing center), spending money on two M2090 cards doesn't make
much sense. One additional question: for two M2090 cards in a single
node (12 cores), what is the best way to use them? In my experience,
using both simultaneously doesn't improve NAMD 2.9 (CUDA, multicore)
performance very much.

Regards,
Guanglei

On Tue, Sep 18, 2012 at 2:36 AM, Norman Geist
<norman.geist_at_uni-greifswald.de> wrote:
> Hello,
>
> Just some comments:
>
> NVIDIA’s workstation series is called Quadro, so it’s wrong to call the
> professional HPC Tesla series workstation cards, or to confuse them
> with consumer hardware. The workstation cards are also consumer hardware;
> the Tesla cards are non-consumer hardware.
>
> So:
>
> GTX    - consumer     - gaming
> Quadro - consumer     - workstation
> Tesla  - professional - HPC
>
> But I agree with the other points you mentioned. Of course the gaming
> cards have higher clocks and therefore better performance, as they are
> meant for gaming, where people don’t care about power consumption and
> heat emission. ECC also slows the Tesla a little. But a professional
> computing centre can’t use these overclocked gaming cards without heavy
> cooling, and they lack administration features. Of course, for just a
> few nodes or a workstation it’s OK to stay with consumer hardware; in
> a professional setting they are not the best choice, IMHO.
>
> Regards
>
> Norman Geist.
>
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
> Of Aron Broom
> Sent: Tuesday, 18 September 2012 03:57
> To: Guanglei Cui
> Cc: namd-l_at_ks.uiuc.edu
> Subject: Re: namd-l: GTX-660 Ti benchmark
>
>
>
> Guanglei,
>
> Just a quick point to make about cards: keep in mind that the very expensive
> workstation cards aren't actually any faster than their consumer counterparts.
> For instance, comparing a GTX580 with an M2090, the 580 has the same number
> of cores and actually faster clock and memory speeds. The M2090 has more
> memory, and that memory has error-correcting code, hence the extra bucks.
> For the Kepler series (I'm not sure the workstation cards are out yet?) the
> consumer cards will also be faster than the workstation ones, at least in
> terms of single precision, but I think it's supposed to be the reverse for
> double precision.
>
> ~Aron
>
> On Mon, Sep 17, 2012 at 4:35 PM, Guanglei Cui <amber.mail.archive_at_gmail.com>
> wrote:
>
> Hi Jason and Thomas,
>
> Thanks very much for your input. This is very useful, as I was
> struggling to gauge my expectations for the GPU workstation we have,
> since I had nothing to compare it against. It seems Jason may have a
> similar hardware setup. The OS installed here is CentOS 5.8. I'm not
> sure if this matters.
>
> Thomas, if your timing was from 1GPU/1CPU, I'd be thoroughly upset
> 'cause that is almost twice as fast as I could get on a much more
> expensive card. Would you be able to share additional information on
> your OS and any configurations that matter?
>
> Regards,
> Guanglei
>
>
> On Sun, Sep 16, 2012 at 6:08 PM, Roberts, Jason <Jason.Roberts_at_mh.org.au>
> wrote:
>> Hi Guanglei,
>>
>> We are running a 2U rack (2x Xeon E5645, 4x M2090). Although I don't
>> have the same setup, I ran the ApoA1 benchmark with 6 cores and
>> 1 M2090 allocated
>> (./namd2 +idlepoll +p6 +devices 0 apoa1.namd > apoa1_6.out). The default
>> benchmark gave 0.049 s/step. I changed the outputEnergies and outputTiming
>> values to 1000, extended the run to 10000 steps, and got 0.038 s/step.
>>
>> If I run the last simulation with 1 core and 1 GPU
>> (./namd2 +idlepoll +p1 +devices 0 apoa1.namd > apoa1_1.out), I get
>> 0.122 s/step.
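
To compare the s/step figures above with numbers reported elsewhere, it can
help to convert them to ns/day. The following is only a minimal sketch: it
assumes a 1 fs timestep (the value stated for the ApoA1 runs in this thread),
and the function name ns_per_day is a hypothetical helper, not part of NAMD.

```python
# Convert NAMD benchmark timings (wall-clock seconds per step) to ns/day.
# Assumption: 1 fs timestep, as stated for the ApoA1 runs in this thread.

SECONDS_PER_DAY = 86_400
FS_PER_NS = 1_000_000


def ns_per_day(seconds_per_step, timestep_fs=1.0):
    """Simulated nanoseconds per wall-clock day (hypothetical helper)."""
    steps_per_day = SECONDS_PER_DAY / seconds_per_step
    return steps_per_day * timestep_fs / FS_PER_NS


# Timings quoted in the message above, all on a single M2090.
timings = {
    "6 cores, default benchmark": 0.049,
    "6 cores, outputs every 1000 steps": 0.038,
    "1 core, outputs every 1000 steps": 0.122,
}

for label, s_per_step in timings.items():
    print(f"{label}: {ns_per_day(s_per_step):.2f} ns/day")
```

With a different timestep (for example 2 fs), pass it as timestep_fs; the
result scales linearly with the timestep.
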
>>
>> Hope this helps.
>>
>> PS, if anyone is interested, I ran multiple simultaneous runs with
>> different combinations of CPU and GPU allocations and obtained the following
>> results:
>>
>> ApoA1 (10,000 steps, timestep = 1 fs, output every 1000 steps)
>> 1 run  (12 threads, 4x M2090)        = 0.015 s/step
>> 1 run  (24 threads, 4x M2090)        = 0.016 s/step
>> 2 runs (6 threads, 2x M2090) each    = 0.027 s/step
>> 2 runs (12 threads, 4x M2090 shared) = 0.026 s/step
>> 4 runs (3 threads, 1x M2090) each    = 0.051 s/step
>> 4 runs (6 threads, 4x M2090 shared)  = 0.046 s/step
>> 8 runs (3 threads, 4x M2090 shared)  = 0.088 s/step
>>
>> (Hyperthreading is ON)
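
One way to read the table above is in terms of total node throughput rather
than per-run speed. The sketch below is a rough illustration only: it assumes
each s/step figure is the wall-clock time of a single run and that all runs in
a row executed concurrently, which the table suggests but does not state
explicitly. The helper name aggregate_steps_per_second is hypothetical.

```python
# Aggregate throughput for the simultaneous-run results quoted above.
# Assumption: each s/step value is per run, with all runs in a row
# executing concurrently on the same node.

# (number of simultaneous runs, allocation per row, seconds per step per run)
results = [
    (1, "12 threads, 4x M2090", 0.015),
    (1, "24 threads, 4x M2090", 0.016),
    (2, "6 threads, 2x M2090 each", 0.027),
    (2, "12 threads, 4x M2090 shared", 0.026),
    (4, "3 threads, 1x M2090 each", 0.051),
    (4, "6 threads, 4x M2090 shared", 0.046),
    (8, "3 threads, 4x M2090 shared", 0.088),
]


def aggregate_steps_per_second(n_runs, seconds_per_step):
    """Total steps per second summed over all concurrent runs."""
    return n_runs / seconds_per_step


for n_runs, allocation, s_per_step in results:
    total = aggregate_steps_per_second(n_runs, s_per_step)
    print(f"{n_runs} run(s), {allocation}: {total:.1f} steps/s in total")

# Row with the highest combined throughput under the assumption above.
best = max(results, key=lambda r: aggregate_steps_per_second(r[0], r[2]))
print("Highest aggregate throughput:", best[0], "run(s),", best[1])
```

Under that assumption, a single job across all four cards finishes soonest,
while several smaller simultaneous jobs complete more steps in total per
second.
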
>>
>> Cheers,
>>
>> Jason A. Roberts
>> Senior Medical Scientist
>> National Enterovirus Reference Laboratory
>> WHO Poliomyelitis Regional Reference Laboratory
>> VIDRL, 10 Wreckyn Street,
>> North Melbourne, Australia, 3051
>> Phone: +613 9342 2607
>> Fax: +613 9342 2665
>> email: polio_at_mh.org.au (lab enquiries)
>> web site: www.vidrl.org.au
>>
>> Date: Fri, 14 Sep 2012 09:50:41 -0400
>> From: Guanglei Cui <amber.mail.archive_at_gmail.com>
>> Subject: Re: namd-l: GTX-660 Ti benchmark
>>
>> Hi,
>>
>> I'm curious what kind of performance I should expect from an M2090 card
>> (Intel Xeon X5670, CentOS 5.8). With 1 CPU and 1 GPU, I get 0.11 s/step on
>> ApoA1 (2000 steps, timestep 1) using the NAMD 2.9 multicore CUDA binary from
>> the NAMD website. I suspect this is a reasonable speed. I wonder if someone
>> would kindly point out what a reasonable expectation is for this type of
>> setup, and how to achieve that. Thanks very much.
>>
>> Guanglei
>>
>> On Thu, Sep 13, 2012 at 11:10 PM, Wenyu Zhong <wenyuzhong_at_gmail.com>
>> wrote:
>>> Sorry, a correction.
>>>
>>> The power consumption with i5 @ 3.7 GHz + 660 Ti running ApoA1 is about
>>> 200 W, and with i5 @ 3.7 GHz + 2x 460 it is about 260 W.
>>>
>>> Wenyu
>>
> --
> Aron Broom M.Sc
> PhD Student
> Department of Chemistry
> University of Waterloo

-- 
Guanglei Cui

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:22:35 CST