Re: GTX-660 Ti benchmark

From: Guanglei Cui (amber.mail.archive_at_gmail.com)
Date: Wed Sep 19 2012 - 08:32:38 CDT

Thanks very much, Norman. This has been very helpful.

Regards,
Guanglei

On Wed, Sep 19, 2012 at 2:50 AM, Norman Geist
<norman.geist_at_uni-greifswald.de> wrote:
> Hi,
>
>> -----Original Message-----
>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
>> Behalf Of Guanglei Cui
>> Sent: Tuesday, 18 September 2012 16:39
>> To: namd-l_at_ks.uiuc.edu
>> Subject: Re: namd-l: GTX-660 Ti benchmark
>>
>> Hi Aron and Norman,
>>
>> Thanks for the additional insights. I guess this explains why I saw
>
> You are welcome.
>
>> slightly better performance on my Quadro 4000 than on the M2090.
>
> Sounds reasonable.
>
>>
>> I guess for small-scale operations (as opposed to larger supercomputing
>> centers), spending money on two M2090 cards doesn't make much
>> sense. One additional question ... for two M2090 cards in a
>
> Of course.
>
>> single node (12 cores), what's the optimal way of using them? In
>> my experience, using two simultaneously doesn't seem to improve
>> namd2.9 (CUDA and multicore) performance very much.
>
> I can only speak for the C2050. I have four machines, each with two 6-core
> Xeons and two Tesla C2050s, plugged into PCIe 2.0. The speedup from using
> multiple GPUs depends on the molecular system's properties, its size, and
> the available CPU power. I got the following results on my GPU nodes for a
> one-million-atom system (protein in water, mostly water):
>
> CPUs  GPUs  time/step (s)
>
>   1    0    20
>   2    0     9.7
>   3    0     6.5
>   4    0     5
>   5    0     4
>   6    0     3.5
>   7    0     3
>   8    0     2.5
>   9    0     2.2
>  10    0     2
>  11    0     1.83
>  12    0     1.8
>
>   1    1     2
>   2    1     1.45
>   3    1     1.2
>   4    1     0.8
>   5    1     0.8
>   6    1     0.7
>   7    1     0.7
>   8    1     0.66
>   9    1     0.66
>  10    1     0.63
>  11    1     0.63
>  12    1     0.62
>
>   1    2     -
>   2    2     1.35
>   3    2     1.1
>   4    2     0.75
>   5    2     0.65
>   6    2     0.51
>   7    2     0.51
>   8    2     0.44
>   9    2     0.43
>  10    2     0.4
>  11    2     0.4
>  12    2     0.36
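>
> For reference, a scan like this can be scripted. Below is a minimal sketch,
> assuming the multicore CUDA binary and a config file named system.namd
> (placeholder names; the CPU-only rows need a non-CUDA namd2 build):
>
> #!/bin/bash
> # Hypothetical benchmark scan over thread counts with one and two GPUs.
> for p in $(seq 1 12); do
>     ./namd2 +idlepoll +p$p +devices 0   system.namd > p${p}_1gpu.log
>     ./namd2 +idlepoll +p$p +devices 0,1 system.namd > p${p}_2gpu.log
> done
> # NAMD prints "Info: Benchmark time: ... s/step" early in each run.
> grep "Benchmark time" *.log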
>
> However, two single jobs, each using one of the GPUs, will also slow each
> other down. So it's up to you how you use them. I usually don't care about
> that: I wrote a little GPU allocator to extend the SunGridEngine and just
> let the jobs land on the machines as the queue decides. And I won't stop
> waiting for PME to be moved to the GPU, as that would remove a big part of
> the PCIe bottleneck for NAMD.
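>
> If you do run two jobs side by side, each can be pinned to its own GPU; a
> minimal sketch, with made-up job names:
>
> ./namd2 +idlepoll +p6 +devices 0 jobA.namd > jobA.log &
> ./namd2 +idlepoll +p6 +devices 1 jobB.namd > jobB.log &
> wait
>
> A scheduler can achieve the same by setting CUDA_VISIBLE_DEVICES per job
> before the namd2 call.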
>
> Regards
> Norman Geist
>
>>
>> Regards,
>> Guanglei
>>
>> On Tue, Sep 18, 2012 at 2:36 AM, Norman Geist
>> <norman.geist_at_uni-greifswald.de> wrote:
>> > Hello,
>> >
>> >
>> >
>> > Just some comments:
>> >
>> >
>> >
>> > Nvidia's workstation series is called Quadro, so it's simply wrong to
>> > call the professional HPC Tesla series workstation cards, and also to
>> > confuse them with consumer hardware. The workstation cards are consumer
>> > hardware too; the Tesla cards are non-consumer hardware.
>> >
>> >
>> >
>> > So:
>> >
>> >
>> >
>> > GTX - consumer - gaming
>> >
>> > Quadro - consumer - workstation
>> >
>> > Tesla - professional - HPC
>> >
>> >
>> >
>> > But I agree with the other points you mentioned. Of course the gaming
>> > cards have higher clocks and therefore better performance, as they are
>> > meant for gaming, where people don't care about power consumption and
>> > heat emission. ECC also slows the Tesla down a little. But a professional
>> > computing centre can't use these overclocked gaming cards, given the
>> > heavy cooling they require and their lack of administration features. For
>> > a few nodes, or a workstation, it's fine to stay with consumer hardware;
>> > in a professional setting they are not the best choice IMHO.
>> >
>> >
>> >
>> > Regards
>> >
>> > Norman Geist.
>> >
>> >
>> >
>> > From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
>> > Behalf Of Aron Broom
>> > Sent: Tuesday, 18 September 2012 03:57
>> > To: Guanglei Cui
>> > Cc: namd-l_at_ks.uiuc.edu
>> > Subject: Re: namd-l: GTX-660 Ti benchmark
>> >
>> >
>> >
>> > Guanglei,
>> >
>> > Just a quick point to make about cards: keep in mind that the very
>> > expensive workstation cards aren't actually any faster than their
>> > consumer counterparts. For instance, comparing a GTX 580 with an M2090,
>> > the 580 has the same number of cores and actually faster clock and memory
>> > speeds. The M2090 has more memory, and that memory has error-correcting
>> > code, hence the extra bucks. For the Kepler series (I'm not sure the
>> > workstation cards are out yet?) the consumer cards will also be faster
>> > than the workstation ones, at least in terms of single precision, but I
>> > think it's supposed to be the reverse for double precision.
>> >
>> > ~Aron
>> >
>> > On Mon, Sep 17, 2012 at 4:35 PM, Guanglei Cui
>> > <amber.mail.archive_at_gmail.com> wrote:
>> >
>> > Hi Jason and Thomas,
>> >
>> > Thanks very much for your input. This has been very useful, as I was
>> > struggling to gauge my expectations for the GPU workstation we have,
>> > since I had no point of comparison. It seems Jason may have a similar
>> > hardware setup. The OS installed here is CentOS 5.8; I'm not sure if
>> > this matters.
>> >
>> > Thomas, if your timing was from 1 GPU/1 CPU, I'd be thoroughly upset,
>> > because that is almost twice as fast as what I can get on a much more
>> > expensive card. Would you be able to share more information about your
>> > OS and any configuration settings that matter?
>> >
>> > Regards,
>> > Guanglei
>> >
>> >
>> > On Sun, Sep 16, 2012 at 6:08 PM, Roberts, Jason
>> > <Jason.Roberts_at_mh.org.au> wrote:
>> >> Hi Guanglei,
>> >>
>> >> We are running a 2U rack (2x Xeon E5645, 4x M2090), and although I don't
>> >> have the same setup, I ran the ApoA1 benchmark allocating 6 cores and
>> >> one M2090 (./namd2 +idlepoll +p6 +devices 0 apoa1.namd > apoa1_6.out).
>> >> The default benchmark gave 0.049 s/step. I changed the outputEnergies
>> >> and outputTiming values to 1000 and extended the run to 10,000 steps,
>> >> which gave 0.038 s/step.
>> >>
>> >> If I run the last simulation with 1 core and 1 GPU (./namd2 +idlepoll
>> >> +p1 +devices 0 apoa1.namd > apoa1_1.out), I get 0.122 s/step.
>> >>
>> >> Hope this helps.
>> >>
>> >> PS: if anyone is interested, I ran multiple simultaneous runs with
>> >> different combinations of CPU and GPU allocations and obtained the
>> >> following results (a launch sketch follows the list):
>> >>
>> >> ApoA1 (10,000 steps, timestep = 1, outputs every 1000 steps)
>> >> 1 run  (12 threads, 4x M2090)        = 0.015 s/step
>> >> 1 run  (24 threads, 4x M2090)        = 0.016 s/step
>> >> 2 runs (6 threads, 2x M2090 each)    = 0.027 s/step
>> >> 2 runs (12 threads, 4x M2090 shared) = 0.026 s/step
>> >> 4 runs (3 threads, 1x M2090 each)    = 0.051 s/step
>> >> 4 runs (6 threads, 4x M2090 shared)  = 0.046 s/step
>> >> 8 runs (3 threads, 4x M2090 shared)  = 0.088 s/step
>> >>
>> >> (Hyperthreading is ON)
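>> >>
>> >> A minimal sketch of how two of these simultaneous runs can be launched
>> >> (the config file names are made up):
>> >>
>> >> # Two concurrent runs, 12 threads each, sharing all four M2090s.
>> >> ./namd2 +idlepoll +p12 +devices 0,1,2,3 run1.namd > run1.out &
>> >> ./namd2 +idlepoll +p12 +devices 0,1,2,3 run2.namd > run2.out &
>> >> wait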
>> >>
>> >> Cheers,
>> >>
>> >> Jason A. Roberts
>> >> Senior Medical Scientist
>> >> National Enterovirus Reference Laboratory
>> >> WHO Poliomyelitis Regional Reference Laboratory
>> >> VIDRL, 10 Wreckyn Street,
>> >> North Melbourne, Australia, 3051
>> >> Phone: +613 9342 2607
>> >> Fax: +613 9342 2665
>> >> email: polio_at_mh.org.au (lab enquiries)
>> >> web site: www.vidrl.org.au
>> >>
>> >> Date: Fri, 14 Sep 2012 09:50:41 -0400
>> >> From: Guanglei Cui <amber.mail.archive_at_gmail.com>
>> >> Subject: Re: namd-l: GTX-660 Ti benchmark
>> >>
>> >> Hi,
>> >>
>> >> I'm curious what kind of performance I should expect from an M2090 card
>> >> (Intel Xeon X5670, CentOS 5.8). With 1 CPU and 1 GPU, I get 0.11 s/step
>> >> on ApoA1 (2000 steps, timestep 1) using the NAMD 2.9 multicore CUDA
>> >> binary from the NAMD website. I suspect this is a reasonable speed, but
>> >> I wonder if someone would kindly point out what a reasonable expectation
>> >> is for this type of setup, and how to achieve it. Thanks very much.
>> >>
>> >> Guanglei
>> >>
>> >> On Thu, Sep 13, 2012 at 11:10 PM, Wenyu Zhong <wenyuzhong_at_gmail.com>
>> >> wrote:
>> >>> Sorry, a correction.
>> >>>
>> >>> The power consumption with the i5 @ 3.7 GHz + 660 Ti running ApoA1 is
>> >>> about 200 W, and with the i5 @ 3.7 GHz + two GTX 460s it is about
>> >>> 260 W.
>> >>>
>> >>> Wenyu
>> >>
>> >>
>> >>
>> >> --
>> >> Guanglei Cui
>> >>
>> >>
>> >
>> >
>> > --
>> > Guanglei Cui
>> >
>> >
>> >
>> >
>> > --
>> > Aron Broom M.Sc
>> > PhD Student
>> > Department of Chemistry
>> > University of Waterloo
>>
>>
>>
>> --
>> Guanglei Cui
>

-- 
Guanglei Cui
