RE: GTX-660 Ti benchmark

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed Sep 19 2012 - 01:50:58 CDT

Hi,

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Guanglei Cui
> Sent: Tuesday, September 18, 2012 16:39
> To: namd-l_at_ks.uiuc.edu
> Subject: Re: namd-l: GTX-660 Ti benchmark
>
> Hi Aron and Norman,
>
> Thanks for the additional insights. I guess this explains why I saw

You are welcome.

> slightly better performance on my Quadro 4000 than on M2090.

Sounds reasonable.

>
> I guess for small-scale operation (as opposed to larger supercomputing
> centers), spending money on two M2090 cards doesn't make too much
> sense. One additional question ... for two M2090 cards in a

Of course.

> single node (12 cores), what's the optimal way of using them? In
> my experience, using two simultaneously doesn't seem to improve the
> namd2.9 (cuda and multicore) performance very much.

I can only speak for the C2050. I have four machines, each with two 6-core
Xeons and two Tesla C2050 cards, plugged into PCIe 2.0. The speedup from
using multiple GPUs depends on the properties of the molecular system, the
system size, and the available CPU power. I got the following results on my
GPU nodes for a one-million-atom system (protein in water, mostly water):

CPUs  GPUs  time/step (s)

   1    0   20
   2    0   9.7
   3    0   6.5
   4    0   5
   5    0   4
   6    0   3.5
   7    0   3
   8    0   2.5
   9    0   2.2
  10    0   2
  11    0   1.83
  12    0   1.8

   1    1   2
   2    1   1.45
   3    1   1.2
   4    1   0.8
   5    1   0.8
   6    1   0.7
   7    1   0.7
   8    1   0.66
   9    1   0.66
  10    1   0.63
  11    1   0.63
  12    1   0.62

   1    2   -
   2    2   1.35
   3    2   1.1
   4    2   0.75
   5    2   0.65
   6    2   0.51
   7    2   0.51
   8    2   0.44
   9    2   0.43
  10    2   0.4
  11    2   0.4
  12    2   0.36
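
Such a sweep is easy to script. Here is a minimal sketch, assuming the
stock multicore and multicore-CUDA binaries from the NAMD site; the binary
paths and the config file name are placeholders, and this is not the exact
script I used:

  #!/bin/bash
  # Scaling sweep: CPU-only, 1-GPU and 2-GPU runs for 1..12 cores.
  # CPU_BIN/GPU_BIN and CONF are placeholders for your own setup.
  CPU_BIN=$HOME/NAMD_2.9_Linux-x86_64-multicore/namd2
  GPU_BIN=$HOME/NAMD_2.9_Linux-x86_64-multicore-CUDA/namd2
  CONF=million_atom_system.namd

  for P in $(seq 1 12); do
      $CPU_BIN +p$P $CONF > cpu_p$P.log                          # CPU only
      $GPU_BIN +idlepoll +p$P +devices 0 $CONF > gpu1_p$P.log    # one GPU
      $GPU_BIN +idlepoll +p$P +devices 0,1 $CONF > gpu2_p$P.log  # both GPUs
  done
  # the per-step timings are printed on the TIMING: lines of each log
  grep "TIMING:" *.log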

However, running two single jobs, each using one of the GPUs, will also slow
both jobs down. So it's up to you how you use them. I usually don't care
about that; I wrote a little GPU allocator to extend the SunGridEngine and
just let the jobs land on the machines as the queue decides. And I'm still
waiting for PME to be moved to the GPU, as that would remove a big part of
the PCIe bottleneck for NAMD.
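
In case it helps, the core of such an allocator is just an atomic lock per
device. A minimal sketch of a queue prolog, assuming two GPUs per node and
lock directories under /var/lock/gpu (my real implementation does more
bookkeeping):

  #!/bin/bash
  # Queue prolog sketch: claim a free GPU via an atomic mkdir lock.
  # Two devices per node and /var/lock/gpu are assumptions.
  LOCKDIR=/var/lock/gpu
  mkdir -p "$LOCKDIR"
  for DEV in 0 1; do
      # mkdir either creates the lock or fails if the device is taken
      if mkdir "$LOCKDIR/$DEV" 2>/dev/null; then
          echo $DEV > "$TMPDIR/gpu_device"   # SGE's per-job temp dir
          exit 0
      fi
  done
  echo "no free GPU on $(hostname)" >&2
  exit 1

The job script then reads $TMPDIR/gpu_device and passes it to namd2 via
+devices, and a matching epilog removes the lock directory again.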

Regards
Norman Geist

>
> Regards,
> Guanglei
>
> On Tue, Sep 18, 2012 at 2:36 AM, Norman Geist
> <norman.geist_at_uni-greifswald.de> wrote:
> > Hello,
> >
> >
> >
> > Just some comments:
> >
> >
> >
> > Nvidia's workstation series is called Quadro, so it's simply wrong to
> > call the professional HPC Tesla series a workstation card, and also
> > to confuse them with consumer hardware. The workstation cards are
> > consumer hardware too; the Tesla cards are non-consumer hardware.
> >
> >
> >
> > So:
> >
> >
> >
> > GTX – consumer - gaming
> >
> > Quadro – consumer - workstation
> >
> > Tesla – professional - HPC
> >
> >
> >
> > But I agree with the other points you mentioned. Of course the gaming
> > cards have higher clocks and therefore better performance, as they
> > are meant for gaming, where people don't care about power consumption
> > and heat emission. The ECC also slows the Tesla down a little. But a
> > professional computing centre can't use these overclocked gaming
> > cards, given the heavy cooling they require and their lack of
> > administration features. For a few nodes only, or a workstation, it's
> > OK to stay with consumer hardware, but in a professional setting they
> > are not the best choice IMHO.
> >
> >
> >
> > Regards
> >
> > Norman Geist.
> >
> >
> >
> > From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> > Behalf Of Aron Broom
> > Sent: Tuesday, September 18, 2012 03:57
> > To: Guanglei Cui
> > Cc: namd-l_at_ks.uiuc.edu
> > Subject: Re: namd-l: GTX-660 Ti benchmark
> >
> >
> >
> > guanglei,
> >
> > just a quick point to make about cards: keep in mind that the very
> > expensive workstation cards aren't actually any faster than their
> > consumer counterparts. For instance, comparing a GTX580 with an
> > M2090: the 580 has the same number of cores and actually faster clock
> > and memory speeds. The M2090 has more memory, and that memory has
> > error-correcting code, hence the extra bucks. For the Kepler series
> > (I'm not sure the workstation cards are out yet?) the consumer cards
> > will also be faster than the workstation ones, at least in terms of
> > single precision, but I think it's supposed to be the reverse for
> > double precision.
> >
> > ~Aron
> >
> > On Mon, Sep 17, 2012 at 4:35 PM, Guanglei Cui
> <amber.mail.archive_at_gmail.com>
> > wrote:
> >
> > Hi Jason and Thomas,
> >
> > Thanks very much for your input. This is very useful, as I was
> > struggling to gauge my expectations on the GPU workstation we have
> > since I have no point of comparison. It seems Jason may have a
> > similar hardware setup. The OS installed here is CentOS 5.8. I'm not
> > sure if this matters.
> >
> > Thomas, if your timing was from 1GPU/1CPU, I'd be thoroughly upset
> > 'cause that is almost twice as fast as I could get on a much more
> > expensive card. Would you be able to share additional information on
> > your OS and any configurations that matter?
> >
> > Regards,
> > Guanglei
> >
> >
> > On Sun, Sep 16, 2012 at 6:08 PM, Roberts, Jason
> <Jason.Roberts_at_mh.org.au>
> > wrote:
> >> Hi Guanglei,
> >>
> >> We are running a 2U rack (2x Xeon E5645, 4x M2090) and although I
> >> don't have the same setup, I ran the Apoa1 benchmark allocating 6
> >> cores and 1 M2090 (./namd2 +idlepoll +p6 +devices 0 apoa1.namd >
> >> apoa1_6.out). The default benchmark gave 0.049 s/step. I changed the
> >> outputEnergies and outputTiming values to 1000 and extended the run
> >> to 10000 steps and got 0.038 s/step.
> >>
> >> If I run the last simulation with 1 core and 1 GPU (./namd2
> >> +idlepoll +p1 +devices 0 apoa1.namd > apoa1_1.out) I get 0.122
> >> s/step.
> >>
> >> Hope this helps.
> >>
> >> PS, if anyone is interested, I ran multiple simultaneous runs with
> >> different combinations of CPU and GPU allocations and obtained the
> following
> >> results:
> >>
> >> Apoa1 (10,000 steps, timestep = 1, outputs at 1000steps)
> >> 1 run (12xThreads, 4xM2090) = 0.015 s/step
> >> 1 run (24xThreads, 4xM2090) = 0.016 s/step
> >> 2 runs (6xThreads, 2xM2090) each = 0.027 s/step
> >> 2 runs (12xThreads, 4xM2090 shared) = 0.026 s/step
> >> 4 runs (3xThreads, 1xM2090) each = 0.051 s/step
> >> 4 runs (6xThreads, 4xM2090 shared) = 0.046 s/step
> >> 8 runs (3xThreads, 4xM2090 shared) = 0.088 s/step
> >>
> >> (Hyperthreading is ON)
> >>
> >> Cheers,
> >>
> >> Jason A. Roberts
> >> Senior Medical Scientist
> >> National Enterovirus Reference Laboratory
> >> WHO Poliomyelitis Regional Reference Laboratory
> >> VIDRL, 10 Wreckyn Street,
> >> North Melbourne, Australia, 3051
> >> Phone: +613 9342 2607
> >> Fax: +613 9342 2665
> >> email: polio_at_mh.org.au (lab enquiries)
> >> web site: www.vidrl.org.au
> >>
> >> Date: Fri, 14 Sep 2012 09:50:41 -0400
> >> From: Guanglei Cui <amber.mail.archive_at_gmail.com>
> >> Subject: Re: namd-l: GTX-660 Ti benchmark
> >>
> >> Hi,
> >>
> >> I'm curious what kind of performance I should expect from an M2090
> >> card (Intel Xeon X5670, CentOS 5.8). With 1 CPU and 1 GPU, I get
> >> 0.11 s/step on Apoa1 (2000 steps, timestep 1) using the NAMD 2.9
> >> multicore CUDA binary from the NAMD website. I suspect this is a
> >> reasonable speed. I wonder if someone would kindly point out what a
> >> reasonable expectation is for this type of setup, and how to achieve
> >> it. Thanks very much.
> >>
> >> Guanglei
> >>
> >> On Thu, Sep 13, 2012 at 11:10 PM, Wenyu Zhong <wenyuzhong_at_gmail.com>
> >> wrote:
> >>> Sorry, a correction.
> >>>
> >>> The power consumption with the i5@3.7GHz + 660 Ti running Apoa1 is
> >>> about 200 W, and with the i5@3.7GHz + 2x 460 it is about 260 W.
> >>>
> >>> Wenyu
> >>
> >>
> >>
> >> --
> >> Guanglei Cui
> >>
> >>
> >
> >
> > --
> > Guanglei Cui
> >
> >
> >
> >
> > --
> > Aron Broom M.Sc
> > PhD Student
> > Department of Chemistry
> > University of Waterloo
>
>
>
> --
> Guanglei Cui
