Re: no. of CPUs for optimal GTX-690 performance

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Wed Nov 07 2012 - 02:45:29 CST

On Wed, Nov 7, 2012 at 9:16 AM, Aron Broom <broomsday_at_gmail.com> wrote:

> Although I also wonder about planning for the future. That is, in
> 6 months or whenever, will the next NAMD release have a binary that is
> completely ported to the GPU minus the very infrequent file I/O? If so,
> those nice CPUs would be largely wasted.
>

for a parallel MD application, there is - for the foreseeable future - very
little reason to run the entire calculation on the GPU.

- only some parts of the calculation run faster on the GPU:
  bonded interactions would be slower, and lots of add-ons don't
  make sense on the GPU (do you want to lose Tcl scripting?)

- you will have to exchange data with neighboring processes

- since you cannot avoid keeping up-to-date coordinate
  data available on the CPU, it will be *even faster* to run CPU
  and GPU concurrently using asynchronous GPU tasks.

- upcoming Tesla models will make GPU sharing between
  CPU cores much more efficient (sadly, that is not likely
  to be enabled on the much more attractively priced
  consumer grade GPUs, which were the reason why so
  many people got into GPU computing in the first place).
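
the point about running CPU and GPU concurrently can be illustrated
with a toy timing model. this is only a sketch: the function names and
the per-step costs are made-up assumptions, not measurements from NAMD
or any other code. it compares doing GPU work, transfer, and CPU work
one after another with letting the CPU work proceed while an
asynchronous GPU task is in flight.

```python
# toy timing model, NOT a measurement: all numbers below are assumptions

def step_time_serial(t_gpu, t_cpu, t_xfer):
    """GPU kernel, GPU<->host transfer, and CPU work run back to back."""
    return t_gpu + t_xfer + t_cpu


def step_time_overlapped(t_gpu, t_cpu, t_xfer):
    """CPU work runs while the asynchronous GPU task (kernel + transfer)
    is in flight, so the step costs as much as the slower of the two."""
    return max(t_gpu + t_xfer, t_cpu)


# hypothetical per-step costs in milliseconds
t_gpu, t_cpu, t_xfer = 6.0, 4.0, 1.0

serial = step_time_serial(t_gpu, t_cpu, t_xfer)
overlapped = step_time_overlapped(t_gpu, t_cpu, t_xfer)
print(f"serial: {serial} ms/step, overlapped: {overlapped} ms/step")
```

in this model the CPU part is hidden completely as long as it is
shorter than the GPU part plus transfer; real codes will overlap less
than perfectly.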

you can have a "preview" of this future, if you look at the
LAMMPS MD code, since it has two complementary
modules for GPU acceleration. one tries to keep the data
on the GPU, the other aims to only use GPU acceleration
on parts of the code where the gain is the largest.

the "(almost) everything on the GPU" approach
- doesn't allow GPU sharing (well you can, but it is horribly slow)
- is most efficient when you have a large number of atoms per GPU
- benefits significantly from GPUDirect
- does time integration etc. on the GPU
- still runs bonded interactions on the CPU (benefits from OpenMP there)
  (one can see from MD codes like HOOMD that bonds on the GPU are slower)
- has a performance drop as soon as you need a feature that is not
  supported on the GPU, like adding custom forces (due to GPU-to-host I/O)
=> best for small clusters and workstation/desktop/laptop environments

the "minimalist" approach
- is fastest when running across a large number of CPUs+GPUs
   with a moderate number of atoms per GPU.
- only supports non-bonded, neighborlists, and reciprocal space (pppm)
  computation on the GPU.
- runs bonded and (optionally) reciprocal space concurrently
  with the GPU (for a larger number of processes it is faster
  to not use the GPU for pppm)
=> best for large clusters and supercomputers (e.g. Titan @ ORNL).

HTH,
     axel.

>
> Just a thought. I guess it depends on their cost, but I'd imagine a nice
> workstation CPU is ~the cost of another GTX-690.
>
> ~Aron
>
>
> On Wed, Nov 7, 2012 at 2:56 AM, Norman Geist <
> norman.geist_at_uni-greifswald.de> wrote:
>
>> Hi,
>>
>> just a few things came to my mind:
>>
>> 1. I'm using each GPU (Tesla C2050) with one Xeon E5649 6-core
>> 2.53GHz, which leads to nice utilization of the GPU with a system that's
>> big enough (but with fullElectFrequency 4)
>>
>> 2. Sandy Bridge series have doubled floating point performance
>> with 8/4 single/double precision flops/cycle.
>>
>> 3. Kepler GPUs have doubled performance due to 3 times the cores
>> and half the clock rate compared to Fermi.
>>
>> All this means that you are going to pair doubled CPU power with
>> doubled GPU power. Therefore I would say one 6-core Sandy Bridge per Kepler
>> GPU should give the same ratio as mine (both doubled in performance),
>> neglecting the Tesla feature of being able to use PCIe bi-directionally
>> thanks to its two DMA engines (not used by NAMD). PCIe 3 is surely
>> necessary, as the need for data transfer will increase heavily.
>>
>> Good luck
>>
>> Norman Geist.
>>
>> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] *On
>> Behalf Of *Giacomo Fiorin
>> *Sent:* Tuesday, 6 November 2012 21:46
>> *To:* mpurdy_at_virginia.edu
>> *Cc:* NAMD list
>> *Subject:* Re: namd-l: no. of CPUs for optimal GTX-690 performance
>>
>> One additional thing that complicates things for Sandy Bridge processors
>> is Turbo Boost. You had equal speed between 4 cores and 7 cores, so
>> things were not going so well. Many people have dealt with this problem
>> for benchmarking purposes, and posted different solutions online to disable
>> it. (Ajasja: how are your scalings without GPU?)
>>
>> In any case, the main problem is most often the limited bandwidth between
>> CPU and GPU, like Ajasja and Aron already said. The motherboard that
>> you're planning to use is a good choice, the one you're currently making
>> tests on may not be: what is it? Also, without knowing which Opterons you
>> had or the PCI-e bus speed, the comparison you made with the ThinkPad is
>> not informative.
>>
>> That said, I don't think it's worth going beyond 1 CPU for every GPU.
>> First, it will be hard to find suitable motherboards. Second and most
>> important, 12-16 CPU cores plus 2 GPUs all exchanging data on the same bus
>> will probably already clog up the PCI-e bus. I agree with Ajasja that
>> hyperthreading may be useless, and actually harmful if you're sharing the
>> bandwidth (that would be 24-32 CPU cores.. again all sharing the same bus).
>>
>> As for which CPU, I would vote for fewer cores but a higher clock (e.g. Xeon
>> 2640 or 2667), if you're planning to use them with a GPU.
>>
>> Giacomo
>>
>> On Tue, Nov 6, 2012 at 2:01 PM, Michael Purdy <mdp3w_at_virginia.edu> wrote:
>>
>> Hello, I am running NAMD simulations (multicore-CUDA) on a ThinkPad with
>> dual Core i7-2760QM CPUs and a Quadro 2000M running Debian. For a 150k atom
>> system I get performance like this:
>>
>> Benchmark time: 4 CPUs 0.287062 s/step 1.66124 days/ns 387.641 MB memory
>> Benchmark time: 7 CPUs 0.289229 s/step 1.67378 days/ns 428.574 MB memory
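
the days/ns figures in these benchmark lines follow from s/step once a
timestep is fixed. a minimal sketch, assuming a 2 fs timestep (the
timestep is not stated in the post, but this value is consistent with
the numbers quoted); the function name is made up for illustration:

```python
SECONDS_PER_DAY = 86400.0


def days_per_ns(seconds_per_step, timestep_fs=2.0):
    """Convert wall-clock seconds per MD step into days per simulated ns.

    timestep_fs is an ASSUMPTION (2 fs); it is not given in the post.
    """
    steps_per_ns = 1.0e6 / timestep_fs  # 1 ns = 1e6 fs
    return seconds_per_step * steps_per_ns / SECONDS_PER_DAY


# s/step values taken from the two benchmark lines quoted above
for cpus, s_per_step in [(4, 0.287062), (7, 0.289229)]:
    print(f"{cpus} CPUs: {days_per_ns(s_per_step):.5f} days/ns")
```

with that assumption the two quoted s/step values come out at the
quoted 1.66124 and 1.67378 days/ns.
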
>>
>> Things are going well so we purchased a GTX-690 which we installed in a
>> workstation with two dual core Opterons, which is evidently far short of
>> the CPU cores we need to get the most out of the 2 GPUs and 3072 CUDA cores.
>> Performance was just slightly better than the ThinkPad:
>>
>> Benchmark time: 4 CPUs ~0.2 s/step ~1.4 days/ns
>>
>> We would like to build a new workstation to get the most out of the
>> GTX-690 and I'd like to know how many CPU cores we need. I'm considering
>> two Core i7-3930k (6-core/12-thread) or two Xeon E5-2650
>> (8-core/16-thread). Will either of these be a good match for the GTX-690 or
>> will I still be running short on CPUs? The current plan is to build
>> this on an Asus Z9PE-D8 WS board.
>>
>> Michael
>>
>
>
>
> --
> Aron Broom M.Sc
> PhD Student
> Department of Chemistry
> University of Waterloo
>
>

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
International Centre for Theoretical Physics, Trieste. Italy.

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:22:13 CST