Re: OpenCL and AMD GPUs

From: Thomas Bishop (bishop_at_latech.edu)
Date: Wed Nov 23 2011 - 10:09:43 CST

Axel,
Twice you have mentioned 4-way 8-core systems.
Why not 4-way 16-core AMD systems w/out any GPUs?
(this makes a nice farm instead of a cluster)

But would 40 Gbps IB keep up w/ such nodes,
or are there other considerations w/ 4-way 16-core nodes?

Thanks
Tom

On 11/23/2011 09:52 AM, Axel Kohlmeyer wrote:
> On Wed, Nov 23, 2011 at 10:06 AM, Aron Broom<broomsday_at_gmail.com> wrote:
>> @Nicholas,
>>
>>> I probably missed your point here, but on a Kentsfield Q6600 quad and
>>> using the ApoA1 test we see x5 acceleration using a GTX460 card (going
>>> from 0.44 ns/day without CUDA, to 2.22 ns/day with CUDA). Going five times
>>> faster is definitely not 'practically useless'. Having said that, for the
>>> needs of 'high powered computing initiatives' I'm not qualified to speak.
>> That is actually more or less the improvement that I see with 2 M2070 cards
>> attached to a hexa-core xeon. That is quite surprising to me, and really
>> quite fantastic. What are the conditions of your system (size,
>> electrostatic cutoff, etc)? Certainly this completely negates my point if
>> you can get equivalent acceleration from a consumer card.
> there are three issues to consider:
> 1) consumer cards tend to have a higher clock on memory and GPU.
> for classical MD the memory access speed is pretty much dominant
> in terms of how fast MD can run. the best indicator for this is the
> performance of an MD code that runs entirely on the GPU, e.g. HOOMD
> http://codeblue.umich.edu/hoomd-blue/benchmarks.html
>
> 2) Tesla cards with ECC enabled take a (~10%?) performance hit
> in terms of memory access speed, which dominates performance
> for high-end hardware.
>
> 3) it matters what kind of PCI-e connection bandwidth you have.
> quite a few "GPU-ready" cluster nodes (to be equipped with M20x0
> GPUs) have only 8x PCI-e slots. for a code like NAMD that has
> to transfer a lot of information across the PCI-e bus in every time
> step, that matters a lot, particularly when oversubscribing the GPU.
> it gets worse for S2070 quad-GPU nodes, since those add PCI-e
> bridges where two GPUs have to share a PCI-e slot in the host.
>
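To put rough numbers on the PCI-e point, the sketch below estimates the per-step host-GPU transfer time on an x16 versus an x8 link. The atom count, per-atom payload, and effective bandwidths are illustrative assumptions, not figures from this thread.

    # rough per-timestep PCI-e transfer estimate for a NAMD-like code
    # assumptions (illustrative): 100k atoms, coordinates down and forces
    # back each step as 3 x 4-byte floats per atom, ~2x extra for buffers;
    # effective PCI-e 2.0 bandwidth ~6 GB/s (x16) vs ~3 GB/s (x8)
    atoms = 100_000
    bytes_per_step = atoms * 3 * 4 * 2 * 2   # coords + forces, 2x overhead

    for label, gb_per_s in [("x16", 6.0), ("x8", 3.0)]:
        t_ms = bytes_per_step / (gb_per_s * 1e9) * 1e3
        print(f"{label}: ~{t_ms:.2f} ms per step spent on transfers")

With GPU-accelerated non-bonded kernels taking only a few milliseconds per step, the difference between roughly 0.8 ms and 1.6 ms of transfer time is a noticeable fraction of the step, and it compounds when several processes oversubscribe one GPU.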
>> @ Axel
>>
>>> to get something implemented "soon" in a scientific software package,
>>> you have basically two options: do it yourself, or get the money to
>>> hire somebody to do it. there is a severely limited number of people
>>> capable (and willing) to do this kind of work, and unless somebody gets
>>> paid for doing something, there is very little chance of implementing
>>> something that already exists in some other shape or form,
>>> regardless of how much sense it makes to redo it.
>> Yes, I see the point. Actually I've found that NAMD has quite a lot of
>> features, and it's really quite impressive how well it compares with
>> expensive packages like AMBER. I know of at least one initiative to make a
> AMBER is *not* expensive. the performance of the amber MD codes
> (sander and pmemd) suffers from having to carry so much legacy
> around, since the amber package is much older than NAMD.
> just have a look at CHARMM, which has been around even longer.
> this is also one of the reasons why amber developers
> get a higher speedup from using GPUs, since their CPU reference
> is so much worse than the one in NAMD.
>
>> general CUDA->OpenCL code converter, done by Matt Harvey at Imperial College
>> London, called Swan, but I'm not sure it would work in this case as a high
>> degree of optimization is clearly needed.
> there are a ton of X->Y converters. none of them work perfectly.
>
> the better approach is to write code in a way that can be
> easily retargeted. CUDA code (when used through the driver,
> not the runtime interface) can be written and set up to look
> very similar to OpenCL code and vice versa. one of the two
> attempts to add GPU acceleration to LAMMPS, the one done
> by mike brown at ORNL, uses that approach:
> http://users.nccs.gov/~wb8/geryon/index.htm
> and we very recently succeeded in getting AMD hardware
> (it was previously only tested with NVIDIA hardware) working
> fairly well with OpenCL, using the exact same GPU kernels,
> mostly the same "glue code", plus the API abstraction from
> his geryon headers.
>
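To illustrate the retargeting idea (a simplified sketch of the approach, not the actual geryon code): the kernel body is written once, and the handful of syntactic differences between CUDA and OpenCL are hidden behind a small compatibility header that is prepended before the source is handed to the respective compiler. The macro names below are made up for illustration.

    # hypothetical "write the kernel once" scheme in the spirit of geryon
    KERNEL_BODY = r"""
    KERNEL_FN void pair_lj(GLOBAL_MEM float4 *pos, GLOBAL_MEM float4 *force, int n) {
        int i = GLOBAL_ID;
        if (i < n) {
            /* ... non-bonded force kernel, identical for both backends ... */
        }
    }
    """

    HEADERS = {
        "cuda": (
            '#define KERNEL_FN extern "C" __global__\n'
            "#define GLOBAL_MEM\n"
            "#define GLOBAL_ID (blockIdx.x * blockDim.x + threadIdx.x)\n"
        ),
        "opencl": (
            "#define KERNEL_FN __kernel\n"
            "#define GLOBAL_MEM __global\n"
            "#define GLOBAL_ID get_global_id(0)\n"
        ),
    }

    def kernel_source(backend):
        """Return one source string, ready for the chosen backend's compiler."""
        return HEADERS[backend] + KERNEL_BODY

    print(kernel_source("opencl"))

The glue code that allocates buffers and launches the kernel still differs between the CUDA driver API and OpenCL, but that is exactly the thin layer an abstraction like geryon wraps.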
>>> why no improvement. with SHAKE+RATTLE you should be able to run with
>>> a 2fs time step. that is a serious improvement in my book.
>> Yes, it's odd. When I move from "rigidbonds water" to "rigidbonds all",
>> and then go to a 2 fs timestep, the speed per iteration slows down enough
>> that the overall benefit in terms of ns/day is ~10%. I'm not sure why
>> the RATTLE algorithm is so costly in this case. I was using 2.8b2 though,
>> and I haven't checked the release notes for the final 2.8 or b3; maybe
>> there was an issue with RATTLE?
> perhaps 2 fs requires too many constraint iterations to achieve
> the desired accuracy; i would try 1.5 fs then. the problem with
> shake is that it requires additional communication for
> each iteration, and that may be a limiting factor, as your hardware
> seems to be limited in I/O bandwidth (which is a bad thing if
> you are after good GPU speedup, BTW).
>
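For reference, the arithmetic behind that ~10% figure; the per-step wall times below are made-up placeholders (only their ratio matters), chosen so the result matches the reported behaviour.

    # ns/day = timestep * (steps per day); doubling the timestep helps
    # little if constraining all bonds makes each step ~1.8x slower
    def ns_per_day(timestep_fs, seconds_per_step):
        steps_per_day = 86400.0 / seconds_per_step
        return timestep_fs * 1e-6 * steps_per_day   # fs -> ns

    base  = ns_per_day(1.0, 0.010)   # 1 fs, 10 ms/step, rigidbonds water
    rigid = ns_per_day(2.0, 0.018)   # 2 fs, but each step ~1.8x slower
    print(f"{base:.1f} vs {rigid:.1f} ns/day "
          f"(+{(rigid / base - 1) * 100:.0f}%)")

So the observation is consistent with "rigidbonds all" (plus the extra constraint communication) making each step nearly twice as expensive; unless that per-step penalty comes down, the larger timestep buys very little.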
>>> this statement is not correct. having the 1/8th double precision
>>> capability is fully sufficient for classical MD. the force computation
>>> can be and is done in single precision with very little loss of
>>> accuracy (the impact on accuracy is similar to using a ~10 A cutoff
>>> instead of ~12 A on lennard-jones interactions), as long as
>>> the _accumulation_ of forces is done in double precision.
>>> please keep in mind that the current GPU support in the NAMD
>>> release version is still based on CUDA 2.x, which predates
>>> fermi hardware.
>> I'm a bit confused by this, but would be quite happy if you were correct
>> (and Nicholas' response suggests you are). I gathered from the NAMD
>> website, and also from one of your discussions
>> (http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/9166.html) that
>> all of the code was set in double precision mode (as opposed to packages
> no, you are misreading my post. and that was about the CPU
> version anyway and an older version of NAMD. the GPU version
> definitely uses single precision for the force computation.
>
>> like AMBER or GROMACS where this can be changed around). So when you say
>> that "the force computation can be and is done in single precision" do you
>> mean that is the case for the NAMD code? If this is true then it would
>> explain why Nicholas sees the same degree of acceleration with a GTX460 as I
>> see with an M2070. But still, I find this at odds with other information
>> concerning NAMD and double precision.
> there is no conflict. GPU force kernels are different from CPU force
> kernels and only the non-bonded force computation is GPU accelerated
> in NAMD.
>
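A minimal numpy sketch of that mixed-precision idea (purely illustrative, not NAMD's actual kernel): the per-pair forces are evaluated and stored in single precision, and only the accumulator that sums many small contributions is promoted to double precision.

    import numpy as np

    # illustrative only: single-precision per-pair forces,
    # double-precision accumulation of the per-atom total
    rng = np.random.default_rng(0)
    pair_forces = rng.normal(scale=1e-3, size=(1_000_000, 3)).astype(np.float32)

    acc32 = pair_forces.sum(axis=0, dtype=np.float32)  # accumulate in single
    acc64 = pair_forces.sum(axis=0, dtype=np.float64)  # accumulate in double
    print("relative difference:", np.abs(acc32 - acc64) / np.abs(acc64))

Only the accumulation needs double precision, which is a tiny fraction of the arithmetic; that is why the reduced (1/8) double precision throughput of consumer cards is not a practical handicap for this kind of kernel.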
>>> first off, your assumptions about what kind of acceleration
>>> GPUs can provide relative to what you can achieve with
>>> current multi-core CPUs are highly exaggerated. you must
>>> not compare peak floating point performance.
>> To be fair, I wasn't comparing GPU to CPU when I quoted all those peak
>> performance numbers, those were GPU to GPU. The only CPU to GPU numbers I
> even those are not fair comparisons, since many factors
> affect performance, not just the peak performance. vendors
> like to push these numbers in your face, since this is often
> all that they have, but a practitioner always also asks: how
> well does it work in the end? how much more do i get done?
> in the current situation, GPUs can give you a boost when
> running on a desktop and help when you need high throughput
> (particularly through folding@home and similar projects),
> but for capability calculations, i.e. where you want to run
> as fast as possible and don't mind how many nodes or
> cpus you use, CPUs are still unbeaten. GPUs still need
> chunks of work that are too large to be efficient at that scale,
> so they cannot compete there. perhaps in a few years, when the
> integration of GPUs and memory into the CPU is complete,
> things may be different.
>
>> mentioned was a 4-5 fold improvement from adding 2 M2070s to a 6-core xeon,
>> and this is true in my case.
>>
>>> if you want
>>> to get the most bang for the buck, go with 4-way 8-core
>>> AMD magny-cours processor based machines (the new
>>> interlagos are priced temptingly and vendors are pushing
>>> them like crazy, but they don't provide as good performance
>>> with legacy binaries and require kernel upgrades for efficient
>>> scheduling and better cache handling). you can set up a
>>> kick-ass 32-core NAMD box for around $6,000 that will
>>> beat almost any combination of CPU/GPU hardware that
>>> you'll get for the same price and you won't have to worry
>>> about how to run well on that machine or whether an
>>> application is fully GPU accelerated.
>> Thanks for the information, that is really quite interesting. I suppose the
>> architecture here plays a very important role. I had compared against a CPU
> yes.
>
>> cluster (that was several years old), and found that my 6-core + 2 M2070
>> combination was the same speed as 96 CPU cores. But again, those CPUs were
>> a bit older and I can imagine that having each CPU be 8-core on its own (I
>> think this cluster was made of dual-core chips) would improve scaling.
> not scaling, but overall performance. many-core CPUs actually
> are a challenge to scaling, since you can quickly run into communication
> contention issues. this is why NAMD now has hybrid SMP binaries,
> and other codes (i am providing this for LAMMPS) have also moved
> to hybrid MPI+threading parallelization schemes for better scaling
> across larger numbers of multi-core CPUs. the nice thing about the
> 4-way 8-core CPU machines is that their overall performance is often
> sufficient for running routine NAMD ("capacity") calculations when
> using only a single node. so there is no need to purchase an expensive
> high-speed network; you don't even need racks, just a bunch of
> these workstations tucked away in a closet. for bigger jobs one
> should then move to proper clusters.
>
>>> bottom line: a) don't believe all the PR stuff that is thrown
>>> at people as to how much faster things can become; get
>>> first-hand experience instead, and then a lot of things look
>>> different. the IT industry has been able to fool people,
>>> and especially people in research who have limited
>>> technical experience, for decades.
>>> and b) remember that when something is technically
>>> possible, it doesn't immediately translate into being
>>> practically usable. any large scale scientific software
>>> package can easily take 10 years from its initial inception
>>> to becoming mature enough for widespread use. for
>>> any disruptive technology to be integrated into such
>>> a package you have to allow for at least half that amount
>>> of time until it reaches a similar degree of maturity.
>> Yes, I quite agree with a); the initial hype was that GPUs would give orders
>> of magnitude faster performance, and while they are certainly nice, it has
> well, they do - for certain problems and for the accelerated
> kernel only. but much of that added performance is lost
> when running real applications thanks to amdahl's law.
>
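A quick Amdahl's-law sketch of that point; the 80% accelerated fraction and the kernel speedups are hypothetical numbers for illustration only.

    # Amdahl's law: overall speedup when a fraction p of the runtime
    # is accelerated by a factor s
    def amdahl(p, s):
        return 1.0 / ((1.0 - p) + p / s)

    print(amdahl(0.80, 10))    # 80% of runtime 10x faster -> ~3.6x overall
    print(amdahl(0.80, 100))   # even 100x on the kernel   -> ~4.8x, capped below 5x

Even an order-of-magnitude kernel speedup turns into only a few-fold application speedup once the unaccelerated bonded terms, integration, and communication are counted.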
>> not been as mind-boggling as originally suggested. In terms of b), that is
>> a useful bit of information that I'll keep in mind. As I mostly just write
>> scripts to simplify my work, I've really no understanding of the workings of
>> making a large and complicated software package. Thanks a lot for the
>> responses, very informative, and I'd really like to know about this double
>> precision thing.
> that should be taken care of now, i hope.
>
> cheers,
> axel.
>
>> ~Aron
>>
>
>

-- 
*******************************
    Thomas C. Bishop
     Tel: 318-257-5209
     Fax: 318-257-3823
http://dna.engr.latech.edu
********************************
