Re: OpenCL and AMD GPUs

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Wed Nov 23 2011 - 10:23:06 CST

On Wed, Nov 23, 2011 at 11:09 AM, Thomas Bishop <bishop_at_latech.edu> wrote:
> Axel,
> twice you have mentioned 4-way 8core systems.
> Why not  4-way 16core AMD systems w/out any GPUS.
> (this makes a nice farm instead of a cluster)

because those interlagos CPUs have a different
architecture, and you won't get the expected
performance unless applications are aware of it.
they look good on paper, but apparently they need
a kernel tweak to make the cpu caches work
more effectively, and you need a new-ish compiler
to take advantage of the 4-way fused multiply-add
and many other details. also the hardware topology
is different. unlike most vendor sales droids,
i only give recommendations on hardware that
i know about from personal experience with
the application at hand.
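
just to illustrate the compiler point (a toy example, not NAMD code, and
the exact flags are only an example): a loop like the one below only
benefits from the new hardware if the compiler is recent enough to fuse
the multiply and add, e.g. gcc with -O3 -mfma4 or -march=bdver1.

#include <stdio.h>

/* toy velocity update: v += f * (dt/m). with a new-ish compiler and the
 * right target flags, the multiply and the add in the inner loop can be
 * fused into a single hardware fused multiply-add instruction. */
int main(void)
{
    double v[1000], f[1000];
    const double dt_over_m = 2.0e-15 / 12.0;      /* made-up numbers */
    for (int i = 0; i < 1000; ++i) { v[i] = 0.0; f[i] = (double) i; }
    for (int i = 0; i < 1000; ++i)
        v[i] += f[i] * dt_over_m;                 /* candidate for FMA */
    printf("%g\n", v[999]);
    return 0;
}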

i also mention the 8-core magny-cours instead
of the 12-core since the 8-core ones are much
cheaper _and_ run at a higher clock, while both have
the same memory bandwidth (which is - BTW -
also the same for the 16-core interlagos, so
they will be similarly restricted, since MD does
require a certain amount of memory bandwidth
for traversing the neighbor lists all the time. after
all, the math is designed to be cheap; otherwise
we would all be using morse potentials instead
of lennard-jones...).
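
to make the "cheap math" point concrete, here is a toy comparison (my own
sketch, not code from any MD package): a lennard-jones pair energy can be
evaluated from r^2 alone with a handful of multiplications, while a morse
potential needs a square root and an exponential for every pair, which is
exactly the kind of per-pair cost that classical force fields avoid.

#include <math.h>
#include <stdio.h>

/* lennard-jones pair energy: 4*eps*((sigma/r)^12 - (sigma/r)^6).
 * only needs r^2, so just a few multiplications per pair. */
static double lj_energy(double r2, double eps, double sigma)
{
    double sr2 = sigma * sigma / r2;
    double sr6 = sr2 * sr2 * sr2;
    return 4.0 * eps * (sr6 * sr6 - sr6);
}

/* morse pair energy: D*(1 - exp(-a*(r - r0)))^2.
 * needs sqrt() and exp() per pair, i.e. much more expensive math. */
static double morse_energy(double r2, double D, double a, double r0)
{
    double x = 1.0 - exp(-a * (sqrt(r2) - r0));
    return D * x * x;
}

int main(void)
{
    double r2 = 3.4 * 3.4;   /* made-up pair distance squared */
    printf("LJ: %g  morse: %g\n",
           lj_energy(r2, 0.1, 3.4), morse_energy(r2, 0.1, 1.0, 3.8));
    return 0;
}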

so at the moment, those new CPU machines
are a big unknown, which may prove to be
a worthy option in the future. but if you are
on a budget, there are two rules that you
have to obey. 1) never go for the top of the
line model and 2) never trust a vendor
recommendation. they don't care how something
holds up in practical use, because none
of them actually runs applications or has
the time to do those tests.

axel.

>
> But would 40Gbps IB keep up w/ such nodes
> or are there  other considerations w/ 4-way 16core nodes?
>
> Thanks
> Tom
>
> On 11/23/2011 09:52 AM, Axel Kohlmeyer wrote:
>>
>> On Wed, Nov 23, 2011 at 10:06 AM, Aron Broom<broomsday_at_gmail.com>  wrote:
>>>
>>> @Nicholas,
>>>
>>>> I probably missed your point here, but on a Kentsfield Q6600 quad and
>>>> using the ApoA1 test we see x5 acceleration using a GTX460 card (going
>>>> from 0.44 ns/day without CUDA, to 2.22 ns/day with CUDA). Going five
>>>> times
>>>> faster is definitely not 'practically useless'. Having said that, for
>>>> the
>>>> needs of 'high powered computing initiatives' I'm not qualified to
>>>> speak.
>>>
>>> That is actually more or less the improvement that I see with 2 M2070
>>> cards
>>> attached to a hexa-core xeon.  That is quite surprising to me, and really
>>> quite fantastic.  What are the conditions of your system (size,
>>> electrostatic cutoff, etc)?  Certainly this completely negates my point
>>> if
>>> you can get equivalent acceleration from a consumer card.
>>
>> there are three issues to consider:
>> 1) consumer cards tend to have a higher clock on memory and GPU.
>>   for classical MD the memory access speed is pretty much dominant
>>   in terms of how fast MD can run. the best indicator for this is the
>>   performance of an MD code that runs entirely on the GPU, e.g. HOOMD
>>   http://codeblue.umich.edu/hoomd-blue/benchmarks.html
>>
>> 2) Tesla cards with ECC enabled take a (~10%?) performance hit
>>   in terms of memory access speed, which is dominating performance
>>   for high end hardware.
>>
>> 3) it matters what kind of PCI-e connection bandwidth you have.
>>   quite a few "GPU-ready" cluster nodes (to be equipped with M20x0
>>   GPUs) have only 8x PCI-e slots. for a code like NAMD that has
>>   to transfer a lot of information across the PCI-e bus in every time
>>   step, that matters a lot, particularly when oversubscribing the GPU.
>>   it gets worse for S2070 quad GPU nodes, since they add PCI-e
>>   bridges where two GPUs have to share a PCI-e slot in the host.
>>
>>> @ Axel
>>>
>>>> to get something implemented "soon" in a scientific software package,
>>>> you have basically two options: do it yourself, get the money to
>>>> hire somebody to do it. there is a severe limit of people capable
>>>> (and willing) to do this kind of work and unless somebody gets paid
>>>> for doing something, there are very limited chances to implement
>>>> something that already exists in some other shape or form
>>>> regardless how much sense it makes to redo it.
>>>
>>> Yes, I see the point.  Actually I've found that NAMD has quite a lot of
>>> features, and it's really quite impressive how well it compares with
>>> expensive packages like AMBER.  I know of at least one initiative to make
>>> a
>>
>> AMBER is *not* expensive. the performance of the amber MD codes
>> (sander and pmemd) suffers from having to carry so much legacy
>> around, since the amber package is much older than NAMD.
>> just have a look at CHARMm, which has been around for even
>> longer. this is also one of the reasons why amber developers
>> get a higher speedup from using GPUs, since their CPU reference
>> is so much worse than the one in NAMD.
>>
>>> general CUDA->OpenCL code converter, done by Matt Harvey at Imperial
>>> College
>>> London, called Swan, but I'm not sure it would work in this case as a
>>> high
>>> degree of optimization is clearly needed.
>>
>> there are a ton of X->Y converters. none of them work perfectly.
>>
>> the better approach is to write code in a way that it can be
>> easily retargeted. CUDA code (when used through the driver,
>> not the runtime interface) can be written and set up to look
>> very similar to OpenCL code and vice versa. one of the two
>> attempts to add GPU acceleration to LAMMPS, the one done
>> by mike brown at ORNL uses that approach:
>> http://users.nccs.gov/~wb8/geryon/index.htm
>> and we very recently succeeded in getting AMD hardware
>> (it was previously only tested with NVIDIA hardware) working
>> fairly well with OpenCL, using the exact same GPU kernels,
>> mostly the same "glue code", plus the API abstraction from
>> his geryon headers.
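
to give a flavor of what "written so it can be retargeted" means, here is
a minimal sketch of the idea (this is *not* geryon, and the wrapper names
are made up): the host-side resource handling goes behind a tiny API, so
the calling code at the bottom is identical whether you build against the
CUDA driver API or against OpenCL. error checking is omitted for brevity.

#include <stdio.h>

#ifdef USE_OPENCL
#include <CL/cl.h>
static cl_context ctx;
static cl_command_queue queue;
typedef cl_mem devbuf_t;

static void dev_init(void)
{
    cl_platform_id plat; cl_device_id dev; cl_int err;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    queue = clCreateCommandQueue(ctx, dev, 0, &err);
}
static devbuf_t dev_malloc(size_t n)
{
    cl_int err;
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE, n, NULL, &err);
}
static void dev_upload(devbuf_t d, const void *h, size_t n)
{
    clEnqueueWriteBuffer(queue, d, CL_TRUE, 0, n, h, 0, NULL, NULL);
}
static void dev_free(devbuf_t d) { clReleaseMemObject(d); }

#else   /* CUDA driver API */
#include <cuda.h>
static CUcontext ctx;
typedef CUdeviceptr devbuf_t;

static void dev_init(void)
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
}
static devbuf_t dev_malloc(size_t n)
{
    CUdeviceptr d;
    cuMemAlloc(&d, n);
    return d;
}
static void dev_upload(devbuf_t d, const void *h, size_t n)
{
    cuMemcpyHtoD(d, h, n);
}
static void dev_free(devbuf_t d) { cuMemFree(d); }
#endif

int main(void)
{
    float x[256];
    for (int i = 0; i < 256; ++i) x[i] = (float) i;
    dev_init();
    devbuf_t buf = dev_malloc(sizeof(x));   /* same call for both back ends */
    dev_upload(buf, x, sizeof(x));
    dev_free(buf);
    printf("uploaded %zu bytes\n", sizeof(x));
    return 0;
}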
>>
>>>> why no improvement. with SHAKE+RATTLE you should be able to run with
>>>> a 2fs time step. that is a serious improvement in my book.
>>>
>>> Yes, it's odd.  When I move from "rigidbonds water" to "rigid bonds all",
>>> and then go to a 2fs timestep, the speed per iteration slows down enough
>>> that the overall benefit in terms of ns/day is ~10%.  I'm not sure why
>>> the
>>> RATTLE algorithm is so costly in this case, it was using 2.8b2 though,
>>> and I
>>> haven't checked the release notes for the final 2.8 or b3, maybe there
>>> was
>>> an issue with RATTLE?
>>
>> perhaps 2fs is requiring too many constraint iterations to achieve
>> the desired accuracy. i would try 1.5fs then. the problem with
>> shake is that it requires additional communication for
>> each iteration and that may be a limiting factor, as your hardware
>> seems to be limited in I/O bandwidth (which is a bad thing if
>> you are after getting good GPU speedup, BTW).
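
if you want to experiment with that, the relevant knobs in the NAMD config
file would look something like the fragment below (just a sketch with
example values; check the user's guide for your NAMD version before
relying on any of them):

# constraint / timestep settings to play with
timestep          1.5        ;# in fs; try this before going all the way to 2.0
rigidBonds        all        ;# constrain all bonds involving hydrogen
rigidTolerance    1.0e-8     ;# convergence criterion for the constraint iterations
rigidIterations   100        ;# maximum number of SHAKE/RATTLE iterations
useSettle         on         ;# analytic SETTLE for water instead of iterating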
>>
>>>> this statement is not correct. having the 1/8th double precision
>>>> capability is fully sufficient for classical MD. the force computation
>>>> can be and is done in single precision with very little loss of
>>>> accuracy (the impact on accuracy is similar to using a ~10 A cutoff
>>>> instead of ~12 A on lennard-jones interactions) as long as
>>>> the _accumulation_ of forces is done in double precision.
>>>> please keep in mind that the current GPU support in the NAMD
>>>> release version is still based on CUDA 2.x, which predates
>>>> fermi hardware.
>>>
>>> I'm a bit confused by this, but would be quite happy if you were correct
>>> (and Nicholas' response suggests you are).  I gathered from the NAMD
>>> website, and also from one of your discussions
>>> (http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/9166.html) that
>>> all of the code was set in double precision mode (as opposed to packages
>>
>> no, you are misreading my post. and that was about the CPU
>> version anyway and an older version of NAMD. the GPU version
>> definitely uses single precision for the force computation.
>>
>>> like AMBER or GROMACS where this can be changed around).  So when you say
>>> that "the force computation can be and is done in single precision" do
>>> you
>>> mean that is the case for the NAMD code?  If this is true then it would
>>> explain why Nicholas sees the same degree of acceleration with a GTX460
>>> as I
>>> see with an M2070.  But still, I find this at odds with other information
>>> concerning NAMD and double precision.
>>
>> there is no conflict. GPU force kernels are different from CPU force
>> kernels and only the non-bonded force computation is GPU accelerated
>> in NAMD.
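
in case the precision point is still unclear, the scheme looks like this in
a couple of lines (a generic sketch, not the actual NAMD kernel): the
per-pair force is computed in single precision, and only the per-atom
accumulation, where rounding errors from many neighbors would pile up, is
done in double precision.

#include <stdio.h>

/* generic sketch of mixed-precision force accumulation (not NAMD code):
 * per-pair math in float, the running per-atom sum in double. */
int main(void)
{
    const int npairs = 1000000;
    double total_force = 0.0;           /* double precision accumulator */
    for (int k = 0; k < npairs; ++k) {
        /* stand-in for a real single-precision pair-force evaluation */
        float pair_force = 1.0f / (float)(k + 1);
        total_force += (double) pair_force;   /* accumulate in double */
    }
    printf("accumulated force: %.10f\n", total_force);
    return 0;
}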
>>
>>>> first off, your assumptions about what kind of acceleration
>>>> GPUs can provide relative to what you can achieve with
>>>> current multi-core CPUs are highly exaggerated. you must
>>>> not compare peak floating point performance.
>>>
>>> To be fair, I wasn't comparing GPU to CPU when I quoted all those peak
>>> performance numbers, those were GPU to GPU.  The only CPU to GPU numbers
>>> I
>>
>> even those are not fair comparisons, since many factors
>> affect performance, not just the peak performance. vendors
>> like to push these numbers in your face, since this is often
>> all that they have, but a practitioner always also asks, how
>> well does it work at the end? how much more do i get done?
>> in the current situation, GPUs can give you a boost when
>> running on a desktop and help when you need high throughput
>> (particularly through folding_at_home and similar projects),
>> but for capability calculations, i.e. where you want to run
>> as fast as possible and don't mind how many nodes or
>> cpus you use, CPUs are still unbeaten. GPUs need too large
>> a chunk of work to run efficiently, so they cannot compete there.
>> perhaps in a few years, when the integration of GPUs and
>> memory into the CPU is complete, things may be different.
>>
>>> mentioned was a 4-5 fold improvement from adding 2 M2070s to a 6-core
>>> xeon,
>>> and this is true in my case.
>>>
>>>> if you want
>>>> to get the most bang for the buck, go with 4-way 8-core
>>>> AMD magny-cours processor based machines (the new
>>>> interlagos are priced temptingly and vendors are pushing
>>>> them like crazy, but they don't provide as good performance
>>>> with legacy binaries and require kernel upgrades for efficient
>>>> scheduling and better cache handling). you can set up a
>>>> kick-ass 32-core NAMD box for around $6,000 that will
>>>> beat almost any combination of CPU/GPU hardware that
>>>> you'll get for the same price and you won't have to worry
>>>> about how to run well on that machine or whether an
>>>> application is fully GPU accelerated.
>>>
>>> Thanks for the information, that is really quite interesting.  I suppose
>>> the
>>> architecture here plays a very important role.  I had compared against a
>>> CPU
>>
>> yes.
>>
>>> cluster (that was several years old), and found that my 6-core 2 M2070
>>> combination was the same speed as 96-CPU-cores.  But again, those CPUs
>>> were
>>> a bit older and I can imagine that having each CPU be 8-core on its own
>>> (I
>>> think this cluster was of dual-core chips) would improve scaling.
>>
>> not scaling, but overall performance. many-core CPUs actually
>> are a challenge to scaling, since you can quickly run into communication
>> contention issues. this is why NAMD now has hybrid SMP binaries,
>> and other codes (i am providing this for LAMMPS) have moved
>> to hybrid MPI+threading parallelization schemes for better scaling
>> across larger numbers of multi-core CPUs. the nice thing about the
>> 4-way 8-core CPU machines is that their overall performance is often
>> sufficient for running routine NAMD ("capacity") calculations when
>> using only a single node. so no need to purchase an expensive
>> high-speed network; you don't even need racks, just a bunch of
>> these workstations tucked away in a closet. for bigger jobs one
>> should then move to proper clusters.
>>
>>>> bottom line: a) don't believe all the PR stuff that is thrown
>>>> at people as to how much faster things can become, but
>>>> get first hand experience and then a lot of things look
>>>> different. the IT industry has been able to fool people,
>>>> and especially people in research that have limited
>>>> technical experience for decades.
>>>> and b) remember that when something is technically
>>>> possible, it doesn't immediately translate into being
>>>> practically usable. any large scale scientific software
>>>> package can easily take 10 years from its initial inception
>>>> to becoming mature enough for widespread use. for
>>>> any disruptive technology to be integrated into such
>>>> a package you have to allow for at least half that amount
>>>> of time until it reaches a similar degree of maturity.
>>>
>>> Yes, I quite agree with a), the initial hype was that GPUs would give
>>> orders
>>> of magnitude faster performance, and while they are certainly nice, it
>>> has
>>
>> well, they do - for certain problems and for the accelerated
>> kernel only. but much of that added performance is lost
>> when running real applications thanks to amdahl's law.
>>
>>> not been as mind-boggling as originally suggested.  In terms of b), that
>>> is
>>> a useful bit of information that I'll keep in mind.  As I mostly just
>>> write
>>> scripts to simplify my work I've really no understanding of the workings
>>> of
>>> making a large and complicated software package.  Thanks a lot for the
>>> responses, very informative, and I'd really like to know about this
>>> double
>>> precision thing.
>>
>> that should be taken care of now, i hope.
>>
>> cheers,
>>      axel.
>>
>>> ~Aron
>>>
>>
>>
>
>
> --
> *******************************
>   Thomas C. Bishop
>    Tel: 318-257-5209
>    Fax: 318-257-3823
> http://dna.engr.latech.edu
> ********************************
>
>

-- 
Dr. Axel Kohlmeyer
akohlmey_at_gmail.com  http://goo.gl/1wk0
College of Science and Technology
Temple University, Philadelphia PA, USA.
