Re: OpenCL and AMD GPUs

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Wed Nov 23 2011 - 09:52:47 CST

On Wed, Nov 23, 2011 at 10:06 AM, Aron Broom <broomsday_at_gmail.com> wrote:
> @Nicholas,
>
>> I probably missed your point here, but on a Kentsfield Q6600 quad and
>> using the ApoA1 test we see x5 acceleration using a GTX460 card (going
>> from 0.44 ns/day without CUDA, to 2.22 ns/day with CUDA). Going five times
>> faster is definitely not 'practically useless'. Having said that, for the
>> needs of 'high powered computing initiatives' I'm not qualified to speak.
>
> That is actually more or less the improvement that I see with 2 M2070 cards
> attached to a hexa-core xeon.  That is quite surprising to me, and really
> quite fantastic.  What are the conditions of your system (size,
> electrostatic cutoff, etc)?  Certainly this completely negates my point if
> you can get equivalent acceleration from a consumer card.

there are three issues to consider:
1) consumer cards tend to have a higher clock on memory and GPU.
  for classical MD the memory access speed is pretty much dominant
  in terms of how fast MD can run. the best indicator for this is the
  performance of an MD code that runs entirely on the GPU, e.g. HOOMD
  http://codeblue.umich.edu/hoomd-blue/benchmarks.html

2) Tesla cards with ECC enabled take a (~10%?) performance hit
  in terms of memory access speed, which is dominating performance
  for high end hardware.

3) it matters what kind of PCI-e connection bandwidth you have.
  quite a few "GPU-ready" cluster nodes (to be equipped with M20x0
  GPUs) have only 8x PCI-e slots. for a code like NAMD that has
  to transfer a lot of information across the PCI-e bus in every time
  step, that matters a lot, particularly when oversubscribing the GPU.
  it gets worse for S2070 quad GPU nodes, since they add PCI-e
  bridges where two GPUs have to share a PCI-e slot in the host.
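
to put a rough number on point 3), here is a back-of-the-envelope
sketch. it assumes 16 bytes per atom in each direction (x, y, z plus
one extra single precision float) and the theoretical PCI-e 2.0 rate
of 500 MB/s per lane per direction. the per-atom payload and the
function name are my assumptions for illustration, not what NAMD
actually transfers. the atom count is that of the ApoA1 benchmark.

```python
# rough estimate of the per-step PCI-e transfer time.
# BYTES_PER_ATOM is an assumption, not taken from the NAMD sources.

BYTES_PER_ATOM = 16             # assumed: 4 single precision floats
PCIE2_BYTES_PER_LANE = 500e6    # PCI-e 2.0, per lane, per direction
APOA1_ATOMS = 92224             # size of the ApoA1 benchmark system


def transfer_ms(n_atoms, lanes):
    """time in ms to send coordinates down and fetch forces back,
    assuming the two transfers happen one after the other."""
    n_bytes = 2 * n_atoms * BYTES_PER_ATOM
    return 1e3 * n_bytes / (lanes * PCIE2_BYTES_PER_LANE)


for lanes in (16, 8):
    ms = transfer_ms(APOA1_ATOMS, lanes)
    print(f"x{lanes} slot: {ms:.2f} ms per time step")
```

under these assumptions an 8x slot costs twice the transfer time of
a 16x slot in every step, and real transfers will not reach the
theoretical rate, so the actual penalty can be larger.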

> @ Axel
>
>> to get something implemented "soon" in a scientific software package,
>> you have basically two options: do it yourself, get the money to
>> hire somebody to do it. there is a severe limit of people capable
>> (and willing) to do this kind of work and unless somebody gets paid
>> for doing something, there are very limited chances to implement
>> something that already exists in some other shape or form
>> regardless how much sense it makes to redo it.
>
> Yes, I see the point.  Actually I've found that NAMD has quite a lot of
> features, and it's really quite impressive how well it compares with
> expensive packages like AMBER.  I know of at least one initiative to make a

AMBER is *not* expensive. the performance of the amber MD codes
(sander and pmemd) suffers from having to carry so much legacy
around, since the amber package is much older than NAMD.
just have a look at CHARMm which has been around for so much
longer even. this is also one of the reasons why amber developers
get a higher speedup from using GPUs, since their CPU reference
is so much worse than the one in NAMD.

> general CUDA->OpenCL code converter, done by Matt Harvey at Imperial College
> London, called Swan, but I'm not sure it would work in this case as a high
> degree of optimization is clearly needed.

there are a ton of X->Y converters. none of them work perfectly.

the better approach is to write code in a way that it can be
easily retargeted. CUDA code (when used through the driver,
not the runtime interface) can be written and set up to look
very similar to OpenCL code and vice versa. one of the two
attempts to add GPU acceleration to LAMMPS, the one done
by mike brown at ORNL uses that approach:
http://users.nccs.gov/~wb8/geryon/index.htm
and we very recently succeeded in getting AMD hardware
(it was previously only tested with NVIDIA hardware) working
fairly well with OpenCL using the exact same GPU kernels and
mostly the same "glue code" plus the API abstraction from
his geryon headers.
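
the idea can be illustrated with a toy sketch. this is *not* the
geryon API. the class and function names are hypothetical, and the
two backends are stand-ins that only log what a real CUDA driver or
OpenCL backend would be asked to do. the point is only that the glue
code is written once against a common interface.

```python
# toy illustration of retargetable glue code. the backends below are
# hypothetical stand-ins: they log requests and never touch a GPU.

class Backend:
    name = "none"

    def __init__(self):
        self.log = []

    def compile(self, source, kernel):
        # a real backend would build the kernel source here
        self.log.append(f"{self.name}: compile {kernel}")
        return kernel

    def launch(self, kernel, n_items):
        # a real backend would enqueue the kernel here
        self.log.append(f"{self.name}: launch {kernel} on {n_items} items")


class CudaDriverBackend(Backend):
    name = "cuda-driver"


class OpenCLBackend(Backend):
    name = "opencl"


def run_pair_forces(backend, n_atoms):
    """glue code: identical no matter which backend is passed in."""
    kernel = backend.compile("/* shared kernel source */", "pair_force")
    backend.launch(kernel, n_atoms)
    return backend.log


for backend in (CudaDriverBackend(), OpenCLBackend()):
    print(run_pair_forces(backend, 1000))
```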

>
>> why no improvement. with SHAKE+RATTLE you should be able to run with
>> a 2fs time step. that is a serious improvement in my book.
>
> Yes, it's odd.  When I move from "rigidbonds water" to "rigid bonds all",
> and then go to a 2fs timestep, the speed per iteration slows down enough
> that the overall benefit in terms of ns/day is ~10%.  I'm not sure why the
> RATTLE algorithm is so costly in this case, it was using 2.8b2 though, and I
> haven't checked the release notes for the final 2.8 or b3, maybe there was
> an issue with RATTLE?.

perhaps 2fs is requiring too many constraint iterations to achieve
the desired accuracy. i would try 1.5fs then. the problem with
shake is that it requires additional communication for
each iteration and that may be a limiting factor, as your hardware
seems to be limited in I/O bandwidth (which is a bad thing if
you are after getting good GPU speedup, BTW).
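
to see how much slower each step must have become, the arithmetic
can be sketched as below. the 10 ms per step baseline is a made-up
number for illustration; only the 1fs -> 2fs change and the ~10%
gain come from your description.

```python
# throughput arithmetic for a longer time step with costlier steps.
# the baseline step time is an assumption chosen for illustration.

def ns_per_day(timestep_fs, seconds_per_step):
    """simulated nanoseconds per day of wall clock time."""
    steps_per_day = 86400.0 / seconds_per_step
    return timestep_fs * steps_per_day * 1e-6   # fs -> ns


def implied_cost_ratio(timestep_ratio, throughput_gain):
    """how much more expensive each step became, given the time step
    ratio and the observed gain in ns/day."""
    return timestep_ratio / throughput_gain


base = ns_per_day(1.0, 0.010)            # assumed 10 ms per step
ratio = implied_cost_ratio(2.0, 1.10)    # 2x time step, ~10% gain
longer = ns_per_day(2.0, 0.010 * ratio)

print(f"cost per step grew by a factor of {ratio:.2f}")
print(f"{base:.2f} ns/day -> {longer:.2f} ns/day")
```

so a 2fs step that yields only ~10% more ns/day implies that each
step takes roughly 1.8 times as long as before.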

>> this statement is not correct. having the 1/8th double precision
>> capability is fully sufficient for classical MD. the force computation
>> can be and is done in single precision with very little loss of
>> accuracy (the impact on accuracy is similar to using a ~10 A cutoff
>> instead of ~12 A on lennard jones interactions) for as long as
>> the _accumulation_ of forces is done in double precision.
>> please keep in mind that the current GPU support in the NAMD
>> release version is still based on CUDA 2.x, which predates
>> fermi hardware.
>
> I'm a bit confused by this, but would be quite happy if you were correct
> (and Nicholas' response suggests you are).  I gathered from the NAMD
> website, and also from one of your discussions
> (http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/9166.html) that
> all of the code was set in double precision mode (as opposed to packages

no, you are misreading my post. and that was about the CPU
version anyway and an older version of NAMD. the GPU version
definitely uses single precision for the force computation.

> like AMBER or GROMACS where this can be changed around).  So when you say
> that "the force computation can be and is done in single precision" do you
> mean that is the case for the NAMD code?  If this is true then it would
> explain why Nicholas sees the same degree of acceleration with a GTX460 as I
> see with an M2070.  But still, I find this at odds with other information
> concerning NAMD and double precision.

there is no conflict. GPU force kernels are different from CPU force
kernels and only the non-bonded force computation is GPU accelerated
in NAMD.
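
why the precision of the _accumulation_ matters can be shown with a
toy sketch. this is not NAMD code. it emulates single precision on
the CPU by rounding through the struct module, and it uses one
constant "force contribution" of 0.1, which is an arbitrary choice
for illustration.

```python
# toy comparison: summing single precision contributions into a
# single precision vs. a double precision accumulator.
import math
import struct


def f32(x):
    """round a python float (a double) to the nearest single
    precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(n, value=0.1):
    contrib = f32(value)        # the contribution itself is single
    acc_single = 0.0
    acc_double = 0.0
    for _ in range(n):
        acc_single = f32(acc_single + contrib)  # rounded every add
        acc_double += contrib                   # kept in double
    exact = math.fsum([contrib] * n)            # reference sum
    return acc_single, acc_double, exact


single, double, exact = accumulate(100000)
print("relative error, single accumulator:", abs(single - exact) / exact)
print("relative error, double accumulator:", abs(double - exact) / exact)
```

the inputs are identical single precision numbers in both cases.
only the accumulator differs, and that alone decides how much
rounding error piles up in the sum.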

>> first off, your assumptions about what kind of acceleration
>> GPUs can provide relative to what you can achieve with
>> current multi-core CPUs are highly exaggerated. you must
>> not compare peak floating point performance.
>
> To be fair, I wasn't comparing GPU to CPU when I quoted all those peak
> performance numbers, those were GPU to GPU.  The only CPU to GPU numbers I

even those are not fair comparisons, since many factors
affect performance, not just the peak performance. vendors
like to push these numbers in your face, since this is often
all that they have, but a practitioner always also asks, how
well does it work at the end? how much more do i get done?
in the current situation, GPU can give you a boost when
running on a desktop and help when you need high throughput
(particularly through folding@home and similar projects),
but for capability calculations, i.e. where you want to run
as fast as possible and don't mind how many nodes or
cpus you use, CPUs are still unbeaten. GPUs still need
too large chunks of work to be efficient to compete with that.
perhaps in a few years when the integration of GPUs and
memory into the CPU is complete things may be different.

> mentioned was a 4-5 fold improvement from adding 2 M2070s to a 6-core xeon,
> and this is true in my case.
>
>> if you want
>> to get the most bang for the buck, go with 4-way 8-core
>> AMD magny-cours processor based machines (the new
>> interlagos are priced temptingly and vendors are pushing
>> them like crazy, but they don't provide as good performance
>> with legacy binaries and require kernel upgrades for efficient
>> scheduling and better cache handling). you can set up a
>> kick-ass 32-core NAMD box for around $6,000 that will
>> beat almost any combination of CPU/GPU hardware that
>> you'll get for the same price and you won't have to worry
>> about how to run well on that machine or whether an
>> application is fully GPU accelerated.
>
> Thanks for the information, that is really quite interesting.  I suppose the
> architecture here plays a very important role.  I had compared against a CPU

yes.

> cluster (that was several years old), and found that my 6-core 2 M2070
> combination was the same speed as 96-CPU-cores.  But again, those CPUs were
> a bit older and I can imagine that having each CPU be 8-core on its own (I
> think this cluster was of dual-core chips) would improve scaling.

not scaling, but overall performance. many-core CPUs actually
are a challenge to scaling, since you can quickly run into communication
contention issues. this is why NAMD has now hybrid SMP binaries
and also other codes (i am providing this for LAMMPS) have moved
to hybrid MPI+Threading parallelization schemes for better scaling
across larger numbers of multi-core CPUs. the nice thing about the
4-way 8-core CPU machines is, that their overall performance is often
sufficient for running routine NAMD ("capacity") calculations when
using only a single node. so no need to purchase an expensive
high-speed network, you don't even need racks. just a bunch of
these workstations tucked away in a closet. for bigger jobs one
should then move to proper clusters.

>> bottom line: a) don't believe all the PR stuff that is thrown
>> at people as to how much faster things can become, but
>> get first hand experience and then a lot of things look
>> different. the IT industry has been able to fool people,
>> and especially people in research that have limited
>> technical experience for decades.
>> and b) remember that when something is technically
>> possible, it doesn't immediately translate into being
>> practically usable. any large scale scientific software
>> package can easily take 10 years from its initial inception
>> to becoming mature enough for widespread use. for
>> any disruptive technology to be integrated into such
>> a package you have to allow for at least half that amount
>> of time until it reaches a similar degree of maturity.
>
> Yes, I quite agree with a), the initial hype was that GPUs would give orders
> of magnitude faster performance, and while they are certainly nice, it has

well, they do - for certain problems and for the accelerated
kernel only. but much of that added performance is lost
when running real applications thanks to amdahl's law.
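
for reference, amdahl's law in a few lines. the 90% accelerated
fraction and the 100x kernel speedup are made-up numbers for
illustration, not measurements of NAMD.

```python
# amdahl's law: overall speedup when only part of the run time
# is accelerated.

def amdahl_speedup(accel_fraction, kernel_speedup):
    """accel_fraction: share of the original run time spent in the
    accelerated kernel; kernel_speedup: speedup of that kernel."""
    serial_part = 1.0 - accel_fraction
    return 1.0 / (serial_part + accel_fraction / kernel_speedup)


# assumed example: 90% of the time in the kernel, kernel 100x faster
print(f"overall speedup: {amdahl_speedup(0.9, 100.0):.2f}x")

# even an infinitely fast kernel cannot beat 1 / (1 - 0.9) = 10x
print(f"upper bound:     {1.0 / (1.0 - 0.9):.2f}x")
```

with these numbers a 100x faster kernel gives only about 9x for
the whole application.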

> not been as mind-boggling as originally suggested.  In terms of b), that is
> a useful bit of information that I'll keep in mind.  As I mostly just write
> scripts to simplify my work I've really no understanding of the workings of
> making a large and complicated software package.  Thanks a lot for the
> responses, very informative, and I'd really like to know about this double
> precision thing.

that should be taken care of now, i hope.

cheers,
     axel.

> ~Aron
>

-- 
Dr. Axel Kohlmeyer
akohlmey_at_gmail.com  http://goo.gl/1wk0
College of Science and Technology
Temple University, Philadelphia PA, USA.
