Re: OpenCL and AMD GPUs

From: Aron Broom (broomsday_at_gmail.com)
Date: Wed Nov 23 2011 - 09:06:13 CST

@Nicholas,

> I probably missed your point here, but on a Kentsfield Q6600 quad and
> using the ApoA1 test we see x5 acceleration using a GTX460 card (going
> from 0.44 ns/day without CUDA, to 2.22 ns/day with CUDA). Going five times
> faster is definitely not 'practically useless'. Having said that, for the
> needs of 'high powered computing initiatives' I'm not qualified to speak.

That is more or less the improvement I see with two M2070 cards attached
to a hexa-core Xeon, which is quite surprising to me, and really quite
fantastic. What are the conditions of your system (size, electrostatic
cutoff, etc.)? If you can get equivalent acceleration from a consumer
card, that certainly negates my point.

@Axel,

> to get something implemented "soon" in a scientific software package,
> you have basically two options: do it yourself, get the money to
> hire somebody to do it. there is a severe limit of people capable
> (and willing) to do this kind of work and unless somebody gets paid
> for doing something, there are very limited chances to implement
> something that already exists in some other shape or form
> regardless how much sense it makes to redo it.

Yes, I see the point. I've found that NAMD has quite a lot of features,
and it's really quite impressive how well it compares with expensive
packages like AMBER. I know of at least one initiative to make a general
CUDA->OpenCL code converter: Swan, by Matt Harvey at Imperial College
London. I'm not sure it would work in this case, though, as a high degree
of optimization is clearly needed.

> why no improvement. with SHAKE+RATTLE you should be able to run with
> a 2fs time step. that is a serious improvement in my book.

Yes, it's odd. When I move from "rigidbonds water" to "rigidbonds all"
and go to a 2fs timestep, each iteration slows down enough that the
overall gain in ns/day is only ~10%. I'm not sure why the RATTLE
algorithm is so costly in this case. I was using 2.8b2, though, and I
haven't checked the release notes for b3 or the final 2.8, so maybe there
was an issue with RATTLE?
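
As a rough sanity check on those numbers, the sketch below works out how much longer each step would have to take for a doubled timestep to yield only ~10% more ns/day. It assumes the earlier runs used a 1fs timestep, which is implied by the discussion but not stated, and the function name is made up for illustration; none of this is NAMD code.

```python
# Back-of-envelope arithmetic only; not taken from NAMD.
def implied_step_cost_ratio(dt_old_fs, dt_new_fs, throughput_gain):
    """Return (new wall time per step) / (old wall time per step).

    ns/day is proportional to timestep / (wall time per step), so
    throughput_gain = (dt_new / dt_old) / (t_new / t_old).
    """
    return (dt_new_fs / dt_old_fs) / throughput_gain


# Assumed baseline: 1 fs timestep. Observed: ~10% more ns/day at 2 fs.
ratio = implied_step_cost_ratio(dt_old_fs=1.0, dt_new_fs=2.0, throughput_gain=1.10)
print(f"each step takes about {ratio:.2f}x as long")  # prints 1.82x
```

Under that assumption, each iteration costs roughly 80% more wall time once all bonds are constrained, which is what eats most of the benefit of the longer timestep.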

> this statement is not correct. having the 1/8th double precision
> capability is fully sufficient for classical MD. the force computation
> can be and is done in single precision with very little loss of
> accuracy (the impact to accuracy similar to using a ~10 A cutoff
> instead of ~12 A on lennard jones interactions) for as long as
> the _accumulation_ of forces is done in double precision.
> please keep in mind that the current GPU support in the NAMD
> release version is still based on CUDA 2.x, which predates
> fermi hardware.

I'm a bit confused by this, but would be quite happy if you were correct
(and Nicholas' response suggests you are). I gathered from the NAMD
website, and also from one of your discussions (
http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/9166.html) that
all of the code was set in double precision mode (as opposed to packages
like AMBER or GROMACS where this can be changed around). So when you say
that "the force computation can be and is done in single precision" do you
mean that is the case for the NAMD code? If this is true then it would
explain why Nicholas sees the same degree of acceleration with a GTX460 as
I see with an M2070. But still, I find this at odds with other information
concerning NAMD and double precision.
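
To make the quoted distinction concrete, here is a toy sketch of single-precision terms summed into a single-precision accumulator versus a double-precision one. It is not NAMD's code and says nothing about which mode NAMD actually uses; the data and function names are invented for illustration, and single precision is emulated by rounding Python floats through the `struct` module.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate_single(terms):
    total = 0.0
    for t in terms:
        total = to_f32(total + t)  # accumulator rounded to float32 every step
    return total


def accumulate_double(terms):
    total = 0.0
    for t in terms:
        total += t  # accumulator stays in double precision
    return total


# Hypothetical data: 200,000 identical single-precision "force contributions".
terms = [to_f32(0.1)] * 200_000
reference = math.fsum(terms)  # correctly rounded sum of the same terms

err_single = abs(accumulate_single(terms) - reference)
err_double = abs(accumulate_double(terms) - reference)
print(f"single-precision accumulator error: {err_single:.3g}")
print(f"double-precision accumulator error: {err_double:.3g}")
```

The individual terms are identical in both runs, so any difference between the two errors comes purely from the precision of the running sum, which is the part the quoted text says must be done in double.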

> first off, your assumptions about what kind of acceleration
> GPUs can provide relative to what you can achieve with
> current multi-core CPUs are highly exaggerated. you must
> not compare peak floating point performance.

To be fair, I wasn't comparing GPU to CPU when I quoted all those peak
performance numbers; those were GPU to GPU. The only CPU-to-GPU number I
mentioned was a 4-5 fold improvement from adding 2 M2070s to a 6-core
Xeon, and this is true in my case.

> if you want
> to get the most bang for the buck, go with 4-way 8-core
> AMD magny-cours processor based machines (the new
> interlagos are priced temptingly and vendors are pushing
> them like crazy, but they don't provide as good performance
> with legacy binaries and require kernel upgrades for efficient
> scheduling and better cache handling). you can set up a
> kick-ass 32-core NAMD box for around $6,000 that will
> beat almost any combination of CPU/GPU hardware that
> you'll get for the same price and you won't have to worry
> about how to run well on that machine or whether an
> application is fully GPU accelerated.

Thanks for the information, that is really quite interesting. I suppose
the architecture here plays a very important role. I had compared against
a CPU cluster that was several years old, and found that my combination of
6 cores and 2 M2070s was the same speed as 96 CPU cores. But again, those
CPUs were a bit older (I think the cluster was built from dual-core chips),
and I can imagine that having 8 cores on each CPU would improve scaling.

> bottom line: a) don't believe all the PR stuff that is thrown
> at people as to how much faster things can become, but
> get first hand experience and then a lot of things look
> different. the IT industry has been able to fool people,
> and especially people in research that have limited
> technical experience for decades.
> and b) remember that when something is technically
> possible, it doesn't immediately translate into being
> practically usable. any large scale scientific software
> package can easily take 10 years from its initial inception
> to becoming mature enough for widespread use. for
> any disruptive technology to be integrated into such
> a package you have to allow for at least half that amount
> of time until it reaches a similar degree of maturity.

Yes, I quite agree with a). The initial hype was that GPUs would give
orders of magnitude faster performance, and while they are certainly nice,
it has not been as mind-boggling as originally suggested. As for b), that
is a useful bit of information that I'll keep in mind. As I mostly just
write scripts to simplify my work, I really have no understanding of what
goes into building a large and complicated software package. Thanks a lot
for the responses, which were very informative. I'd still really like to
know about this double precision thing.

~Aron
