Re: OpenCL and AMD GPUs

From: Axel Kohlmeyer
Date: Wed Nov 23 2011 - 05:07:52 CST

On Tue, Nov 22, 2011 at 11:04 PM, Aron Broom wrote:
> I'd like to present an idea for a future feature for NAMD, support for
> OpenCL.  I think this is already being considered to some extent, but I want
> to really show the full value of this.
> My understanding is that at the moment GPU acceleration of non-bonded force
> calculations only takes place using CUDA, and that in general this means
> nVidia GPU cards.  I searched through the mailing list archives and couldn't
> find much discussion on this topic, although there was a post from VMD
> developer John Stone
> suggesting that OpenCL needs to mature a bit more before it will be standard
> between devices and easily implemented.  I'd like to make an argument for
> why it might be extremely worthwhile to get it working soon.


to get something implemented "soon" in a scientific software package,
you have basically two options: do it yourself, or get the money to
hire somebody to do it. there is a severe shortage of people capable
of (and willing to do) this kind of work, and unless somebody gets paid
for it, there is very little chance that anybody will implement
something that already exists in some other shape or form,
regardless of how much sense it makes to redo it.

> I've been recently running NAMD 2.8 on some nVidia M2070 boxes.  The
> objective thus far has been free energy determination using various
> methods.  At the moment, using an intel xeon 6-core CPU, I get 0.65 ns/day
> for a 100k atom system using PME and an electrostatic cutoff of 14 angstroms
> with a timestep of 1fs and rigid waters (making all bonds rigid and using
> SHAKE or RATTLE does not offer a real improvement in my case).  Adding in 2

why no improvement? with SHAKE+RATTLE you should be able to run with
a 2fs time step. that is a serious improvement in my book.
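
as a rough sketch of that arithmetic (python, purely illustrative):
throughput in ns/day is the number of steps per day times the time step,
so if the cost per step stays about the same, going from 1fs to 2fs
doubles the simulated time. the function names are made up for this
example, and "constraints add no per-step cost" is an assumption, not a
measurement.

```python
FS_PER_NS = 1.0e6  # femtoseconds per nanosecond


def steps_per_day(ns_per_day, timestep_fs):
    """Number of MD steps per day implied by a throughput and a time step."""
    return ns_per_day * FS_PER_NS / timestep_fs


def ns_per_day(steps, timestep_fs):
    """Simulated nanoseconds per day for a given step count and time step."""
    return steps * timestep_fs / FS_PER_NS


# 0.65 ns/day at 1 fs is the CPU-only figure quoted in the post.
steps = steps_per_day(0.65, 1.0)

# ASSUMPTION: the same number of steps per day is reachable at 2 fs,
# i.e. the extra work for the constraints is negligible.
projected = ns_per_day(steps, 2.0)

print(f"steps/day at 1 fs: {steps:.0f}")
print(f"projected ns/day at 2 fs: {projected:.2f}")
```

in practice the constraint solver costs a little, so the real gain
will be somewhat less than a clean factor of two.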

> nVidia M2070s to that mix increases performance to 2.54 ns/day, a 4-fold
> improvement (1.94 ns/day with 1 M2070).  This is quite nice, but the cost of
> an M2070 or the newer M2090 is ~$3000.

yes. using tesla cards in these cases is a serious waste of money,
unless you have to run a *lot* of them and the better management
features and quality control help you cut down on human cost.
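
to put numbers on that, here is a small sketch (python) that uses only
the figures quoted above: 0.65, 1.94 and 2.54 ns/day and ~$3000 per
card. the variable names are made up for illustration, and the price
is the approximate one from the post, not a real quote.

```python
# Throughput figures quoted in the post, in ns/day.
cpu_only = 0.65   # 6-core xeon alone
one_gpu = 1.94    # xeon + 1 M2070
two_gpus = 2.54   # xeon + 2 M2070

card_price = 3000.0  # approximate US$ per M2070, as quoted

# Speedup relative to the CPU-only run.
speedup_one = one_gpu / cpu_only
speedup_two = two_gpus / cpu_only

# What each card adds on top of the previous configuration.
gain_first = one_gpu - cpu_only
gain_second = two_gpus - one_gpu

print(f"speedup with 1 GPU: {speedup_one:.2f}x")
print(f"speedup with 2 GPUs: {speedup_two:.2f}x")
print(f"first card:  +{gain_first:.2f} ns/day, "
      f"${card_price / gain_first:.0f} per extra ns/day")
print(f"second card: +{gain_second:.2f} ns/day, "
      f"${card_price / gain_second:.0f} per extra ns/day")
```

the second card adds less than half of what the first one does, so
the price per extra ns/day roughly doubles.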

> Now, the consumer graphics cards that are based on the same Fermi chip (i.e.
> nVidia GTX 580) have the same number of processor cores and should be just
> as fast as the M2090, but only cost ~$500.  Of course these consumer cards

they are _faster_ since their GPUs have higher clocks and faster memory.

> work fine with NAMD as they are fully CUDA supported, but there are 3
> problems, 2 of which are minor, while the 3rd is catastrophic.  The first is
> that the memory available on a GTX 580 is 1.5 GB compared with 6 GB on the
> M2090.  This doesn't actually matter that much for a large portion of NAMD
> tasks.  For instance my 100k system (which I think is about middle for
> system sizes these days) uses less than 1GB of memory on the GPU.  The

actually, there are 3GB versions of the GTX 580 and the extra memory
is not that much more expensive.

> second problem is the lack of error correcting code (ECC) on the GTXs.  I'm
> going to contend that for NAMD this actually isn't that critical.  NAMD uses

yes. classical MD is a "conservative method", small random errors will
push you into a different but equivalent state out of very many and your
sampling will be equivalent. also, using tools like cuda_memtest, you can
do a pretty decent and automated sanity check before launching a GPU
job in data center conditions.

> double precision floating point values in order to avoid accumulating errors
> from rounding and this is quite critical for getting the right answer from a
> long MD simulation.  By contrast a flipped bit in memory will cause a
> singular error in the simulation (resulting in an incorrect velocity most
> likely), which, thanks to thermostats will be attenuated as the simulation
> progresses rather than being built upon (but you could argue against me
> here), and since we generally want to do the final production simulations in
> replicate, it matters even less.  The last problem, the real one, is that
> nVidia, realizing that we might not want to spend 6 times the price just for
> the extra memory and ECC, has artificially reduced the double precision
> floating point performance of the consumer cards, from being 1/2 of the
> single precision value in the M20xx series, to 1/8 in the GTX cards.
> This means that
> these cards are practically useless for NAMD (thereby forcing high powered
> computing initiatives to purchase M20xx cards).

this statement is not correct. having the 1/8th double precision
capability is fully sufficient for classical MD. the force computation
can be and is done in single precision with very little loss of
accuracy (the impact on accuracy is similar to using a ~10 A cutoff
instead of ~12 A on lennard-jones interactions) as long as
the _accumulation_ of forces is done in double precision.
please keep in mind that the current GPU support in the NAMD
release version is still based on CUDA 2.x, which predates
fermi hardware.
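
a minimal sketch of why the accumulation is the part that matters
(plain python, not NAMD or CUDA code): each "force contribution" is
rounded to single precision, then summed once with a single precision
accumulator and once with a double precision one. the helper names
and the constant 0.1 contribution are made up for illustration;
single precision is emulated by round-tripping through struct.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(terms, single_accumulator):
    """Sum terms; optionally round the running total to single precision."""
    total = 0.0
    for t in terms:
        total += t
        if single_accumulator:
            total = to_f32(total)
    return total


# HYPOTHETICAL per-pair contributions, each already in single precision.
terms = [to_f32(0.1)] * 100_000

# Correctly rounded reference sum of those single-precision terms.
reference = math.fsum(terms)

err_single = abs(accumulate(terms, single_accumulator=True) - reference)
err_double = abs(accumulate(terms, single_accumulator=False) - reference)

print(f"error with single-precision accumulator: {err_single:.3e}")
print(f"error with double-precision accumulator: {err_double:.3e}")
```

the individual terms are identical in both runs; only the accumulator
differs, and that is where the single precision error piles up.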

> But what about the equivalent AMD cards?  These have not been artificially
> crippled.  If we look at the manufacturer supplied peak performance specs,
> an M2090 gives 666 GFlops of double precision performance.
> By comparison a $350 Radeon HD 6970 gives 675 GFlops of double precision
> performance, owing to its larger number of stream processors.
> This means that if you could run NAMD on an AMD card you could potentially
> get the same performance for ~1/10th of the cost (and the Radeon HD 6970 has
> 2GB of memory, enough for pretty large systems).  Unfortunately, the AMD
> cards don't run CUDA.  If NAMD could work with OpenCL we could be in a
> position where everyone could have a desktop computer with the same
> computing performance as a multiple thousand dollar supercomputer (at least
> as far as molecular dynamics on NAMD were concerned).
> Thoughts?

first off, your assumptions about what kind of acceleration
GPUs can provide relative to what you can achieve with
current multi-core CPUs are highly exaggerated. you must
not compare peak floating point performance. if you want
to get the most bang for the buck, go with 4-way 8-core
AMD magny-cours processor based machines (the new
interlagos are priced temptingly and vendors are pushing
them like crazy, but they don't provide as good performance
with legacy binaries and require kernel upgrades for efficient
scheduling and better cache handling). you can set up a
kick-ass 32-core NAMD box for around $6,000 that will
beat almost any combination of CPU/GPU hardware that
you'll get for the same price and you won't have to worry
about how to run well on that machine or whether an
application is fully GPU accelerated.
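
for reference, here is what the quoted peak numbers work out to per
dollar (python sketch, variable names made up). this only restates
the vendor peak figures and approximate prices from the post above;
as said, peak numbers do not tell you what NAMD will actually deliver,
so treat the ratio as a paper figure, not a prediction.

```python
# Vendor peak double-precision figures and approximate prices, as quoted.
m2090_gflops, m2090_price = 666.0, 3000.0
hd6970_gflops, hd6970_price = 675.0, 350.0

m2090_per_dollar = m2090_gflops / m2090_price
hd6970_per_dollar = hd6970_gflops / hd6970_price

# How much more peak GFlops per dollar the consumer AMD card offers on paper.
paper_ratio = hd6970_per_dollar / m2090_per_dollar

# Price of the AMD card as a fraction of the Tesla price.
price_fraction = hd6970_price / m2090_price

print(f"M2090:   {m2090_per_dollar:.2f} peak GFlops per $")
print(f"HD 6970: {hd6970_per_dollar:.2f} peak GFlops per $")
print(f"paper advantage: {paper_ratio:.1f}x at {price_fraction:.0%} of the price")
```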

as for AMD GPUs: i just set up a machine with a bunch
of AMD GPUs to test LAMMPS. it does have OpenCL
support, which exploits the similarities between CUDA and
OpenCL when the former is programmed through the driver
interface, so only some preprocessing magic and a little
glue code is needed to add OpenCL support to the CUDA
code for non-bonded and kspace calculations. the results
are encouraging, but john's maturity concerns are still
valid. there are a ton of little practical things that make
it not yet feasible to run AMD GPUs in a data center setting
at the same level of reliability and stability as NVIDIA GPUs.
having gone through similar experiences with the
NVIDIA hardware, i would say that you have to give it a couple
more years... it is getting closer and AMD - same as NVIDIA -
is donating hardware and working with people to get the
kind of feedback from "early adopters" that is needed to
mature drivers and tools so that they can be used by
"i am just a user"-type of folks. this is not a fast process
and it will take some effort to catch up.

bottom line: a) don't believe all the PR stuff that is thrown
at people as to how much faster things can become, but
get first hand experience and then a lot of things look
different. for decades, the IT industry has been able to
fool people, especially people in research who have
limited technical experience.
and b) remember that when something is technically
possible, it doesn't immediately translate into being
practically usable. any large scale scientific software
package can easily take 10 years from its initial inception
to becoming mature enough for widespread use. for
any disruptive technology to be integrated into such
a package you have to allow for at least half that amount
of time until it reaches a similar degree of maturity.


> ~Aron

Dr. Axel Kohlmeyer
College of Science and Technology
Temple University, Philadelphia PA, USA.
