Re: OpenCL and AMD GPUs

From: Aron Broom (broomsday_at_gmail.com)
Date: Thu Nov 24 2011 - 15:53:33 CST

Yes, thank you for clearing up that double-precision issue; everything makes
quite a lot of sense now.

~Aron

On Wed, Nov 23, 2011 at 11:23 AM, Axel Kohlmeyer <akohlmey_at_gmail.com> wrote:

> On Wed, Nov 23, 2011 at 11:09 AM, Thomas Bishop <bishop_at_latech.edu> wrote:
> > Axel,
> > twice you have mentioned 4-way 8core systems.
> > Why not 4-way 16-core AMD systems w/out any GPUs?
> > (this makes a nice farm instead of a cluster)
>
> because those interlagos CPUs have a different
> architecture, so that you won't get the expected
> performance unless applications are aware of it.
> they look good on paper, but apparently they need
> a kernel tweak to make the cpu caches work
> more effectively and you need a new-ish compiler
> to take advantage of the 4-way fused multiply-add
> and many other details. also the hardware topology
> is different. unlike most vendor sales droids,
> i only give recommendations on hardware that
> i know about from personal experience with
> the application at hand.
>
> i also mention the 8-core magny-cours instead
> of the 12-core since the 8-core ones are much
> cheaper _and_ run at a higher clock, but both have
> the same memory bandwidth (which is - BTW -
> also the same for the 16-core interlagos, so
> they will be similarly restricted, since MD does
> require a certain amount of memory bandwidth
> for traversing the neighbor lists all the time. after
> all, the math is designed to be cheap, otherwise
> we would all be using morse potentials instead
> of lennard-jones...).
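>
> (as a rough back-of-the-envelope illustration, using the standard
> textbook forms of the two potentials: lennard-jones,
>
>   V_LJ(r) = 4 \epsilon [ (\sigma/r)^{12} - (\sigma/r)^{6} ],
>
> can be evaluated with a handful of multiplications starting from
> r^2, which the neighbor list search already provides, while the
> morse potential,
>
>   V_Morse(r) = D_e ( 1 - e^{-a (r - r_0)} )^2,
>
> needs a square root to get r plus an exponential for every pair.
> the arithmetic per pair is deliberately cheap, so streaming the
> neighbor lists through memory becomes the bottleneck instead.)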
>
> so at the moment, those new CPU machines
> are a big unknown; they may prove to be
> a worthy option in the future. but if you are
> on a budget, there are two rules that you
> have to obey: 1) never go for the top-of-the-line
> model, and 2) never trust a vendor
> recommendation. they don't care how something
> holds up in practical use, because none
> of them actually runs applications or has
> the time to do those tests.
>
> axel.
>
> >
> > But would 40Gbps IB keep up w/ such nodes
> > or are there other considerations w/ 4-way 16core nodes?
> >
> > Thanks
> > Tom
> >
> > On 11/23/2011 09:52 AM, Axel Kohlmeyer wrote:
> >>
> >> On Wed, Nov 23, 2011 at 10:06 AM, Aron Broom <broomsday_at_gmail.com> wrote:
> >>>
> >>> @Nicholas,
> >>>
> >>>> I probably missed your point here, but on a Kentsfield Q6600 quad and
> >>>> using the ApoA1 test we see x5 acceleration using a GTX460 card (going
> >>>> from 0.44 ns/day without CUDA, to 2.22 ns/day with CUDA). Going five
> >>>> times
> >>>> faster is definitely not 'practically useless'. Having said that, for
> >>>> the
> >>>> needs of 'high powered computing initiatives' I'm not qualified to
> >>>> speak.
> >>>
> >>> That is actually more or less the improvement that I see with 2 M2070
> >>> cards
> >>> attached to a hexa-core xeon.  That is quite surprising to me, and really
> >>> quite fantastic. What are the conditions of your system (size,
> >>> electrostatic cutoff, etc)? Certainly this completely negates my point
> >>> if
> >>> you can get equivalent acceleration from a consumer card.
> >>
> >> there are three issues to consider:
> >> 1) consumer cards tend to have higher memory and GPU clocks.
> >> for classical MD, memory access speed is pretty much the dominant
> >> factor in how fast MD can run. the best indicator for this is the
> >> performance of an MD code that runs entirely on the GPU, e.g. HOOMD
> >> http://codeblue.umich.edu/hoomd-blue/benchmarks.html
> >>
> >> 2) Tesla cards with ECC enabled take a (~10%?) performance hit
> >> in terms of memory access speed, which dominates performance
> >> for high-end hardware.
> >>
> >> 3) it matters what kind of PCI-e connection bandwidth you have.
> >> quite a few "GPU-ready" cluster nodes (to be equipped with M20x0
> >> GPUs) have only 8x PCI-e slots. for a code like NAMD that has
> >> to transfer a lot of information across the PCI-e bus in every time
> >> step, that matters a lot, particularly when oversubscribing the GPU.
> >> it gets worse for S2070 quad-GPU nodes, since they add PCI-e
> >> bridges so that two GPUs have to share one PCI-e slot in the host.
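> >>
> >> to make those three points concrete, here is a minimal CUDA
> >> runtime sketch (my own illustration, assuming a reasonably recent
> >> CUDA toolkit; it is not code from NAMD) that prints the properties
> >> that matter here - memory clock, ECC state, and PCI location - for
> >> every device in the box:
> >>
> >> #include <cstdio>
> >> #include <cuda_runtime.h>
> >>
> >> int main(void) {
> >>     int ndev = 0;
> >>     if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev == 0) {
> >>         fprintf(stderr, "no CUDA devices found\n");
> >>         return 1;
> >>     }
> >>     for (int i = 0; i < ndev; ++i) {
> >>         cudaDeviceProp p;
> >>         cudaGetDeviceProperties(&p, i);
> >>         /* memory clock (kHz) and bus width set the raw bandwidth budget */
> >>         double gbps = 2.0 * p.memoryClockRate * 1e3
> >>                       * (p.memoryBusWidth / 8.0) / 1e9;
> >>         printf("GPU %d: %s\n", i, p.name);
> >>         printf("  mem clock %d MHz, bus %d bit -> ~%.0f GB/s peak\n",
> >>                p.memoryClockRate / 1000, p.memoryBusWidth, gbps);
> >>         printf("  ECC %s, PCI bus/device %d/%d\n",
> >>                p.ECCEnabled ? "on" : "off", p.pciBusID, p.pciDeviceID);
> >>     }
> >>     return 0;
> >> }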
> >>
> >>> @ Axel
> >>>
> >>>> to get something implemented "soon" in a scientific software package,
> >>>> you have basically two options: do it yourself, or get the money to
> >>>> hire somebody to do it. there is a severely limited number of people
> >>>> capable (and willing) to do this kind of work, and unless somebody gets
> >>>> paid for doing something, there is very little chance of implementing
> >>>> something that already exists in some other shape or form,
> >>>> regardless of how much sense it makes to redo it.
> >>>
> >>> Yes, I see the point. Actually I've found that NAMD has quite a lot of
> >>> features, and it's really quite impressive how well it compares with
> >>> expensive packages like AMBER.  I know of at least one initiative to make a
> >>
> >> AMBER is *not* expensive. the performance of the amber MD codes
> >> (sander and pmemd) suffers from having to carry so much legacy
> >> around, since the amber package is much older than NAMD.
> >> just have a look at CHARMM, which has been around even longer.
> >> this is also one of the reasons why amber developers
> >> get a higher speedup from using GPUs: their CPU reference
> >> is so much worse than the one in NAMD.
> >>
> >>> general CUDA->OpenCL code converter, done by Matt Harvey at Imperial
> >>> College
> >>> London, called Swan, but I'm not sure it would work in this case as a
> >>> high
> >>> degree of optimization is clearly needed.
> >>
> >> there are a ton of X->Y converters. none of them work perfectly.
> >>
> >> the better approach is to write code in such a way that it can be
> >> easily retargeted. CUDA code (when used through the driver
> >> interface, not the runtime interface) can be written and set up to look
> >> very similar to OpenCL code and vice versa. one of the two
> >> attempts to add GPU acceleration to LAMMPS, the one done
> >> by mike brown at ORNL, uses that approach:
> >> http://users.nccs.gov/~wb8/geryon/index.htm
> >> and we very recently succeeded in getting AMD hardware
> >> (it was previously only tested with NVIDIA hardware) working
> >> fairly well with OpenCL, using the exact same GPU kernels,
> >> mostly the same "glue code", plus the API abstraction from
> >> his geryon headers.
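> >>
> >> to illustrate how close the two sides are, here is a minimal
> >> driver-API sketch (the "pair_lj.ptx" module and "pair_lj" kernel
> >> names are made up for this example); every call has an almost
> >> direct OpenCL counterpart, noted in the comments:
> >>
> >> #include <cuda.h>
> >> #include <stdio.h>
> >>
> >> int main(void) {
> >>     CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction krn;
> >>     size_t n = 1 << 20;
> >>
> >>     cuInit(0);
> >>     cuDeviceGet(&dev, 0);          /* ~ clGetDeviceIDs  */
> >>     cuCtxCreate(&ctx, 0, dev);     /* ~ clCreateContext */
> >>
> >>     /* load a precompiled PTX module, much like building an OpenCL program */
> >>     cuModuleLoad(&mod, "pair_lj.ptx");         /* ~ clCreateProgram/Build */
> >>     cuModuleGetFunction(&krn, mod, "pair_lj"); /* ~ clCreateKernel        */
> >>
> >>     CUdeviceptr d_x;
> >>     cuMemAlloc(&d_x, n * sizeof(float));       /* ~ clCreateBuffer        */
> >>
> >>     void *args[] = { &d_x, &n };
> >>     /* ~ clEnqueueNDRangeKernel: grid/block sizes vs. global/local sizes  */
> >>     cuLaunchKernel(krn, (unsigned)((n + 127) / 128), 1, 1, 128, 1, 1,
> >>                    0, 0, args, 0);
> >>     cuCtxSynchronize();            /* ~ clFinish */
> >>
> >>     cuMemFree(d_x);
> >>     cuModuleUnload(mod);
> >>     cuCtxDestroy(ctx);
> >>     return 0;
> >> }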
> >>
> >>>> why no improvement. with SHAKE+RATTLE you should be able to run with
> >>>> a 2fs time step. that is a serious improvement in my book.
> >>>
> >>> Yes, it's odd.  When I move from "rigidbonds water" to "rigidbonds all",
> >>> and then go to a 2fs timestep, the speed per iteration slows down enough
> >>> that the overall benefit in terms of ns/day is ~10%.  I'm not sure why
> >>> the RATTLE algorithm is so costly in this case.  This was with 2.8b2,
> >>> though, and I haven't checked the release notes for the final 2.8 or b3;
> >>> maybe there was an issue with RATTLE?
> >>
> >> perhaps 2fs requires too many constraint iterations to achieve
> >> the desired accuracy. i would try 1.5fs then. the problem with
> >> shake is that it requires additional communication for
> >> each iteration and that may be a limiting factor, as your hardware
> >> seems to be limited in I/O bandwidth (which is a bad thing if
> >> you are after good GPU speedup, BTW).
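> >>
> >> for reference, the relevant knobs in the NAMD config file would
> >> look something like the snippet below; the values are just a
> >> starting point to experiment with, not a tested recommendation:
> >>
> >> rigidBonds      all      ;# constrain all bonds to hydrogen
> >> rigidTolerance  1.0e-8   ;# convergence target per constraint iteration
> >> rigidIterations 100      ;# upper bound on SHAKE/RATTLE iterations
> >> useSettle       on       ;# analytic SETTLE for waters instead of SHAKE
> >> timestep        1.5      ;# fs; try 1.5 before going all the way to 2.0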
> >>
> >>>> this statement is not correct. having the 1/8th double precision
> >>>> capability is fully sufficient for classical MD. the force computation
> >>>> can be and is done in single precision with very little loss of
> >>>> accuracy (the impact on accuracy is similar to using a ~10 A cutoff
> >>>> instead of ~12 A on lennard jones interactions) for as long as
> >>>> the _accumulation_ of forces is done in double precision.
> >>>> please keep in mind that the current GPU support in the NAMD
> >>>> release version is still based on CUDA 2.x, which predates
> >>>> fermi hardware.
> >>>
> >>> I'm a bit confused by this, but would be quite happy if you were correct
> >>> (and Nicholas' response suggests you are).  I gathered from the NAMD
> >>> website, and also from one of your discussions
> >>> (http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/9166.html) that
> >>> all of the code was set in double precision mode (as opposed to packages
> >>
> >> no, you are misreading my post. and that was about the CPU
> >> version anyway and an older version of NAMD. the GPU version
> >> definitely uses single precision for the force computation.
> >>
> >>> like AMBER or GROMACS where this can be changed around).  So when you say
> >>> that "the force computation can be and is done in single precision" do you
> >>> mean that is the case for the NAMD code?  If this is true then it would
> >>> explain why Nicholas sees the same degree of acceleration with a GTX460
> >>> as I see with an M2070.  But still, I find this at odds with other
> >>> information concerning NAMD and double precision.
> >>
> >> there is no conflict. GPU force kernels are different from CPU force
> >> kernels and only the non-bonded force computation is GPU accelerated
> >> in NAMD.
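> >>
> >> schematically, the idea looks like the toy kernel below (my own
> >> sketch with a made-up LJ-like term, not the actual NAMD kernel):
> >> the per-pair arithmetic stays in float, and only the accumulation
> >> of the result is widened to double:
> >>
> >> __global__ void pair_force_x(const float3 *pos, double *fx,
> >>                              int n, float cutoff2) {
> >>     int i = blockIdx.x * blockDim.x + threadIdx.x;
> >>     if (i >= n) return;
> >>     float3 pi = pos[i];
> >>     double acc = 0.0;            /* accumulate in double precision */
> >>     for (int j = 0; j < n; ++j) {
> >>         if (j == i) continue;
> >>         float dx = pi.x - pos[j].x;
> >>         float dy = pi.y - pos[j].y;
> >>         float dz = pi.z - pos[j].z;
> >>         float r2 = dx*dx + dy*dy + dz*dz;
> >>         if (r2 > cutoff2) continue;
> >>         /* single-precision pair term, fine for each individual force */
> >>         float inv2 = 1.0f / r2;
> >>         float inv6 = inv2 * inv2 * inv2;
> >>         float fpair = inv6 * (inv6 - 0.5f) * inv2;
> >>         acc += (double)(fpair * dx);  /* the summation is where it matters */
> >>     }
> >>     fx[i] = acc;
> >> }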
> >>
> >>>> first off, your assumptions about what kind of acceleration
> >>>> GPUs can provide relative to what you can achieve with
> >>>> current multi-core CPUs are highly exaggerated. you must
> >>>> not compare peak floating point performance.
> >>>
> >>> To be fair, I wasn't comparing GPU to CPU when I quoted all those peak
> >>> performance numbers; those were GPU to GPU.  The only CPU to GPU numbers I
> >>
> >> even those are not fair comparisons, since many factors
> >> affect performance, not just the peak performance. vendors
> >> like to push these numbers in your face, since this is often
> >> all that they have, but a practitioner always also asks, how
> >> well does it work at the end? how much more do i get done?
> >> in the current situation, GPUs can give you a boost when
> >> running on a desktop and help when you need high throughput
> >> (particularly through folding_at_home and similar projects),
> >> but for capability calculations, i.e. where you want to run
> >> as fast as possible and don't mind how many nodes or
> >> cpus you use, CPUs are still unbeaten. GPUs still need such
> >> large chunks of work to run efficiently that they cannot compete there.
> >> perhaps in a few years, when the integration of GPUs and
> >> memory into the CPU is complete, things may be different.
> >>
> >>> mentioned was a 4-5 fold improvement from adding 2 M2070s to a 6-core
> >>> xeon,
> >>> and this is true in my case.
> >>>
> >>>> if you want
> >>>> to get the most bang for the buck, go with 4-way 8-core
> >>>> AMD magny-cours processor based machines (the new
> >>>> interlagos are priced temptingly and vendors are pushing
> >>>> them like crazy, but they don't provide as good performance
> >>>> with legacy binaries and require kernel upgrades for efficient
> >>>> scheduling and better cache handling). you can set up a
> >>>> kick-ass 32-core NAMD box for around $6,000 that will
> >>>> beat almost any combination of CPU/GPU hardware that
> >>>> you'll get for the same price and you won't have to worry
> >>>> about how to run well on that machine or whether an
> >>>> application is fully GPU accelerated.
> >>>
> >>> Thanks for the information, that is really quite interesting.  I suppose
> >>> the architecture here plays a very important role.  I had compared against
> >>> a CPU
> >>
> >> yes.
> >>
> >>> cluster (that was several years old), and found that my 6-core + 2 M2070
> >>> combination was the same speed as 96 CPU cores.  But again, those CPUs
> >>> were a bit older and I can imagine that having each CPU be 8-core on its
> >>> own (I think that cluster was built from dual-core chips) would improve
> >>> scaling.
> >>
> >> not scaling, but overall performance. many-core CPUs actually
> >> are a challenge to scaling, since you can quickly run into communication
> >> contention issues. this is why NAMD now has hybrid SMP binaries,
> >> and other codes (i am providing this for LAMMPS) have also moved
> >> to hybrid MPI+threading parallelization schemes for better scaling
> >> across larger numbers of multi-core CPUs. the nice thing about the
> >> 4-way 8-core CPU machines is that their overall performance is often
> >> sufficient for running routine NAMD ("capacity") calculations when
> >> using only a single node. so no need to purchase an expensive
> >> high-speed network; you don't even need racks, just a bunch of
> >> these workstations tucked away in a closet. for bigger jobs one
> >> should then move to proper clusters.
> >>
> >>>> bottom line: a) don't believe all the PR stuff that is thrown
> >>>> at people as to how much faster things can become, but
> >>>> get first hand experience and then a lot of things look
> >>>> different. the IT industry has been able to fool people
> >>>> for decades, especially people in research who have limited
> >>>> technical experience.
> >>>> and b) remember that when something is technically
> >>>> possible, it doesn't immediately translate into being
> >>>> practically usable. any large scale scientific software
> >>>> package can easily take 10 years from its initial inception
> >>>> to becoming mature enough for widespread use. for
> >>>> any disruptive technology to be integrated into such
> >>>> a package you have to allow for at least half that amount
> >>>> of time until it reaches a similar degree of maturity.
> >>>
> >>> Yes, I quite agree with a), the initial hype was that GPUs would give
> >>> orders
> >>> of magnitude faster performance, and while they are certainly nice, it
> >>> has
> >>
> >> well, they do - for certain problems and for the accelerated
> >> kernel only. but much of that added performance is lost
> >> when running real applications thanks to amdahl's law.
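> >>
> >> (to put a rough number on that, with made-up but plausible
> >> figures: if the accelerated non-bonded kernel is a fraction
> >> p = 0.8 of the total runtime and the GPU makes that part
> >> s = 10x faster, amdahl's law
> >>
> >>   S = 1 / ((1 - p) + p/s) = 1 / (0.2 + 0.8/10) ~ 3.6
> >>
> >> gives only about a 3.6x overall speedup, nowhere near the
> >> 10x of the kernel itself.)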
> >>
> >>> not been as mind-boggling as originally suggested.  In terms of b), that
> >>> is a useful bit of information that I'll keep in mind.  As I mostly just
> >>> write scripts to simplify my work, I've really no understanding of the
> >>> workings of making a large and complicated software package.  Thanks a lot
> >>> for the responses, very informative, and I'd really like to know about
> >>> this double precision thing.
> >>
> >> that should be taken care of now, i hope.
> >>
> >> cheers,
> >> axel.
> >>
> >>> ~Aron
> >>>
> >>
> >>
> >
> >
> > --
> > *******************************
> > Thomas C. Bishop
> > Tel: 318-257-5209
> > Fax: 318-257-3823
> > http://dna.engr.latech.edu
> > ********************************
> >
> >
>
>
>
> --
> Dr. Axel Kohlmeyer
> akohlmey_at_gmail.com http://goo.gl/1wk0
>
> College of Science and Technology
> Temple University, Philadelphia PA, USA.
>
