From: John Stone
Date: Wed Jan 28 2009 - 08:21:58 CST

  Thanks for the note. As you noticed, the existing code used
a simple static load balancing scheme that's only appropriate when
the GPUs are all of the same type. This originated from the fact that
the multicore CPU code in VMD uses this scheme as well, though in that
case any load imbalance between cores is handled by the OS scheduler
and so it was unnecessary to do anything further for the CPU code.
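
To put rough numbers on why an even static split hurts with mismatched
GPUs, here is a small sketch of the arithmetic. It assumes, purely for
illustration, that a device's throughput scales with its core count
(this ignores clock rate and memory bandwidth differences). The names
`static_split_time` and `ideal_time` are hypothetical helpers, not VMD
functions.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Idealized model (an assumption, not a measurement): each device processes
// work at a fixed rate, here taken to be proportional to its core count.

// Even static split: every device gets the same share, so the run finishes
// only when the slowest device finishes its share.
double static_split_time(double work, const std::vector<double>& rates) {
    assert(!rates.empty());
    double share = work / static_cast<double>(rates.size());
    double slowest = *std::min_element(rates.begin(), rates.end());
    return share / slowest;
}

// Perfectly balanced split: all devices finish together, so the total work
// is divided by the combined rate.
double ideal_time(double work, const std::vector<double>& rates) {
    assert(!rates.empty());
    double total = std::accumulate(rates.begin(), rates.end(), 0.0);
    return work / total;
}
```

Under this model, 256 units of work on devices with rates 240 and 16
take 8 time units with the even split versus 1 with a balanced split,
so the small device dominates the run time.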

I've been intending to replace the static load balancing scheme used
for the CUDA kernels with a variation of one of the "work stealing
deque" type dynamic scheduling algorithms. I prefer the "work stealing"
approaches to a mutex-protected counter because they typically have
much less mutex contention, and what contention remains is often spread
over multiple mutexes rather than one.
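
As a rough illustration of the general idea (not the planned VMD
implementation), here is a minimal work-stealing sketch in which each
worker owns a deque guarded by its own mutex. It assumes all work items
are known up front and no new work is added while the workers run.
`WorkQueue`, `pop_local`, `steal`, and `run_work_stealing` are
hypothetical names. Production work-stealing deques are often lock-free;
this version uses one mutex per deque to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// One deque per worker, each protected by its own mutex, so workers only
// contend with each other when one of them is stealing.
struct WorkQueue {
    std::mutex mutex;
    std::deque<int> items;
};

// The owner takes work from the back of its own deque.
bool pop_local(WorkQueue& q, int& item) {
    std::lock_guard<std::mutex> lock(q.mutex);
    if (q.items.empty()) return false;
    item = q.items.back();
    q.items.pop_back();
    return true;
}

// A thief takes work from the front of another worker's deque.
bool steal(WorkQueue& victim, int& item) {
    std::lock_guard<std::mutex> lock(victim.mutex);
    if (victim.items.empty()) return false;
    item = victim.items.front();
    victim.items.pop_front();
    return true;
}

// Processes items 0..num_items-1 and returns, for each item, the index of
// the worker that ended up handling it.
std::vector<int> run_work_stealing(int num_items, std::size_t num_workers) {
    assert(num_workers > 0);
    std::vector<int> owner(num_items, -1);
    std::vector<WorkQueue> queues(num_workers);
    // Start from an even static split; stealing corrects any imbalance.
    for (int i = 0; i < num_items; ++i)
        queues[static_cast<std::size_t>(i) % num_workers].items.push_back(i);

    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            int item = 0;
            for (;;) {
                bool found = pop_local(queues[w], item);
                // Own deque is empty: try each other worker in turn.
                for (std::size_t k = 1; !found && k < num_workers; ++k)
                    found = steal(queues[(w + k) % num_workers], item);
                // No work is ever added after startup, so if every deque
                // was empty when checked, everything has been claimed.
                if (!found) return;
                owner[item] = static_cast<int>(w);  // stand-in for real work
            }
        });
    }
    for (auto& t : workers) t.join();
    return owner;
}
```

A fast worker drains its own deque and then pulls items from the slower
workers' deques, so the slow device ends up doing only as much as it can
finish in the same wall-clock time.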

For the electrostatic potential calculation code you're playing with
("volmap coulomb"), the single mutex-protected slice counter works
pretty well because the computation rate for any problem of interesting
size is slow enough that the mutex doesn't get hit very hard,
so the patch you've written would likely work just fine for
this specific case. My concern is that for other cases, such as the
molecular orbital code, we might want the entire calculation to conclude
in a fraction of a second, so checking a single shared mutex could
easily become a significant bottleneck if the work is distributed in
fine-grained chunks. This is where I think that an implementation
of one of the "work stealing" approaches would really win out.
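
For comparison, the single mutex-protected slice counter approach can
be sketched roughly as follows. This is an illustration of the
technique, not the attached patch: `SliceCounter` and `run_dynamic` are
hypothetical names, and the busy loop merely stands in for launching a
kernel on one slice, with `spin_per_slice[d]` as an assumed per-device
cost.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Shared counter: every worker locks the same mutex to claim its next slice.
class SliceCounter {
public:
    explicit SliceCounter(int total) : next_(0), total_(total) {}

    // Returns the next unclaimed slice index, or -1 once all are claimed.
    int next() {
        std::lock_guard<std::mutex> lock(mutex_);
        return (next_ < total_) ? next_++ : -1;
    }

private:
    std::mutex mutex_;
    int next_;
    int total_;
};

// Runs one host thread per device and returns, for each slice, the index
// of the device that processed it.
std::vector<int> run_dynamic(int num_slices,
                             const std::vector<int>& spin_per_slice) {
    assert(!spin_per_slice.empty());
    SliceCounter counter(num_slices);
    std::vector<int> owner(num_slices, -1);
    std::vector<std::thread> workers;
    for (std::size_t d = 0; d < spin_per_slice.size(); ++d) {
        workers.emplace_back([&, d] {
            for (int s = counter.next(); s != -1; s = counter.next()) {
                // Stand-in for computing slice s on device d.
                volatile long sink = 0;
                for (long i = 0; i < spin_per_slice[d]; ++i) sink = sink + i;
                owner[s] = static_cast<int>(d);
            }
        });
    }
    for (auto& t : workers) t.join();
    return owner;
}
```

Each slice costs one lock acquisition, which is negligible when a slice
takes a long time to compute but becomes a larger share of the run time
as the chunks get finer.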

I don't have any test machines with such lopsided GPU SM counts, but
if you're willing to do some testing for me, I can likely do some
work on adding dynamic load balancing to the various GPU kernels so
they also work well on systems like yours.

  John Stone

On Wed, Jan 28, 2009 at 01:35:19PM +0100, Martin Aumüller wrote:
> Hi,
> when trying out the CUDA accelerated potential computation I ran into a
> problem with our hardware configuration: we have a Quadro FX 5800 (240 cores)
> and a Quadro NVS 290 (16 cores) in one workstation. I experienced a tremendous
> slow-down when using both CUDA devices: The even load distribution between all
> CUDA devices leads to unnecessarily long run times, as the slowest device has
> to do as much work as all the other devices and hence determines the total run
> time.
> I solved it by simply providing a mutex-protected global counter for the slice
> loop for all threads. As this is a rather coarse-grain load distribution
> scheme, I hope that the mutex does not lead to much overhead.
> I'd be happy if you can apply the attached patch to VMD.
> Regards,
> Martin

NIH Resource for Macromolecular Modeling and Bioinformatics
Beckman Institute for Advanced Science and Technology
University of Illinois, 405 N. Mathews Ave, Urbana, IL 61801
Phone: 217-244-3349
Fax: 217-244-6078