From: Martin Aumüller (aumueller_at_uni-koeln.de)
Date: Wed Jan 28 2009 - 06:35:19 CST

Hi,

When trying out the CUDA-accelerated potential computation, I ran into a
problem with our hardware configuration: we have a Quadro FX 5800 (240 cores)
and a Quadro NVS 290 (16 cores) in one workstation. I experienced a tremendous
slow-down when using both CUDA devices: the work is distributed evenly across
all CUDA devices, so the slowest device has to do as much work as each of the
other devices and hence determines the total run time.
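
To put rough numbers on this, here is a back-of-the-envelope sketch in C++. It assumes that a device's throughput is simply proportional to its core count, which is a simplification and not something I measured. The function names are made up for illustration and are not part of VMD.

```cpp
#include <algorithm>
#include <vector>

// Run time of a static even split: every device gets the same share of the
// work, so the device with the lowest rate finishes last.
// ASSUMPTION: rates are abstract "work units per time unit", e.g. core counts.
double even_split_time(double work, const std::vector<double> &rates) {
    double share = work / rates.size();
    double slowest = *std::min_element(rates.begin(), rates.end());
    return share / slowest;
}

// Lower bound with perfect balancing: all devices finish at the same moment.
double ideal_time(double work, const std::vector<double> &rates) {
    double total = 0.0;
    for (double r : rates) total += r;
    return work / total;
}
```

With rates {240, 16} and 3840 units of work, the even split takes 120 time units under this model, while the 240-core device alone would need 16 and a perfectly balanced run 15.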

I solved it by simply providing a mutex-protected global counter for the slice
loop, shared by all threads, so that each thread fetches its next slice from
the counter. As this is a rather coarse-grained load distribution scheme, I
hope that the mutex does not lead to much overhead.
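
The idea can be sketched roughly as follows. This is a minimal standalone illustration using C++ standard threads, not the code from the patch: SliceCounter, worker and run_slices are hypothetical names, and the per-slice CUDA computation is replaced by a placeholder that only records which worker took which slice.

```cpp
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared counter over slice indices, guarded by a mutex.
struct SliceCounter {
    std::mutex lock;
    int next = 0;   // next unclaimed slice index
    int total = 0;  // number of slices

    // Returns the next unclaimed slice index, or -1 when none remain.
    int claim() {
        std::lock_guard<std::mutex> guard(lock);
        if (next >= total) return -1;
        return next++;
    }
};

// Stand-in for one per-device worker thread: it keeps claiming slices until
// the counter runs out, so a faster device naturally claims more of them.
void worker(SliceCounter &counter, std::vector<int> &owner, int device_id) {
    for (;;) {
        int k = counter.claim();
        if (k < 0) break;
        owner[k] = device_id;  // placeholder for computing slice k on the device
    }
}

// Runs num_devices workers over num_slices slices and returns how many
// slices each worker processed.
std::vector<int> run_slices(int num_slices, int num_devices) {
    SliceCounter counter;
    counter.total = num_slices;
    std::vector<int> owner(num_slices, -1);
    std::vector<std::thread> threads;
    for (int d = 0; d < num_devices; ++d)
        threads.emplace_back(worker, std::ref(counter), std::ref(owner), d);
    for (auto &t : threads) t.join();
    std::vector<int> per_device(num_devices, 0);
    for (int k = 0; k < num_slices; ++k) per_device[owner[k]]++;
    return per_device;
}
```

The lock is taken only once per slice, so as long as computing a slice takes much longer than incrementing the counter, contention should stay low.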

I'd be happy if you could apply the attached patch to VMD.

Regards,
Martin