From: John Stone (johns_at_ks.uiuc.edu)
Date: Wed Sep 02 2015 - 10:03:38 CDT

Hi,
  The current ILS code was written back in 2009.
The ILS GPU kernels have several hard limits that arise
from the sizes of performance-critical data structures
stored in small regions of very fast on-chip memory.

On the most recent GPUs (e.g. GeForce 980s and later) it would
likely be possible to relax some of the ILS hard-limits because
the new hardware architectures have much greater flexibility in
terms of caching read-only data. This would require writing
new GPU kernels and exploring the design trade-offs in current
hardware.

With significant work, the ILS algorithms could be parallelized
much more on both CPU and GPU (thereby allowing multi-GPU runs),
but this would require a big time investment.

ILS isn't currently a VMD development focus because the postdoc
who was developing the ILS code left for industry, and there
hasn't been any new ILS development work here since then.

I would suggest using the ILS tools as they are now and seeing
whether they work for you. If more people use these features of
VMD, there will be more motivation to revisit the performance of
the current implementation at some later time.
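For reference, ILS is driven from the Tcl console via "volmap ils".
Below is a minimal sketch; the file names, probe residue name, and
parameter values are placeholders, and the option names follow the
VMD 1.9.1 user guide, so check "volmap" in your own build before
relying on them:

```tcl
# Minimal ILS session sketch (placeholders: system.psf,
# equilibration.dcd, "resname O2" probe).
set mol [mol new system.psf]
mol addfile equilibration.dcd type dcd waitfor all

set alignref [atomselect $mol "protein"]

# Keeping -subres at 1 stays below the CUDA kernel's
# MAX_BINOFFSETS limit; larger subres values currently
# fall back to the CPU code path.
volmap ils $mol [measure minmax $alignref] \
    -res 1.0 -subres 1 \
    -probesel [atomselect $mol "resname O2"] \
    -alignsel $alignref \
    -T 300 -maxenergy 20 \
    -o ils_oxygen.dx
```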

Cheers,
  John Stone
  vmd_at_ks.uiuc.edu

On Wed, Sep 02, 2015 at 10:42:03AM -0400, Simon Dürr wrote:
> Hi all,
>
> I'm trying to run an ILS calculation on a GPU Cluster.
> The stats of the system I'm using:
> - Scientific Linux 6.1
> - 24 GB RAM
> - 8 CPUs
> - 7 GPUs (NVIDIA GeForce GTX 580 with 1.5 GB RAM)
>
> I use VMD 1.9.1 (OpenGL/CUDA enabled) and CUDA Driver v6.5
>
> My system has 60,000 atoms and is a 10 ns equilibration from NAMD 2.9.
> The DCD contains a frame for each ps.
>
> When I try to run ILS with oxygen and a subres greater than 1, I cannot
> use CUDA to accelerate the computation (".....max_binoffsets
> exceeded, using CPU....").
> Also, it seems I'm using only one GPU, not all 7 available ones. VMD
> detects them all, and when I set the subres to 1 the calculation uses
> CUDA, but only on GPU [0] (frame time ~20 s).
>
> My Questions:
> Is it possible to use all GPUs for the calculation?
> Is it possible that the memory of the GPUs is insufficient to
> accelerate this calculation with CUDA when subres >= 2?
> Is there any way to circumvent this?
>
>
> See parts of the log below
>
> Cheers,
> Simon
>
> Info) Multithreading available, 8 CPUs detected.
> Info) Free system memory: 21775MB (90%)
> Info) Creating CUDA device pool and initializing hardware...
> Info) Detected 7 available CUDA accelerators:
> Info) [0] GeForce GTX 580 16 SM_2.0 @ 1.59 GHz, 1.5GB RAM, OIO, ZCP
> Info) [1] GeForce GTX 580 16 SM_2.0 @ 1.59 GHz, 1.5GB RAM, OIO, ZCP
> Info) [2] GeForce GTX 580 16 SM_2.0 @ 1.59 GHz, 1.5GB RAM, OIO, ZCP
> Info) [3] GeForce GTX 580 16 SM_2.0 @ 1.59 GHz, 1.5GB RAM, OIO, ZCP
> Info) [4] GeForce GTX 580 16 SM_2.0 @ 1.59 GHz, 1.5GB RAM, OIO, ZCP
> Info) [5] GeForce GTX 580 16 SM_2.0 @ 1.59 GHz, 1.5GB RAM, OIO, ZCP
> Info) [6] GeForce GTX 580 16 SM_2.0 @ 1.59 GHz, 1.5GB RAM, OIO, ZCP
>
> Info) ILS frame 1/10000
> Info) Coord setup: 0.088089 s
> Info) Aligning frames.
> Info) ComputeOccupancyMap_setup() 0.016058 s
> Using CUDA device: 0
> ***** ERROR: Exceeded MAX_BINOFFSETS for CUDA kernel
> Info) vmd_cuda_evaluate_occupancy_map() FAILED, using CPU for calculation
> Info) ComputeOccupancyMap: find_distance_exclusions() 0.232957 s
> Info) ComputeOccupancyMap: find_energy_exclusions()
> 329.343837 s -> 6690550 exclusions
> Info) ComputeOccupancyMap: compute_occupancy_multiatom() 303.300753 s
> Info) ComputeOccupancyMap_calculate_slab() 632.882040 s
> Info) Total frame time = 632.988717 s

-- 
NIH Center for Macromolecular Modeling and Bioinformatics
Beckman Institute for Advanced Science and Technology
University of Illinois, 405 N. Mathews Ave, Urbana, IL 61801
http://www.ks.uiuc.edu/~johns/           Phone: 217-244-3349
http://www.ks.uiuc.edu/Research/vmd/