Version date:  3/10/2007

Coulombic potential grid microbenchmark test codes
--------------------------------------------------
These are simplified versions of the coulombic potential
code used in some of our molecular modeling and analysis 
tools, specifically 'cionize' and 'VMD':
  http://www.ks.uiuc.edu/Research/vmd/

These test codes were written using the CUDA 0.8 beta SDK,
and as such they contain a few unusual coding idioms 
which cause the compiler to generate the code we want, though we
might prefer that the optimizer do some of these things for us:
  foo = 1.0f / sqrt(bar);
  energy += charge * foo;
where one might prefer to write:
  energy += charge / sqrt(bar);

GFLOPS note: All GFLOPS numbers count FMAD as 2 ops; all other FP ops
  are counted as one.  Counting FMAD as two ops seems fair,
  since not all architectures have it, and the count then relates
  directly to the C code.

Block/grid size notes:
  The block and grid sizes can be adjusted; the values used in the
code included in this directory give good performance for each
formulation.  Some kernels are tolerant of adjustment and perform
equally well with various block sizes; others are sensitive and only
achieve peak performance for one or two size combinations.


Performance-oriented test kernels (sorted by speed):
----------------------------------------------------
cuenergytex:
  Stores all of the atoms in a 2-D texture, loops over the whole
  plane in one pass.  Performance isn't as good as the constant-buffer
  methods below, because it doesn't unroll the X loop and use registers
  to reduce the rate of loads.  Since this first test didn't do very well
  relative to the other methods, I haven't pursued it any further (yet).
  It could surely be greatly improved with a better understanding of the
  performance factors affecting texture fetches and their interactions
  with registers and shared memory, and certainly by applying some of the
  same loop unrolling done in subsequent codes below.
  90 GFLOPS
  9 billion atom evals/sec  

cuenergyconstpre:
  Simplest GPU implementation, uses the G80 64K constant memory
  to store 4070 atoms at a time, calculating the coulombic potential
  by summing the per-atom contributions one voxel at a time.
  Since all threads read the same constant value at the same time,
  we get good performance.  When the constant buffer is filled,
  the dz^2 value is precalculated, saving work.
  This was an easy version to write.
  150 GFLOPS
  16.7 billion atom evals/sec

cuenergyconstprevec4:
  This version is a simple extension of cuenergyconstpre which unrolls
  the X loop, processing four voxels at a time, reducing some redundant
  arithmetic, and improving the ratio of FP arithmetic to FP loads.
  This version uses many more registers, but achieves much higher 
  performance.
  226 GFLOPS
  33 billion atom evals/sec
   
cuenergyshared8:
  This is a variation of cuenergyconstprevec4 which uses fewer registers
  by storing incoming voxel potentials in shared memory rather than in
  registers.  The savings in registers provides a slight speed bump over
  the other version, but may still leave a fair bit of room for improvement.
  235 GFLOPS
  34.8 billion atom evals/sec                



Accuracy-oriented test kernels:
-------------------------------
cuenergycompsum:
  Variation of cuenergyconstprevec4, implementing
  Kahan's compensated summation method for improved
  accuracy when summing millions of values.
  138 GFLOPS (counting only the arithmetic that contributes to the
              science; more FLOPS are actually performed, but the extra
              ops belong to the compensated summation, so they are not
              counted here)
  20.5 billion atom evals/sec                



