Comparing GPU-resident to GPU-offload mode on the DHFR (JAC) benchmark

Benchmarking for A100 GPUs on TCB puck
(HGX-A100: 4x A100-SXM4-40GB, 2x AMD EPYC 74F3 24-Core Processor)

GPU-resident performed best when launched with 2 cores:
$NAMD +p2 +setcpuaffinity +devices 0 dhfr_gpures_amber_nve.namd

GPU-offload performed best when launched with all 48 cores:
$NAMD +p48 +setcpuaffinity +devices 0 dhfr_gpuoff_amber_nve.namd
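To check the best core count on other hardware, the two commands above
can be generalized into a small sweep. This is only a sketch: the list of
core counts, the RUN switch, the log file names, and the fallback value
"namd3" for $NAMD are assumptions for illustration and are not part of
the benchmark files. By default it only prints the commands.

```shell
#!/bin/sh
# Assumption: $NAMD points at the NAMD binary; "namd3" is a placeholder default.
NAMD=${NAMD:-namd3}

# Build the launch line used in this README for a given core count and config.
namd_cmd() {
    printf '%s +p%s +setcpuaffinity +devices 0 %s\n' "$NAMD" "$1" "$2"
}

# Illustrative core counts; adjust to the machine being tested.
for p in 2 4 8 16 24 48; do
    cmd=$(namd_cmd "$p" dhfr_gpures_amber_nve.namd)
    if [ "${RUN:-0}" = 1 ]; then
        # Hypothetical log naming, one log per core count.
        $cmd > "gpures_p${p}.log"
    else
        # Dry run: show what would be launched.
        echo "$cmd"
    fi
done
```

Set RUN=1 to launch the runs, and swap in dhfr_gpuoff_amber_nve.namd to
sweep the GPU-offload configuration the same way.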

The "amber" config files use a 9 Å cutoff for non-bonded interactions,
as is done with the AMBER force field.

The "hmr" input files apply hydrogen mass repartitioning (HMR) to the
standard system, allowing 4 fs time steps with rigid bond constraints.
These files can be generated by running psfgen on the hmr.tcl file, but
this requires a true version 2.0 build of psfgen (the one from VMD).
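
One possible way to run hmr.tcl through VMD's bundled psfgen is to feed
the script to VMD in text mode. This is a sketch under assumptions: that
a "vmd" executable is on the PATH (overridable via a hypothetical $VMD
variable), that this VMD ships a psfgen 2.0 build, and that hmr.tcl loads
psfgen itself and finds its inputs in the current directory.

```shell
#!/bin/sh
# Assumption: "vmd" is on PATH; override with VMD=/path/to/vmd if needed.
VMD=${VMD:-vmd}

# -dispdev text : run without the graphical display
# -eofexit      : exit VMD when the script on stdin reaches end of file
hmr_cmd() {
    printf '%s -dispdev text -eofexit' "$VMD"
}

if command -v "$VMD" >/dev/null 2>&1 && [ -f hmr.tcl ]; then
    # Feed the Tcl script to VMD on stdin.
    $(hmr_cmd) < hmr.tcl
else
    # VMD or hmr.tcl not found: only show the command.
    echo "would run: $(hmr_cmd) < hmr.tcl"
fi
```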

