Re: Slow performance over multi-core processor and CUDA build

From: Josh Vermaas (joshua.vermaas_at_gmail.com)
Date: Tue Jul 28 2020 - 14:28:59 CDT

Hi Roshan,

+idlepoll hasn't been required for CUDA builds since 2.11 or 2.12. My usual
command line for a single-node run with a GPU build looks something like this:

/path/to/namd/binary/namd2 +p8 input.namd | tee output.log
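
If you have more than one GPU, I believe you can also control which devices
NAMD uses with the +devices flag (the device list here is just an example):

/path/to/namd/binary/namd2 +p8 +devices 0,1 input.namd | tee output.log

By default the CUDA build should grab every GPU it can see, so on a
single-GPU workstation the plain +p8 line above is all you need.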

The processor count matters, since in the NAMD 2.x branch the integrator
runs on the CPU. If you decide to use the NAMD 3.0 alpha, which moves the
integrator to the GPU, the equivalent line would be:

/path/to/namd/binary/namd3 +p1 input.namd | tee output.log

You'd also want to add the CUDASOAIntegrate flag to your .namd file. The
NVIDIA developer blog has a recent post that details the changes you'd
need to make:
https://developer.nvidia.com/blog/delivering-up-to-9x-throughput-with-namd-v3-and-a100-gpu/
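
For completeness, the config-side change is only a couple of lines; a minimal
sketch (double-check the exact keyword spelling and the recommended output
settings against that post, I'm going from memory here):

# NAMD 3.0 alpha: run the integrator on the GPU
CUDASOAintegrate on
# energy output forces GPU-host synchronization, so keep it infrequent
# (the value below is only an example)
outputEnergies 400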

-Josh

On Tue, Jul 28, 2020 at 10:38 AM Roshan Shrestha <roshanpra_at_gmail.com>
wrote:

> Prof. Giacomo,
> So, if I use the newest nightly build of NAMD with
> Nvidia CUDA acceleration, do I need to specify anything in my command
> arguments, like the number of processors with *+p8 *and +idlepoll, or
> will the usual *namd2 file.conf | tee output.log *work? Which command is
> best for making full use of the CUDA cores and the CPU cores? With
> Gromacs I had to build the source myself to enable CUDA, whereas NAMD
> seems to automate things, so I am not sure how to maximize its
> performance. For now, my system is pretty simple, with 50K+ atoms, and
> the simulation parameters are standard for a normal equilibration and
> production run. Thanks.
>
> With best regards
>
>
> On Tue, Jul 28, 2020 at 6:52 PM Giacomo Fiorin <giacomo.fiorin_at_gmail.com>
> wrote:
>
> > Not sure why hyperthreading is mentioned; it is not supported by the
> > processor in question:
> >
> >
> > https://ark.intel.com/content/www/us/en/ark/products/186604/intel-core-i7-9700k-processor-12m-cache-up-to-4-90-ghz.html
> >
> > Roshan, what are the system size and simulation parameters? It is
> > possible that the system is not suitable for a CPU-GPU hybrid scheme
> > (possibly made worse by using too many CPU cores). The Gromacs benchmark
> > (which was probably run in single precision and on the CPU) seems to
> > suggest a rather small system.  Have you tried running a non-GPU build?
> > Or the GPU-optimized 3.0 alpha build?
> >
> > For typical biological systems (on the order of 100,000 atoms) running
> > over CPUs, Gromacs would be faster on a few nodes but scale less well
> > than NAMD over multiple nodes. The tipping point depends on the system
> > and, to a lesser extent, on the hardware makeup. I suggest you benchmark
> > your system thoroughly with both codes, and then decide.
> >
> > Giacomo
> >
> > On Tue, Jul 28, 2020 at 8:37 AM Norman Geist <
> > norman.geist_at_uni-greifswald.de> wrote:
> >
> >> I’d say don’t use hyperthreading in HPC in general; there is nothing
> >> special about GPUs in that regard. You can assign your tasks/threads to
> >> physical cores only, e.g.
