Re: AW: Scaling behaviour of NAMD on hosts with GPU accelerators

From: Maxime Boissonneault (maxime.boissonneault_at_calculquebec.ca)
Date: Fri Mar 24 2017 - 06:42:04 CDT

The NAMD implementation and its scaling with the number of GPUs definitely
changed between the 2014-08-13 build and recent versions.

With the version from 2014-08-13, I had the following numbers on the
Apoa1 benchmark, on rather beefy nodes (8 K20 GPUs and 20 CPU cores):

        Summary of results:

  * With 1 core and no GPU (with binding) : 1.31 s/step
  * With 1 node (20 cores) and no GPU : 0.07 s/step
  * With 1 core and 1 GPU : 0.10 s/step
  * With a quarter of a node (5 cores and 2 GPUs) : 0.026 s/step
  * With half of a node (10 cores and 4 GPUs) : 0.013 s/step
  * With a full node (20 cores and 8 GPUs) : 0.0075 s/step

This shows roughly linear scaling: each doubling of the resources
(GPUs + CPU cores) roughly halves the time per step, as the quick
calculation below illustrates.
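To make that concrete, here is a tiny Python sketch that turns the s/step
numbers from the list above into speedups and parallel efficiencies
relative to the quarter-node run (the timings are copied from the list;
nothing here is newly measured):

    # Speedup and parallel efficiency from the Apoa1 timings listed above.
    # The s/step values are copied from this email; nothing is newly measured.
    timings = [
        ("quarter node (5 cores, 2 GPUs)", 1, 0.026),
        ("half node (10 cores, 4 GPUs)",   2, 0.013),
        ("full node (20 cores, 8 GPUs)",   4, 0.0075),
    ]
    base = timings[0][2]
    for label, resource_factor, s_per_step in timings:
        speedup = base / s_per_step
        efficiency = speedup / resource_factor
        print(f"{label}: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")

Half a node comes out at essentially 100% efficiency relative to the
quarter node, and the full node at about 87%, which is why I call the
scaling roughly linear.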

With NAMD from 2015-09-17, you can find more detailed results in the
attached .csv. Those results were obtained on an even beefier node
(8 K80 boards, i.e. 16 GPUs, plus 24 CPU cores).

With NAMD ~2014, the optimal CPU core / GPU ratio was around 2.5, so it
was pretty easy to get more performance if you had multiple GPUs in a node.
With NAMD ~2016, the optimal CPU core / GPU ratio is now around 7, so it
is much harder to get more performance out of many GPUs, since you also
need to increase the number of CPU cores accordingly (see the sketch below).
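As a purely illustrative back-of-the-envelope check, here is a short
Python sketch using those optimal ratios (2.5 and 7) and the node sizes
from this thread; the ratios and node sizes are the only inputs, the rest
is simple arithmetic:

    # How many GPUs the CPU cores of a node can keep busy, given an
    # optimal CPU-core-per-GPU ratio. Ratios and node sizes are the ones
    # quoted in this thread; everything else is plain arithmetic.
    nodes = [
        ("NAMD ~2014, 20 cores / 8 GPUs (K20)",  20,  8, 2.5),
        ("NAMD ~2016, 24 cores / 16 GPUs (K80)", 24, 16, 7.0),
    ]
    for label, cores, gpus, ratio in nodes:
        fed = min(gpus, int(cores // ratio))
        print(f"{label}: the cores can feed about {fed} of the {gpus} GPUs")

With the older ratio, the 20 cores could feed all 8 K20s; with the newer
ratio, the 24 cores can only keep about 3 of the 16 GPUs busy, which is
essentially why adding GPUs without also adding cores no longer pays off.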

However, the performance with a single GPU and 5 CPU cores went up by
approximately a factor of 2 between the two versions. (I know I'm
comparing different CPU cores and different GPUs, but half a K80 board is
not that much more powerful than a single K20.)

Hope this helps,

Maxime Boissonneault

On 17-03-24 06:47, Norman Geist wrote:
>
> I see the same behavior for the Apoa1 benchmark, also with NAMD 2.11. I
> know that with older versions, e.g. 2.8/2.9, I had almost linear speedup
> when increasing the number of GPUs.
>
> This behavior might be related to recent optimizations of the CUDA
> kernels, or maybe Apoa1 is just too small?
>
> Norman Geist
>
> *From:* Norman Geist [mailto:norman.geist_at_uni-greifswald.de]
> *Sent:* Friday, March 24, 2017 11:35
> *To:* namd-l_at_ks.uiuc.edu; 'Norman Geist' <norman.geist_at_uni-greifswald.de>
> *Subject:* AW: namd-l: Scaling behaviour of NAMD on hosts with GPU
> accelerators
>
> Forget what I said. It seems NAMD can actually use multiple GPUs with a
> single process now.
>
> I’ll do some tests and see if I can find something…
>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
> *On behalf of* Norman Geist
> *Sent:* Friday, March 24, 2017 11:28
> *To:* namd-l_at_ks.uiuc.edu; 'Kraus, Sebastian' <sebastian.kraus_at_tu-berlin.de>
> *Subject:* AW: namd-l: Scaling behaviour of NAMD on hosts with GPU
> accelerators
>
> You did not tell us anything about your launching procedure. Please
> note that NAMD cannot use multiple GPUs per process. This means you
> need to use a network-enabled build of NAMD in order to start
> multiple processes (one per GPU). The remaining cores can be used by
> SMP threads.
>
> Usually adding more GPUs will result in a somewhat linear speedup, if
> the molecular system isn’t too small.
>
> Norman Geist
>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
> *On behalf of* Kraus, Sebastian
> *Sent:* Thursday, March 23, 2017 17:18
> *To:* namd-l_at_ks.uiuc.edu
> *Subject:* namd-l: Scaling behaviour of NAMD on hosts with GPU accelerators
>
> Hello,
>
> I am about to benchmark NAMD on an Intel x86-64 SMP HPC box equipped
> with 20 CPU cores and four Nvidia GeForce GTX 1080 (Pascal)
> graphics controllers/accelerator cards, and decided to use the provided
> apoa1 example job as a test case. The overall wall clock time for job
> runs of the CUDA/SMP hybrid-parallelized namd2 binaries with 5 to 20
> processors varies in a range of 3.5 to 8 minutes.
> I observed that runs of the CUDA/SMP hybrid-parallelized namd2
> binaries with a single GPU card show a significant wall clock time
> reduction, by a factor of about 10, in comparison to the wall clock
> times of runs with SMP-only parallelized namd2 binaries.
> Unfortunately, the runtime of namd2 does not scale any further when
> adding more cards. In fact, the wall clock time of NAMD runs
> increases slightly when adding more GPU devices. This presumably
> points to an increasing amount of communication overhead generated
> by device-to-host and host-to-device operations when using more
> than one card.
> I then tested whether binding/manual mapping of threads to CPU cores
> helps, but this approach leads to an overall deterioration of
> performance and runtime.
> Additionally, I profiled NAMD runs via nvprof/nvvp, but was not able
> to find any helpful information about the overall usage of GPU
> resources (memory/GPU power) on each card. Only a timeline of the
> kernel runtimes can be extracted, and this information does not
> answer the question of whether an accelerator card is fully or only
> partially loaded.
> Does anyone have a helpful hint for me? And how is load balancing
> implemented in NAMD (in the source code)?
>
>
> Best greetings
>
>
> Sebastian Kraus
>
>
>
>
> Technische Universität Berlin
> Fakultät II
> Institut für Chemie
> Sekretariat C3
> Straße des 17. Juni 135
> 10623 Berlin
>
> Staff member, IT team at the Institut für Chemie
> Building C, Straße des 17. Juni 115, Room C7
>
>
> Tel.: +49 30 314 22263
> Fax: +49 30 314 29309
> Email: sebastian.kraus_at_tu-berlin.de <mailto:sebastian.kraus_at_tu-berlin.de>
>

-- 
---------------------------------
Maxime Boissonneault
Computing analyst - Calcul Québec, Université Laval
President - Research support coordination committee, Calcul Québec
Team lead - Research Support National Team, Compute Canada
Software Carpentry instructor
Ph.D. in physics
