RE: Scaling behaviour of NAMD on hosts with GPU accelerators

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Mar 24 2017 - 05:35:00 CDT

Forget what I said. It seems NAMD can now actually use multiple GPUs with a
single process.

 

I’ll do some tests and see if I can find something…
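
For reference, with a multicore CUDA build a single-process run across all
four cards would look roughly like this (only a sketch; +p and the device
list have to match the actual box, and apoa1.namd stands in for the real
input file):

  # one namd2 process, 20 worker threads, all four GPUs
  namd2 +p20 +devices 0,1,2,3 apoa1.namd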

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf
of Norman Geist
Sent: Friday, March 24, 2017 11:28
To: namd-l_at_ks.uiuc.edu; 'Kraus, Sebastian' <sebastian.kraus_at_tu-berlin.de>
Subject: RE: namd-l: Scaling behaviour of NAMD on hosts with GPU accelerators

 

You did not tell us anything about your launching procedure. Please note
that NAMD cannot use multiple GPUs per process. This means you need a
network-enabled build of NAMD in order to start multiple processes (one
per GPU). The remaining cores can be used by SMP threads.
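
As a rough sketch (assuming a netlrts-smp CUDA build on a 20-core box with
four GPUs; the thread counts are just an example), such a launch would look
something like:

  # 4 processes (one per GPU), 4 worker threads each plus a communication thread
  charmrun ++local +p16 ++ppn 4 namd2 +devices 0,1,2,3 apoa1.namd

Each process then gets one of the listed devices, and the four remaining
cores carry the communication threads.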

 

Usually, adding more GPUs will give a roughly linear speedup, provided the
molecular system isn't too small.

 

Norman Geist

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf
of Kraus, Sebastian
Sent: Thursday, March 23, 2017 17:18
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: Scaling behaviour of NAMD on hosts with GPU accelerators

 

Hello,

I am about to benchmark NAMD on an Intel x86-64 SMP HPC box equipped with 20
CPU cores and four Nvidia GeForce GTX 1080 (Pascal) graphics
controllers/accelerator cards, and decided to use the provided apoa1 job
example as a test case. The overall wall clock time for runs of the CUDA/SMP
hybrid-parallelized namd2 binary with 5 to 20 processors varies in a range
of 3.5 to 8 minutes.
I observed that runs of the CUDA/SMP hybrid-parallelized namd2 binary with a
single GPU card show a significant wall clock time reduction, by a factor of
about 10, in comparison with runs of the SMP-only parallelized namd2 binary.
Unfortunately, the runtime of namd2 does not scale further when additional
cards are added. On the contrary, the wall clock time of NAMD runs increases
slightly with more GPU devices. This suggests that an increasing amount of
communication overhead is generated by device-to-host and host-to-device
operations when more than one card is used.
I then tested whether binding/manual mapping of threads to CPU cores helps,
but this approach led to an overall deterioration of performance and
runtime.
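For illustration, such a manual mapping is typically expressed through
NAMD's affinity options; the core maps below are only an example, not the
exact values from my runs:

  # pin worker threads and communication threads to specific cores
  charmrun ++local +p16 ++ppn 4 namd2 +devices 0,1,2,3 +setcpuaffinity \
    +pemap 1-4,6-9,11-14,16-19 +commap 0,5,10,15 apoa1.namd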
Additionally, I profiled NAMD runs via nvprof/nvvp, but was not able to find
any valuable/helpful information about the overall usage of GPU resources
(memory/GPU power) on each card. Only a timeline of the kernel runtimes can
be extracted, and this does not answer the question whether an accelerator
card is fully or only partially loaded.
Does anyone have a helpful hint for me? And how is load balancing
implemented in NAMD (source code)?

Best regards

Sebastian Kraus

 

 

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin

Staff member, IT team, Institut für Chemie
Building C, Straße des 17. Juni 115, Room C7

Tel.: +49 30 314 22263
Fax: +49 30 314 29309
Email: sebastian.kraus_at_tu-berlin.de
