Re: Scaling behaviour of NAMD on hosts with GPU accelerators

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Mar 24 2017 - 05:27:46 CDT


You did not tell us anything about how you launch NAMD. Please note
that NAMD cannot use multiple GPUs per process. This means you need to use
a network-enabled build of NAMD in order to start multiple processes (one
per GPU). The remaining cores can be used by SMP threads.
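
As a rough illustration of the layout described above (one process per GPU, remaining cores as SMP threads), here is a minimal Python sketch. The helper names `plan_layout` and `build_command` are hypothetical, and reserving one core per process for the communication thread is an assumption. `+p`, `++ppn` and `+devices` are existing charmrun/NAMD options, but their exact spelling and placement depend on the NAMD/Charm++ build, so check the release notes of your version before using the generated command.

```python
def plan_layout(total_cores: int, n_gpus: int) -> dict:
    """Split cores across one process per GPU.

    Assumption: each process keeps one core for its communication
    thread and uses the rest as worker threads (PEs).
    """
    if n_gpus < 1 or total_cores < 2 * n_gpus:
        raise ValueError("need at least two cores per GPU process")
    cores_per_process = total_cores // n_gpus
    workers_per_process = cores_per_process - 1  # one core left for the comm thread
    return {
        "processes": n_gpus,
        "ppn": workers_per_process,
        "total_pes": workers_per_process * n_gpus,
        "devices": ",".join(str(i) for i in range(n_gpus)),
    }


def build_command(layout: dict, config: str = "apoa1.namd") -> list:
    """Assemble a single-host charmrun command line (illustrative only)."""
    return [
        "charmrun", "++local",
        "+p%d" % layout["total_pes"],   # total number of worker threads
        "++ppn", str(layout["ppn"]),    # worker threads per process
        "namd2",
        "+devices", layout["devices"],  # GPUs to distribute over the processes
        config,
    ]


if __name__ == "__main__":
    layout = plan_layout(total_cores=20, n_gpus=4)
    print(" ".join(build_command(layout)))
```

For the 20-core, four-GPU host from the original question this yields four processes with four worker threads each, which leaves one core per process for communication.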

Usually, adding more GPUs results in a roughly linear speedup, provided the
molecular system isn’t too small.
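
To check how close a benchmark series comes to linear speedup, one can compute the speedup and parallel efficiency relative to the single-GPU run. The following is a minimal sketch; the function name `scaling_table` is hypothetical, and the wall clock times in the usage part are placeholders, not measurements from this thread.

```python
def scaling_table(wall_times: dict) -> list:
    """Return (gpus, speedup, efficiency) rows relative to the 1-GPU run.

    wall_times maps the number of GPUs to a wall clock time in seconds
    and must contain an entry for one GPU.
    """
    if 1 not in wall_times:
        raise ValueError("a single-GPU baseline is required")
    base = wall_times[1]
    rows = []
    for n in sorted(wall_times):
        speedup = base / wall_times[n]
        rows.append((n, speedup, speedup / n))  # efficiency 1.0 means linear scaling
    return rows


if __name__ == "__main__":
    # Placeholder timings in seconds, for illustration only.
    example = {1: 100.0, 2: 50.0, 4: 40.0}
    for gpus, speedup, efficiency in scaling_table(example):
        print("%d GPU(s): speedup %.2f, efficiency %.2f" % (gpus, speedup, efficiency))
```

An efficiency well below 1.0 at higher GPU counts suggests that the system is too small or that the launch configuration is not distributing work across the cards.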

Norman Geist

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf
of Kraus, Sebastian
Sent: Thursday, 23 March 2017 17:18
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: Scaling behaviour of NAMD on hosts with GPU accelerators

Hello,

I am about to benchmark NAMD on an Intel x86-64 SMP HPC box equipped with 20
CPU cores and four Nvidia GeForce GTX 1080 (Pascal) graphics
cards, and decided to use the provided apoa1 example job as a
test case. The wall clock time for runs of the CUDA/SMP
hybrid-parallelized namd2 binaries with 5 to 20 processors ranges from
3.5 to 8 minutes.
I observed that runs of the CUDA/SMP hybrid-parallelized namd2 binaries
with a single GPU card reduce the wall clock time by a
factor of about 10 compared with runs of the SMP-only
parallelized namd2 binaries.
Unfortunately, the runtime of namd2 does not scale any further when
more cards are added. Instead, the wall clock time of NAMD runs increases
slightly as more GPU devices are added. This possibly indicates
that the communication overhead from device-to-host and host-to-device
transfers grows when more than one card is used.
I then tested whether binding/manually mapping threads to CPU cores helps,
but this approach degrades overall performance and
runtime.
Additionally, I profiled NAMD runs with nvprof/nvvp, but was not able to find
any useful information about the overall usage of GPU resources
(memory/GPU power) on each card. Only a timeline of the kernel runtimes can
be extracted, but this does not answer the question of whether
an accelerator card is fully or only partially loaded.
Does anyone have a useful hint for me? How is load balancing implemented
in NAMD (source code)?

Best regards

Sebastian Kraus

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin

IT team staff member at the Institut für Chemie
Building C, Straße des 17. Juni 115, Room C7

Tel.: +49 30 314 22263
Fax: +49 30 314 29309
Email: sebastian.kraus_at_tu-berlin.de


This archive was generated by hypermail 2.1.6 : Sun Dec 31 2017 - 23:21:10 CST