Re: Scaling behaviour of NAMD on hosts with GPU accelerators

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Mar 24 2017 - 08:25:52 CDT

You can keep a GPU context open with an "nvidia-smi --loop" construct.
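
For instance, something along these lines should do; the 30-second interval is an arbitrary choice, and persistence mode is an alternative not mentioned elsewhere in this thread:

    # Poll the GPUs in the background every 30 seconds so a client keeps
    # the devices initialized (the interval is an arbitrary choice).
    nvidia-smi --loop=30 > /dev/null &

    # Alternative (needs root): persistence mode keeps the driver loaded
    # even when no client is attached, so the cards do not drop into
    # their deepest power-saving state between jobs.
    sudo nvidia-smi -pm 1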

 

From: Kraus, Sebastian [mailto:sebastian.kraus_at_tu-berlin.de]
Sent: Friday, March 24, 2017 14:14
To: Norman Geist <norman.geist_at_uni-greifswald.de>
Subject: Re: namd-l: Scaling behaviour of NAMD on hosts with GPU
accelerators

 

Dear Mr. Geist,

 

>> Usually adding more GPUs will result in a somewhat linear speedup, if the
molecular system isn’t too small.

 

That's just the point. While enlarging the grid size in a stepwise manner, I
monitored the GPU power/memory usage with nvidia-smi. Now I get some
meaningful results.

Unfortunately, the benchmark case as given is not big enough to test
scaling on a system with four large GTX 1080 cards, each equipped with about
2600 GPU cores. It is puzzling that the nvidia-smi command-line tool only
shows/detects the attached GPU cards while processes are running on them. I am
working on an HPC box without any graphical environment, so the cards
are not constantly in use. I think the cards suspend or change power state
as soon as the load is off.
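
A command along the following lines can record power and memory usage per card over a whole run (the query fields and the 2-second interval are just one possible choice):

    # Sample utilization, memory use and power draw of every GPU to a
    # CSV file every 2 seconds; stop with Ctrl-C when the run is done.
    nvidia-smi \
        --query-gpu=timestamp,index,utilization.gpu,memory.used,power.draw \
        --format=csv --loop=2 > gpu_usage.csv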

 

Thanks a lot for your help.

Best regards

Sebastian Kraus

 

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin

Staff member, IT team at the Institut für Chemie
Building C, Straße des 17. Juni 115, Room C7

Tel.: +49 30 314 22263
Fax: +49 30 314 29309
Email: sebastian.kraus_at_tu-berlin.de

  _____

From: Norman Geist <norman.geist_at_uni-greifswald.de>
Sent: Friday, March 24, 2017 11:27
To: namd-l_at_ks.uiuc.edu; Kraus, Sebastian
Subject: Re: namd-l: Scaling behaviour of NAMD on hosts with GPU accelerators

 

You did not tell us anything about your launch procedure. Please note
that NAMD cannot use multiple GPUs per process. This means you need to use
a network-enabled build of NAMD in order to start multiple processes (one
per GPU). The remaining cores can be used by SMP threads.
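
Purely as an illustration, assuming a netlrts-smp-CUDA build on your 20-core/4-GPU host, a launch could look roughly like this (flag names differ between builds, so check the notes shipped with your own binary; apoa1.namd stands in for your configuration file):

    # 4 processes, each with 4 worker threads plus 1 communication
    # thread (20 cores in total), and one GTX 1080 per process assigned
    # round-robin via +devices.
    ./charmrun ++local ++n 4 ./namd2 ++ppn 4 +setcpuaffinity \
        +devices 0,1,2,3 apoa1.namd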

 

Usually adding more GPUs will result in a somewhat linear speedup, if the
molecular system isn’t too small.

 

Norman Geist

 

From: owner-namd-l_at_ks.uiuc.edu
[mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Kraus, Sebastian
Sent: Thursday, March 23, 2017 17:18
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: Scaling behaviour of NAMD on hosts with GPU accelerators

 

Hello,

I am about to benchmark NAMD on an Intel x86-64 SMP HPC box equipped with 20
CPU cores and four Nvidia GeForce GTX 1080 (Pascal) graphics
controllers/accelerator cards, and decided to use the provided apoa1 job
example as the test case. The overall wall clock time for runs of the CUDA/SMP
hybrid-parallelized namd2 binaries with 5 to 20 processors varies between
3.5 and 8 minutes.

I observed that runs of the CUDA/SMP hybrid-parallelized namd2 binaries
with a single GPU card show a significant wall clock time reduction, by a
factor of about 10, compared to the wall clock times of runs with SMP-only
parallelized namd2 binaries.

Unfortunately, the runtime of namd2 does not scale any further when adding
more cards. On the contrary, the wall clock time of NAMD runs increases
slightly as more GPU devices are added. This presumably indicates that an
increasing amount of communication overhead from device-to-host and
host-to-device operations is generated when more than one card is used.
I then tested whether binding/manual mapping of threads to CPU cores helps,
but this approach led to an overall deterioration of performance and
runtime.

Additionally, I profiled NAMD runs via nvprof/nvvp, but was not able to find
any valuable/helpful information about the global usage of GPU resources
(memory/GPU power) on each card. Only a timeline of the kernel runtimes can
be extracted, and this does not answer the question whether an accelerator
card is fully or only partially loaded.

Does anyone have a helpful hint for me? And how is load balancing
implemented in NAMD (at the source code level)?
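
For comparison, a device-level monitor such as nvidia-smi dmon would give a per-card view of SM and memory load while namd2 is running, which a kernel timeline does not; one possible invocation (metric selection and interval chosen arbitrarily):

    # One line per GPU every 2 seconds with utilization (-s u),
    # power/temperature (-s p) and clock rates (-s c).
    nvidia-smi dmon -s puc -d 2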

Best regards

Sebastian Kraus

 

 

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin

Staff member, IT team at the Institut für Chemie
Building C, Straße des 17. Juni 115, Room C7

Tel.: +49 30 314 22263
Fax: +49 30 314 29309
Email: sebastian.kraus_at_tu-berlin.de
