Re: Scaling behaviour of NAMD on hosts with GPU accelerators

From: Jeff Comer (jeffcomer_at_gmail.com)
Date: Thu Mar 23 2017 - 12:24:23 CDT

Some NAMD performance graphs, which may or may not be helpful, are
available on my website:

http://jeffcomer.us/downloads.html

------------------------------------------
Jeffrey Comer, PhD
Assistant Professor
Institute of Computational Comparative Medicine
Nanotechnology Innovation Center of Kansas State
Kansas State University
Office: P-213 Mosier Hall
Phone: 785-532-6311
Website: http://jeffcomer.us

On Thu, Mar 23, 2017 at 11:18 AM, Kraus, Sebastian
<sebastian.kraus_at_tu-berlin.de> wrote:
> Hello,
>
> I am about to benchmark NAMD on an Intel x86-64 SMP HPC box equipped with
> 20 CPU cores and four Nvidia GeForce GTX 1080 (Pascal) graphics
> controllers/accelerator cards, and decided to use the provided apoa1 job
> example as the test case. The overall wall clock time for runs of the
> CUDA/SMP hybrid-parallelized namd2 binary with 5 to 20 processors varies
> between 3.5 and 8 minutes.
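
For context, a benchmark of this kind is typically launched with the
CUDA/SMP (multicore-CUDA) build of namd2 using the standard +p and +devices
options; the binary path, core count, and device list below are only
illustrative, not the exact commands used in these runs:

    # SMP-only reference run on 20 cores
    ./namd2 +p20 apoa1/apoa1.namd > apoa1_cpu.log

    # CUDA/SMP run on 20 cores using all four GTX 1080 cards
    ./namd2 +p20 +devices 0,1,2,3 apoa1/apoa1.namd > apoa1_4gpu.log

The final "WallClock:" line that namd2 prints in each log gives the total
wall clock time for the comparison.
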
> I observed that runs of the CUDA/SMP hybrid-parallelized namd2 binary on a
> single GPU card show a significant wall clock time reduction, by a factor
> of about 10, compared to runs of the SMP-only parallelized namd2 binary.
> Unfortunately, the runtime of namd2 no longer scales when further
> accelerator cards are added; in fact, the wall clock time increases
> slightly with each additional GPU device. This suggests that using more
> than one card generates additional communication overhead from
> device-to-host and host-to-device transfers.
> I then tested whether binding/manually mapping threads to CPU cores
> helps, but this approach only degrades overall performance and runtime.
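
For reference, thread binding in NAMD is normally controlled through the
Charm++ affinity options rather than external tools; a minimal sketch for a
20-core host, with an illustrative core map that would have to be adapted
to the actual CPU topology, looks like:

    # Pin the 20 worker threads to cores 0-19 (multicore-CUDA build)
    ./namd2 +p20 +setcpuaffinity +pemap 0-19 +devices 0,1,2,3 \
        apoa1/apoa1.namd > apoa1_affinity.log

If the core map does not match the physical topology (for example, two
threads landing on one hyper-threaded core), manual binding can easily make
performance worse rather than better.
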
> Additionally, I profiled NAMD runs with nvprof/nvvp, but could not find
> any helpful information about the overall usage of GPU resources
> (memory/compute load) on each card. Only a timeline of kernel runtimes can
> be extracted, and this does not answer the question of whether an
> accelerator card is fully or only partially loaded.
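
For the question of how heavily each card is loaded, overall utilization is
easier to see with nvidia-smi than with an nvprof kernel timeline; the
query fields and sampling interval below are just one possible choice:

    # Sample GPU/memory utilization of all cards once per second during a run
    nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,memory.used \
        --format=csv -l 1 > gpu_load.csv

Low GPU utilization percentages on the additional cards during a multi-GPU
run would be consistent with the communication-bound behaviour described
above.
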
> Does anyone have a helpful hint for me? And how is load balancing
> implemented in the NAMD source code?
>
>
> Best regards
>
>
> Sebastian Kraus
>
> Technische Universität Berlin
> Fakultät II
> Institut für Chemie
> Sekretariat C3
> Straße des 17. Juni 135
> 10623 Berlin
>
> Staff member, IT team at the Institut für Chemie
> Building C, Straße des 17. Juni 115, Room C7
>
>
> Tel.: +49 30 314 22263
> Fax: +49 30 314 29309
> Email: sebastian.kraus_at_tu-berlin.de
>
