From: Kraus, Sebastian (sebastian.kraus_at_tu-berlin.de)
Date: Thu Mar 23 2017 - 11:18:14 CDT
I am about to benchmark NAMD on an Intel x86-64 SMP HPC box with 20 CPU cores and four Nvidia GeForce GTX 1080 (Pascal) graphics/accelerator cards, and decided to use the provided apoa1 job example as a test case. The overall wall clock time for runs of the CUDA/SMP hybrid-parallelized namd2 binaries with 5 to 20 processors varies between 3.5 and 8 minutes.
I just observed that runs of the CUDA/SMP hybrid-parallelized namd2 binaries with a single GPU card show a significant wall clock time reduction, by a factor of about 10, compared to runs of the SMP-only parallelized namd2 binaries.
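For comparing such runs, the per-step benchmark lines that NAMD prints are often more useful than the total wall clock time, since they exclude startup and load-balancing phases. A minimal sketch, assuming the usual "Info: Benchmark time: ... days/ns" line format of NAMD 2.x logs (the sample log line and its numbers below are made up, not real measurements):

```shell
# Fabricated stand-in for a line from a real NAMD log (apoa1.log):
cat <<'EOF' > apoa1.log
Info: Benchmark time: 16 CPUs 0.0120 s/step 0.138889 days/ns 250.0 MB memory
EOF
# Field 8 is the days/ns value; invert it to report throughput in ns/day.
awk '/Benchmark time/ { printf "%.2f ns/day\n", 1/$8 }' apoa1.log
# -> 7.20 ns/day
```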
Unfortunately, the runtime of namd2 does not scale any further when additional cards are added. On the contrary, the wall clock time of NAMD runs increases slightly with each additional GPU device. This suggests that a growing amount of communication overhead from DeviceToHost and HostToDevice operations is generated when more than one card is used.
Next, I tested whether binding/manually mapping threads to CPU cores helps, but this approach led to an overall deterioration of performance and runtime.
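To illustrate the kind of manual mapping meant above: with 20 cores and 4 GPUs, one common scheme is four process groups of 4 worker threads plus 1 dedicated communication thread each. A small sketch that generates the corresponding +pemap/+commap strings; the layout itself is an assumption for this box, not NAMD's recommendation:

```shell
# Sketch: derive +pemap/+commap strings for a hypothetical layout of
# 20 cores split into 4 groups (one per GPU): 4 worker PEs + 1 comm thread each.
ndev=4          # number of GPUs (assumption for this box)
per=5           # cores per group: 4 workers + 1 communication thread
pemap=""; commap=""
for d in $(seq 0 $((ndev - 1))); do
  base=$((d * per))
  pemap="$pemap,$base-$((base + per - 2))"   # worker cores of this group
  commap="$commap,$((base + per - 1))"       # dedicated communication core
done
pemap=${pemap#,}; commap=${commap#,}
echo "+pemap $pemap +commap $commap"
# -> +pemap 0-3,5-8,10-13,15-18 +commap 4,9,14,19
# These strings would then be passed on the namd2 command line, e.g.:
#   ./namd2 +p16 +ppn 4 +pemap $pemap +commap $commap apoa1.namd
```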
Additionally, I profiled NAMD runs with nvprof/nvvp, but was not able to find any helpful information about the overall usage of GPU resources (memory/GPU utilization) on each card. Only a timeline of kernel runtimes can be extracted, and that does not answer the question of whether an accelerator card is fully or only partially loaded.
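As a complement to nvprof, sampling utilization with nvidia-smi (e.g. `nvidia-smi dmon -s u`) during a run gives at least a rough per-card load figure. A sketch that averages such samples per GPU; the log content below is fabricated stand-in data, not a real measurement:

```shell
# Stand-in for output captured with: nvidia-smi dmon -s u -c <N> > dmon.log
cat <<'EOF' > dmon.log
# gpu    sm   mem
    0    85    40
    1    12     5
    0    90    45
    1    10     4
EOF
# Average the SM utilization column per GPU to see whether each card is loaded.
awk '!/^#/ { sum[$1] += $2; n[$1]++ }
     END   { for (g in sum) printf "GPU %s avg sm%%: %.1f\n", g, sum[g]/n[g] }' dmon.log | sort
# -> GPU 0 avg sm%: 87.5
# -> GPU 1 avg sm%: 11.0
```

A large gap between cards (as in the fabricated numbers above) would indicate that work is not spread evenly across the four GPUs.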
Does anyone have a helpful hint for me? And how is load balancing implemented in NAMD (in the source code)?
Technische Universität Berlin
Institut für Chemie
Straße des 17. Juni 135
Mitarbeiter Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7
Tel.: +49 30 314 22263
Fax: +49 30 314 29309
This archive was generated by hypermail 2.1.6 : Mon Dec 31 2018 - 23:20:11 CST