Scaling behaviour of NAMD on hosts with GPU accelerators

From: Kraus, Sebastian (sebastian.kraus_at_tu-berlin.de)
Date: Thu Mar 23 2017 - 11:18:14 CDT

Next message: Jeff Comer: "Re: Scaling behaviour of NAMD on hosts with GPU accelerators"
Previous message: Wasut Pornpatcharapong: "Re: Explanation of parameters for NBTABLE's tabulated external file?"
Next in thread: Jeff Comer: "Re: Scaling behaviour of NAMD on hosts with GPU accelerators"
Reply: Jeff Comer: "Re: Scaling behaviour of NAMD on hosts with GPU accelerators"
Reply: Norman Geist: "AW: Scaling behaviour of NAMD on hosts with GPU accelerators"

Hello,

I am benchmarking NAMD on an Intel x86-64 SMP HPC node with 20 CPU cores and four Nvidia GeForce GTX 1080 (Pascal) graphics cards, using the bundled apoa1 example job as the test case. Wall clock times for runs of the CUDA/SMP hybrid-parallelized namd2 binaries with 5 to 20 processors range from 3.5 to 8 minutes.
Runs of the CUDA/SMP hybrid-parallelized namd2 binaries with a single GPU card reduce the wall clock time by a factor of about 10 in comparison to runs of the SMP-only parallelized namd2 binaries.
Unfortunately, the runtime of namd2 does not improve any further when more cards are added. Instead, the wall clock time increases slightly with each additional GPU device. This may indicate that device-to-host and host-to-device transfers generate a growing communication overhead once more than one card is in use.
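
To put a number on this scaling behaviour, the wall clock times can be converted into speedup and parallel efficiency relative to the single-GPU run. The following is only a minimal sketch; the function name scaling_table and the timings in the usage part are placeholders of mine, not measurements from the runs described above.

```python
def scaling_table(wall_times):
    """Return (n_gpus, speedup, efficiency) rows.

    wall_times maps the number of GPUs to the wall clock time in seconds.
    Speedup and efficiency are taken relative to the run with the
    smallest GPU count in the mapping.
    """
    base_n = min(wall_times)
    base_t = wall_times[base_n]
    rows = []
    for n in sorted(wall_times):
        speedup = base_t / wall_times[n]
        # Ideal scaling would give a speedup of n / base_n.
        efficiency = speedup / (n / base_n)
        rows.append((n, speedup, efficiency))
    return rows


if __name__ == "__main__":
    # Placeholder timings in seconds, NOT real benchmark results.
    placeholder = {1: 100.0, 2: 100.0, 4: 125.0}
    for n, speedup, efficiency in scaling_table(placeholder):
        print(f"{n} GPU(s): speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

An efficiency that falls well below 1 as cards are added, with a speedup at or below 1, would match the behaviour I describe.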
I then tested whether binding (manually mapping) threads to CPU cores helps, but this made performance and runtime worse across the board.
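
For reference, the pinned and unpinned variants of such a run can be generated with a small helper like the one below. This is a sketch under assumptions: the helper name namd_command and the config file name apoa1.namd are hypothetical, and it assumes the namd2 build in use honours the +p, +setcpuaffinity, +pemap and +devices options as described in the NAMD release notes. It only builds the argument list and does not start NAMD.

```python
def namd_command(config, n_threads, devices, pin_threads=False, binary="namd2"):
    """Build an argument list for one benchmark run (hypothetical helper).

    config      -- path to the NAMD configuration file
    n_threads   -- number of worker threads, passed as +p<N>
    devices     -- iterable of CUDA device indices, passed via +devices
    pin_threads -- if True, request CPU affinity and map threads to
                   cores 0 .. n_threads-1 (assumes the build supports
                   +setcpuaffinity and +pemap)
    """
    args = [binary, f"+p{n_threads}"]
    if pin_threads:
        args += ["+setcpuaffinity", "+pemap", f"0-{n_threads - 1}"]
    args += ["+devices", ",".join(str(d) for d in devices), config]
    return args


if __name__ == "__main__":
    # Print one command line per GPU count, with and without pinning.
    for gpus in ([0], [0, 1], [0, 1, 2, 3]):
        for pinned in (False, True):
            print(" ".join(namd_command("apoa1.namd", 20, gpus, pinned)))
```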
Additionally, I profiled NAMD runs with nvprof/nvvp, but could not find any useful information about the overall usage of GPU resources (memory, compute) on each card. Only a timeline of kernel runtimes can be extracted, and this does not tell me whether an accelerator card is fully or only partially loaded.
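
One alternative to the profiler timeline is to sample per-card utilization and memory usage with nvidia-smi while the job runs. The sketch below assumes nvidia-smi is on the PATH and that the driver reports utilization.gpu for these cards; fields that are not reported come back as None. The function names are my own.

```python
import subprocess

QUERY = "index,utilization.gpu,memory.used,memory.total"


def _to_int(field):
    """Convert a CSV field to int, or None if it is not numeric (e.g. "[N/A]")."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse 'nvidia-smi --format=csv,noheader,nounits' output into dicts."""
    rows = []
    for line in text.strip().splitlines():
        idx, util, used, total = (f.strip() for f in line.split(","))
        rows.append({
            "index": _to_int(idx),
            "util_percent": _to_int(util),
            "mem_used_mib": _to_int(used),
            "mem_total_mib": _to_int(total),
        })
    return rows


def sample_gpus():
    """Take one utilization sample from every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling sample_gpus() in a loop (for example once per second) during a namd2 run gives a rough picture of whether each card is kept busy, although this is a coarse average and not a replacement for kernel-level profiling.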
Does anyone have a useful hint for me? How is load balancing implemented in NAMD (in the source code)?

Best regards

Sebastian Kraus

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin

Staff member, IT team at the Institut für Chemie
Building C, Straße des 17. Juni 115, Room C7

Tel.: +49 30 314 22263
Fax: +49 30 314 29309
Email: sebastian.kraus_at_tu-berlin.de

This archive was generated by hypermail 2.1.6 : Sun Dec 31 2017 - 23:21:10 CST