performance of GPU calculations

From: Hao Dong (donghaonj_at_gmail.com)
Date: Sun Nov 06 2016 - 21:11:40 CST

Hello Everyone,

I am running NPT simulations of a system with 139,000 atoms (PBC, PME),
using the "CVS-2016-03-16 Linux-x86_64-multicore-CUDA" build of NAMD. I
tested the performance of two GPU machines, configured as follows:

machine_1,
CPU: 2 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, 10 cores on each cpu;
GPU: 2 GTX 1080 (driver version: 367.35)

machine_2,
CPU: 2 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 8 cores on each cpu;
GPU: 2 Titan-x (driver version: 352.30)

Memory is 64 GB DDR4 for both machines.

For machine_1, I tested the following three jobs (4-ns simulations for
each):
(1a) namd2 +idlepoll +devices 0,1 +p20 job.conf >& job.log
(1b) namd2 +idlepoll +devices 0,1 +p32 job.conf >& job.log
(1c) namd2 +idlepoll +devices 0,1 +p40 job.conf >& job.log

I got the following Benchmark time and CPU time for each job:
(1a) 0.135031 days/ns, 6.27 hours
(1b) 0.103945 days/ns, 5.33 hours
(1c) 0.095529 days/ns, 7.22 hours

For machine_2, the following command line was used, giving 0.104825
days/ns and 5.33 hours:
(2a) namd2 +idlepoll +devices 0,1 +p32 job.conf >& job.log
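
To put the Benchmark figures on the same scale as the reported run times,
they can be converted to ns/day and to a projected wall-clock time for a
4-ns run. The sketch below is only arithmetic on the numbers quoted above;
the function names are illustrative, and it assumes the Benchmark rate
holds for the whole run, which may not be true in practice.

```python
# Benchmark figures (days/ns) exactly as quoted in the post.
benchmarks = {"1a": 0.135031, "1b": 0.103945, "1c": 0.095529, "2a": 0.104825}
SIM_LENGTH_NS = 4.0  # each job was a 4-ns simulation


def ns_per_day(days_per_ns):
    """Invert days/ns to get throughput in ns/day."""
    return 1.0 / days_per_ns


def projected_hours(days_per_ns, length_ns):
    """Wall-clock hours for length_ns if the benchmark rate were sustained."""
    return days_per_ns * length_ns * 24.0


for job, dpn in benchmarks.items():
    print(f"({job}) {ns_per_day(dpn):6.2f} ns/day, "
          f"{projected_hours(dpn, SIM_LENGTH_NS):5.2f} h projected "
          f"for {SIM_LENGTH_NS:g} ns")
```

The projected hours can then be set beside the reported times for each job
to see whether the two measurements rank the jobs in the same order.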

Here are my questions:
(1) Judging by the CPU time, hyperthreading on machine_1 first improves
the performance of the GPU calculations (from +p20 to +p32), but using all
hyperthreads significantly degrades it (from +p32 to +p40). However, the
"Benchmark Time" reported in the NAMD output does not reflect this. I read
the post by Dr Donald Kinghorn evaluating different GPU cards
(https://www.pugetsystems.com/labs/hpc/NAMD-Molecular-Dynamics-Performance-on-NVIDIA-GTX-1080-and-1070-GPU-815/).
His assessment is based on the "Benchmark Time", but that measure does not
seem to hold in my case.
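
One way to compare the two measures directly is to pull both out of each
job.log. The sketch below is a rough illustration: the two line formats it
matches ("Benchmark time: ... s/step ... days/ns" and "WallClock: ...
CPUTime: ...") are assumptions about the NAMD log layout, not lines quoted
in this post, and the sample text is invented purely to exercise the
parser. Adjust the patterns to whatever your own log actually contains.

```python
import re

# ASSUMED log line formats; verify against your own job.log before relying on them.
BENCH_RE = re.compile(
    r"Benchmark time:.*?([0-9.]+)\s+s/step\s+([0-9.]+)\s+days/ns")
WALL_RE = re.compile(r"WallClock:\s*([0-9.]+)\s+CPUTime:\s*([0-9.]+)")


def summarize_log(text):
    """Collect every benchmark days/ns value plus the final timing summary."""
    days_per_ns = [float(m.group(2)) for m in BENCH_RE.finditer(text)]
    wall = WALL_RE.search(text)
    return {
        "days_per_ns": days_per_ns,
        "wallclock_s": float(wall.group(1)) if wall else None,
        "cputime_s": float(wall.group(2)) if wall else None,
    }


# Hypothetical sample, NOT real output from the runs described in this post.
sample = (
    "Info: Benchmark time: 20 CPUs 0.0233 s/step 0.135 days/ns 900.0 MB memory\n"
    "WallClock: 100.5  CPUTime: 98.2  Memory: 900.0 MB\n"
)
print(summarize_log(sample))
```

Comparing the wall-clock value with the projection from days/ns for the
same job would show whether the early benchmark window is representative
of the full run.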

(2) On both machines, the GPU utilization is always ~30%. How can I improve
the GPU utilization?
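
For reference, utilization can be sampled per GPU while a job runs, so that
different +p settings can be compared on the same footing. The sketch below
only measures; it does not improve anything. It assumes nvidia-smi is on
the PATH and that the driver reports utilization.gpu for these cards; the
helper names are illustrative.

```python
import subprocess

# nvidia-smi query printing one "index, utilization" CSV row per GPU.
QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"]


def parse_utilization(csv_text):
    """Turn rows such as '0, 31' into {gpu_index: utilization_percent}."""
    result = {}
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        result[int(index)] = int(util)
    return result


def sample_utilization():
    """Run nvidia-smi once and return the parsed utilization per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(out.stdout)


# Usage while a namd2 job is running (requires an NVIDIA driver):
#     print(sample_utilization())
```

Calling sample_utilization() periodically during each of the runs would
give an average to compare across thread counts.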

(3) Comparing job 1b with job 2a, the performance of the GTX 1080 and the
Titan X is similar (though machine_2 without hyperthreading, using only 16
cores, might be even faster?), and the number of CPU cores does not seem
to be critical. Are there any suggestions for building a better machine
for running classical MD simulations of systems with 100K-200K atoms?
(The K80 is too expensive.)

Any comments are highly appreciated!

Hao
