SMP instructions for multi-node calculation on POWER8 + P100

From: Adrien Cerdan (acerdan_at_unistra.fr)
Date: Wed Nov 22 2017 - 10:09:51 CST

Dear all,

I recently got access to a HPC with POWER8+ NVIDIA Tesla P100 (S822LC).

In order to benefit the most from the latest NAMD features, I compiled a
nightly build of "Linux-POWER-g++-verbs-smp-cuda".

My system is about 200,000 atoms, and on one node I got decent
performance (0.0065 s/step, ~27 ns/day) using all four GPUs and 2
threads/core, in an LSF environment:

#1 node:

#BSUB -n 1
#BSUB -R "span[ptile=1]"
#BSUB -R 'rusage[ngpus_shared=1]'
$charmrun_bin ++verbose ++scalable-start ++mpiexec ++p 40 ++ppn 40 \
  $namd2_bin +idlepoll +setcpuaffinity +pemap 0-159:8.2 \
  prod01.namd > prod01.out

0.0065 s/step

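For context, my reading of the affinity flags: +pemap 0-159:8.2 walks the
160 hardware threads in steps of 8, one SMT8 group per physical core
(assuming the 20-core layout of the S822LC), and takes the first 2 threads
of each group, i.e. 2 threads on each of the 20 cores, 40 PEs in total,
which matches ++p 40 / ++ppn 40. A purely illustrative bash sketch that
expands the pattern:

# Illustrative only: expand the Charm++ "+pemap 0-159:8.2" pattern
# (range 0-159, stride 8, run of 2) into an explicit thread list.
pemap=""
for base in $(seq 0 8 159); do        # one SMT8 group per physical core
  pemap+="${base},$((base + 1)),"     # first 2 hardware threads of the group
done
echo "${pemap%,}"                     # 0,1,8,9,16,17,...,152,153
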
In a first attempt I just increased the number of nodes:

#2 nodes:

#BSUB -n 2
#BSUB -R "span[ptile=1]"
#BSUB -R 'rusage[ngpus_shared=1]'
$charmrun_bin ++verbose ++scalable-start ++mpiexec ++p 80 ++ppn 40 \
  $namd2_bin +idlepoll +setcpuaffinity +pemap 0-159:8.2 \
  prod01.namd > prod01.out

0.0075 s/step
#4 nodes:

#BSUB -n 4
#BSUB -R "span[ptile=1]"
#BSUB -R 'rusage[ngpus_shared=1]'
$charmrun_bin ++verbose ++scalable-start ++mpiexec ++p 160 ++ppn 40 \
  $namd2_bin +idlepoll +setcpuaffinity +pemap 0-159:8.2 \
  prod01.namd > prod01.out

0.0060 s/step
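In these runs I scaled ++p with the node count by hand. As I understand the
charmrun options, ++ppn is the number of worker threads per process and ++p
the total, so with one process per node ++p = nodes x ppn. A minimal sketch
of deriving it from the LSF allocation, assuming LSB_DJOB_HOSTFILE lists one
line per allocated slot:

# Sketch: keep ++p consistent with the LSF allocation (one process per node,
# 40 worker threads per process, as in the runs above).
PPN=40
NODES=$(sort -u "$LSB_DJOB_HOSTFILE" | wc -l)     # unique hosts in the job
P=$(( NODES * PPN ))
$charmrun_bin ++verbose ++scalable-start ++mpiexec ++p $P ++ppn $PPN \
  $namd2_bin +idlepoll +setcpuaffinity +pemap 0-159:8.2 \
  prod01.namd > prod01.out
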
Unsurprisingly, this leads to poor performance... According to the NAMD
documentation, I should go for "one process per GPU and as many threads
as available cores, reserving one core per process for the communication
thread". So I tried the following:

#4 nodes:

#BSUB -n 8
#BSUB -R "span[ptile=4]"
#BSUB -R 'rusage[ngpus_shared=1]'
$charmrun_bin ++verbose ++scalable-start ++mpiexec ++p 72 ++ppn 9 \
  $namd2_bin +idlepoll +setcpuaffinity +pemap 0-63:8.2,80-143:8.2 \
  +commap 64,72,144,152 prod01.namd > prod01.out

0.1 s/step
But at the end of the day the performance is more than an order of magnitude worse ...
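
For what it is worth, here is the arithmetic I was aiming for with that
layout, assuming 20 cores and 4 GPUs per node; I may well have translated
it into +pemap/+commap incorrectly:

# Purely illustrative: "one process per GPU, as many threads as available
# cores, one core per process reserved for the communication thread",
# assuming 20 cores (SMT8) and 4 GPUs per S822LC node.
CORES_PER_NODE=20
GPUS_PER_NODE=4
PROCS_PER_NODE=$GPUS_PER_NODE                            # one process per GPU
CORES_PER_PROC=$(( CORES_PER_NODE / PROCS_PER_NODE ))    # 5 cores per process
WORKER_CORES=$(( CORES_PER_PROC - 1 ))                   # 4 left after the comm core
echo "worker cores per process: $WORKER_CORES"
echo "PEs per process at 2 threads/core: $(( WORKER_CORES * 2 ))"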

I am obviously doing something wrong with the SMP settings in a
multi-node situation. I am also new to LSF, so my resource requests
may be wrong as well.

Since I saw in James Phillips' presentation the impressive numbers you
reached on Oak Ridge SUMMITDEV, I was wondering if I could benefit from
your experience on this particular architecture?

Is it possible that I fail to scale across multiple nodes because of the
size of my system (200,000 atoms), which is smaller than the 1M-atom
benchmark presented by James Phillips?

Thanks,
Adrien

-- 
*Adrien Cerdan*
PhD student
Laboratoire d’Ingénierie des Fonctions Moléculaires
ISIS, Université de Strasbourg
8 allée G. Monge - BP 70028
67083 Strasbourg Cedex - France
acerdan_at_unistra.fr
