Spawning too many gpu processes on the first node

From: Robert Sawko (RSawko_at_uk.ibm.com)
Date: Thu Dec 01 2016 - 13:10:12 CST

Hi,

I am struggling with a strange issue. I am trying to run the GPU version
of NAMD 2.12b on multiple nodes over ibverbs, using the following script
(only the relevant parts are shown):

#BSUB -W 01:00
#BSUB -R "span[ptile=4]"
#BSUB -n 8

AFFINITY="+commap 0,8,112,120 +pemap 16-111:8.2"
charmrun +p48 ++ppn 6 \
    ++mpiexec ++remote-shell ./mympiexec \
    \${NAMDBIN} +devices 0,1,2,3 \${AFFINITY} \
    29.conf
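
To spell out what I expect from these settings, here is a minimal sketch
of the launch arithmetic. It assumes that +p48 with ++ppn 6 yields 48/6
processes and that LSF places exactly ptile=4 of them on each node; the
variable names are mine and purely illustrative.

```shell
#!/bin/bash
# Sanity check of the launch geometry (values copied from the script above).
total_pes=48        # charmrun +p48
pes_per_proc=6      # ++ppn 6
procs_per_node=4    # span[ptile=4]
gpus_per_node=4     # +devices 0,1,2,3

# Assumption: one process per group of ++ppn worker threads.
procs=$(( total_pes / pes_per_proc ))
# Assumption: LSF fills each node with exactly ptile processes.
nodes=$(( procs / procs_per_node ))
procs_per_gpu=$(( procs_per_node / gpus_per_node ))

echo "processes=${procs} nodes=${nodes} procs_per_gpu=${procs_per_gpu}"
```

Under those assumptions I expect 8 processes on 2 nodes, which matches
"#BSUB -n 8", with one process per GPU.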

NAMD correctly reports the bindings to each of the 8 GPUs. However, when
I run the nvidia-smi utility on the same nodes, I get perplexing output:

Node 1
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     32241    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2    87MiB |
|    1     32242    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2    87MiB |
|    2     32244    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2    86MiB |
|    3     32246    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2    88MiB |
+-----------------------------------------------------------------------------+
Node 2
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     15988    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2   140MiB |
|    0     15989    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2   115MiB |
|    1     15988    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2   122MiB |
|    1     15989    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2   133MiB |
|    2     15988    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2   122MiB |
|    2     15991    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2    86MiB |
|    3     15988    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2   122MiB |
|    3     15993    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2    87MiB |
+-----------------------------------------------------------------------------+
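
For anyone who wants to check their own nodes the same way, here is a
rough sketch that counts processes per GPU and GPUs per process from
this kind of table. The function name summarise_gpu_procs is made up for
illustration, and the awk pattern assumes the row layout shown above
(GPU index, then PID); other nvidia-smi versions may print extra columns,
in which case the field positions would need adjusting.

```shell
#!/bin/bash
# Summarise an nvidia-smi process table read from stdin.
# Prints "gpu <id> pids=<n>" per GPU and "pid <pid> gpus=<n>" per PID.
# Hypothetical helper; assumes rows of the form
#   | <gpu> <pid> <type> <process name> <usage> |
summarise_gpu_procs() {
    awk '
        # Keep only process rows: a bar, then two purely numeric fields.
        $1 == "|" && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
            if (!(($2, $3) in seen)) {
                seen[$2, $3] = 1
                per_gpu[$2]++    # distinct PIDs seen on this GPU
                per_pid[$3]++    # distinct GPUs used by this PID
            }
        }
        END {
            for (g in per_gpu) print "gpu", g, "pids=" per_gpu[g]
            for (p in per_pid) print "pid", p, "gpus=" per_pid[p]
        }
    ' | sort
}
```

It can be fed directly, e.g. "nvidia-smi | summarise_gpu_procs". With one
process per GPU I would expect every line to end in "=1".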

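This cannot be correct! On Node 2, PID 15988 appears on all four GPUs
and every GPU hosts two processes, whereas Node 1 shows one process per
GPU as expected. I also ran the STMV 1, 20 and 210 benchmarks to check
scaling and see none, so I am sure that something is wrong, but I cannot
see what I am doing wrong in my submission script.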

Please advise,
Robert

--
Dr Robert Sawko
Research Staff Member, IBM
Daresbury Laboratory
Keckwick Lane, Warrington
WA4 4AD
United Kingdom
--
Email (IBM): RSawko_at_uk.ibm.com
Email (STFC): robert.sawko_at_stfc.ac.uk
Phone (office): +44 (0) 1925 60 3967
Phone (mobile): +44 778 830 8522
Profile page:
http://researcher.watson.ibm.com/researcher/view.php?person=uk-RSawko
