NAMD GPU+ibverbs on multiple nodes: timeout problems

From: Robert Sawko (RSawko_at_uk.ibm.com)
Date: Thu Nov 17 2016 - 05:51:21 CST

Hello,

I am trying to run NAMD on a Firestone cluster, i.e. Power8 + 4xK80,
and I am having problems with multi-node runs. With the help of
Jim Phillips, I compiled an ibverbs, smp version of charm with XL, and a
Power-xLC version of NAMD. I can confirm that it runs on a single node
and that all four GPUs are utilised.

Our cluster uses LSF as its batch system, and rsh and even ssh between
compute nodes have been switched off by the admins. Jim therefore advised
using mpirun, so I am using OpenMPI. My understanding is that it serves
only to spawn the processes. Additionally, on the Power processor I set the
processor affinity of the PE and communication threads. I also added a
runscript that sets LD_LIBRARY_PATH. This is my script:

```
#BSUB -J namd
#BSUB -oo namd.stdout
#BSUB -eo namd.stderr
#BSUB -q panther
#BSUB -W 01:00
#BSUB -R "span[ptile=4]"
#BSUB -n 8
#BSUB -data /gpfs/fairthorpe/local/HCP004/pxs01/rrs59-pxs01/benchmarks/namd/benchmarks/namd_case

## This is data movement...
rm -rf ${HOME}/namd_on_2nodes 2> /dev/null
mkdir -p ${HOME}/namd_on_2nodes
cd ${HOME}/namd_on_2nodes
bstage in -all

NAMDBIN=/gpfs/panther/local/apps/ibm/namd/2016.09.13+cuda/namd2
AFFINITY="+commap 0,8,112,120 +pemap 16-111:8.2"

charmrun ++verbose \
    ++runscript ./runscript \
    +p48 ++ppn 6 \
    ++mpiexec ++remote-shell mpiexec \
    ${NAMDBIN} ++verbose +idlepoll +devices 0,1,2,3 ${AFFINITY} \
    29.conf > log.namd2
```
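As a sanity check on the layout, the sketch below expands the `+pemap` range and compares it with the thread counts requested above. It assumes the usual Charm++ reading of `L-U:S.R` (step through the range with stride `S` and take `R` consecutive cores at each step); `expand_map` is a hypothetical helper written for this illustration, not part of charmrun or NAMD.

```shell
#!/bin/bash
# Expand a Charm++-style map range L-U:S.R into explicit CPU ids.
# expand_map is a hypothetical helper used only for this check.
expand_map() {
    local lower=$1 upper=$2 stride=$3 run=$4
    local start offset
    for ((start = lower; start <= upper; start += stride)); do
        for ((offset = 0; offset < run && start + offset <= upper; offset++)); do
            printf '%d ' $((start + offset))
        done
    done
    printf '\n'
}

# +pemap 16-111:8.2 from the job script
pe_cores=$(expand_map 16 111 8 2)
pes_per_node=$(( $(echo "${pe_cores}" | wc -w) ))

# span[ptile=4] gives 4 processes per node; ++ppn 6 gives 6 PEs per process
expected_per_node=$((4 * 6))

echo "PE cores per node: ${pe_cores}"
echo "Mapped PEs per node: ${pes_per_node}, requested: ${expected_per_node}"
```

Under that reading the map yields 24 cores per node, which matches 4 processes x 6 PEs on each of the 2 nodes (+p48 in total), and the four entries in `+commap` give one communication thread per process.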

I get a timeout error from Charm:

Charmrun> charmrun started...
Charmrun> mpiexec started
Charmrun> node programs all started
Charmrun> error attaching to node '127.0.0.1':
Timeout waiting for node-program to connect

I am also attaching the standard output from NAMD:
https://ibm.ent.box.com/s/6lalwn87xau92oi1p6mq6l4fjfva2oqn

There's clearly a problem with the connection. I have found similar problems
on the mailing list, for instance here:
http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2011-2012/2403.html
but I am not sure whether they were resolved.

Please let me know if you can assist with this.

Best wishes,
Robert

--
Dr Robert Sawko
Research Staff Member, IBM
Daresbury Laboratory
Keckwick Lane, Warrington
WA4 4AD
United Kingdom
--
Email (IBM): RSawko_at_uk.ibm.com
Email (STFC): robert.sawko_at_stfc.ac.uk
Phone (office): +44 (0) 1925 60 3967
Phone (mobile): +44 778 830 8522
Profile page:
http://researcher.watson.ibm.com/researcher/view.php?person=uk-RSawko