From: Robert Sawko (RSawko_at_uk.ibm.com)
Date: Thu Nov 17 2016 - 05:51:21 CST
Hello,
I am trying to run NAMD on a Firestone cluster, i.e. Power8 + 4xK80,
and I am having problems with multi-node runs. With the help of
Jim Phillips, I compiled an ibverbs, SMP build of Charm++ with XL and the
Power-xLC version of NAMD. I can confirm that I can run on a single node
and observe utilisation of all four GPUs.
Our cluster uses LSF as its batch system, and rsh and even ssh between
compute nodes have been switched off by the admins. Jim therefore advised
using mpirun, so I am using OpenMPI. My understanding is that it is used
only to spawn the processes. Additionally, on Power processors there are
processor affinity settings for the PE and communication threads. I also
added a runscript that sets LD_LIBRARY_PATH. This is my script:
```
#BSUB -J namd
#BSUB -oo namd.stdout
#BSUB -eo namd.stderr
#BSUB -q panther
#BSUB -W 01:00
#BSUB -R "span[ptile=4]"
#BSUB -n 8
#BSUB -data /gpfs/fairthorpe/local/HCP004/pxs01/rrs59-pxs01/benchmarks/namd/benchmarks/namd_case
## This is data movement...
rm -rf ${HOME}/namd_on_2nodes 2> /dev/null
mkdir -p ${HOME}/namd_on_2nodes
cd ${HOME}/namd_on_2nodes
bstage in -all
NAMDBIN=/gpfs/panther/local/apps/ibm/namd/2016.09.13+cuda/namd2
AFFINITY="+commap 0,8,112,120 +pemap 16-111:8.2"
charmrun ++verbose \
++runscript ./runscript \
+p48 ++ppn 6 \
++mpiexec ++remote-shell mpiexec \
${NAMDBIN} ++verbose +idlepoll +devices 0,1,2,3 ${AFFINITY} \
29.conf > log.namd2
```
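For reference, the process counts requested in the script can be cross-checked with a few lines of shell arithmetic. This is only a sketch: the variable names are made up for illustration, and the values are copied by hand from the `#BSUB` and `charmrun` lines above rather than read from LSF. It assumes the usual Charm++ SMP layout of one communication thread per process.

```shell
#!/bin/sh
# Values copied by hand from the job script above (variable names are illustrative).
total_pes=48        # charmrun +p48
pes_per_proc=6      # charmrun ++ppn 6
lsf_slots=8         # #BSUB -n 8
slots_per_node=4    # #BSUB -R "span[ptile=4]"
commap="0,8,112,120"

procs=$(( total_pes / pes_per_proc ))      # number of SMP processes
nodes=$(( lsf_slots / slots_per_node ))    # number of hosts requested from LSF
procs_per_node=$(( procs / nodes ))        # processes sharing one host

# Count the comma-separated +commap entries (assumed: one per process on a node).
commap_entries=$(( $(printf '%s\n' "$commap" | tr ',' '\n' | wc -l) ))

echo "processes=${procs} nodes=${nodes} processes_per_node=${procs_per_node}"
echo "commap_entries=${commap_entries}"
```

With these values the sketch reports 8 processes over 2 nodes, i.e. 4 processes per node, which matches the four `+commap` entries and the four GPUs per node.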
I get a timeout error from Charm:
Charmrun> charmrun started...
Charmrun> mpiexec started
Charmrun> node programs all started
Charmrun> error attaching to node '127.0.0.1':
Timeout waiting for node-program to connect
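Since the launcher is only meant to spawn the processes, one way to narrow this kind of failure down would be to check, inside the same LSF job, that `mpiexec` on its own starts processes on both hosts. A minimal sketch, assuming OpenMPI picks up the LSF allocation; `count_hosts` is a hypothetical helper, not part of NAMD, Charm++ or LSF:

```shell
#!/bin/sh
# Hypothetical helper: summarise hostnames read from stdin as "host=count".
count_hosts() {
    sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Intended use inside the batch job (left as a comment so the sketch is inert here):
#   mpiexec -n 8 hostname | count_hosts
```

With `-n 8` and `span[ptile=4]`, two hosts with a count of 4 each would be the expected outcome; a single host with a count of 8 would suggest that the launcher is not seeing the LSF allocation.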
I am also attaching the standard output from NAMD:
https://ibm/ent/box.com/s/6lalwn87xau92oi1p6mq6l4fjfva2oqn
There is clearly a connection problem. I have found similar reports
on the mailing list, for instance here:
http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2011-2012/2403.html
but I am not sure whether they were resolved.
Please let me know if you can assist with this.
Best wishes,
Robert
--
Dr Robert Sawko
Research Staff Member, IBM
Daresbury Laboratory
Keckwick Lane, Warrington WA4 4AD
United Kingdom
--
Email (IBM): RSawko_at_uk.ibm.com
Email (STFC): robert.sawko_at_stfc.ac.uk
Phone (office): +44 (0) 1925 60 3967
Phone (mobile): +44 778 830 8522
Profile page: http://researcher.watson.ibm.com/researcher/view.php?person=uk-RSawko
--
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU