namd 2.9 ibverbs issues

From: Michael Galloway (gallowaymd_at_ornl.gov)
Date: Sat Sep 08 2012 - 12:29:12 CDT

ok, i'm trying to get the ibverbs binary to run, and i'm struggling a
bit. i'm using a script copied and modified from here:
http://www.isgtw.org/feed-item/open-science-grid-work-log-namd-pbs-and-infiniband-nersc-dirac
my script is:

#!/bin/bash

set -x
set -e

# build a node list file based on the PBS
# environment in a form suitable for NAMD/charmrun

nodefile=/tmp/$PBS_JOBID.nodelist
echo group main > $nodefile
nodes=$( cat $PBS_NODEFILE )
for node in $nodes; do
   echo host $node >> $nodefile
done

# find the cluster's mpiexec
MPIEXEC=$(which mpiexec)
NAMD_HOME=/home/mgx/testing/namd-ibverbs/2.9-ibverbs

# Tell charmrun to use all the available nodes, the nodelist built above
# and the cluster's MPI.
CHARMARGS="+p48 ++nodelist $nodefile"
${NAMD_HOME}/charmrun \
${CHARMARGS} ++verbose ++mpiexec ++remote-shell \
${MPIEXEC} ${NAMD_HOME}/namd2

i run via torque/maui with:

qsub -l walltime=06:00:00 -l nodes=2:ppn=12 dirac.s
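
one thing to check here, as an assumption and not a diagnosis: with
nodes=2:ppn=12, torque normally writes one line per allocated slot to
$PBS_NODEFILE (24 lines for this request), while the script hard-codes
+p48. a minimal sketch that derives the process count from the node file
instead is below. build_nodelist is a hypothetical helper name, not part
of NAMD or charmrun, and whether the count mismatch has anything to do
with the failures is untested.

```shell
#!/bin/bash
# Sketch only: derive the charmrun process count from the PBS node file
# instead of hard-coding +p48. build_nodelist is a hypothetical helper.

# build_nodelist SRC DST
#   Writes a charmrun nodelist to DST from the hostnames in SRC (one per
#   line) and prints the number of host entries written.
build_nodelist() {
    local src=$1 dst=$2 count=0 node
    echo "group main" > "$dst"
    while read -r node || [ -n "$node" ]; do
        [ -n "$node" ] || continue          # skip blank lines
        echo "host $node" >> "$dst"
        count=$((count + 1))
    done < "$src"
    echo "$count"
}

# Only runs inside a PBS job, where PBS_NODEFILE and PBS_JOBID are set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    nodefile=/tmp/$PBS_JOBID.nodelist
    nprocs=$(build_nodelist "$PBS_NODEFILE" "$nodefile")
    CHARMARGS="+p${nprocs} ++nodelist $nodefile"
    echo "$CHARMARGS"
fi
```

with the request above this would be expected to yield +p24, assuming the
node file really has one line per slot on this cluster.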

the job shows in the queue, then fails with one of these outcomes:

+ MPIEXEC=/shared/openmpi-1.6.1/gcc/bin/mpiexec
+ NAMD_HOME=/home/mgx/testing/namd-ibverbs/2.9-ibverbs
+ CHARMARGS='+p48 ++nodelist /tmp/196.cmbcluster.nodelist'
+ /home/mgx/testing/namd-ibverbs/2.9-ibverbs/charmrun +p48 ++nodelist
/tmp/196.cmbcluster.nodelist ++verbose ++mpiexec ++remote-shell
/shared/openmpi-1.6.1/gcc/bin/mpiexec
/home/mgx/testing/namd-ibverbs/2.9-ibverbs/namd2
Charmrun> charmrun started...
Charmrun> mpiexec started
Charmrun> node programs all started
Charmrun> node programs all connected
Charmrun> started all node programs in 3.777 seconds.
Charmrun: error on request socket--
Socket closed before recv.
mpiexec: killing job...

or

+ MPIEXEC=/shared/openmpi-1.6.1/gcc/bin/mpiexec
+ NAMD_HOME=/home/mgx/testing/namd-ibverbs/2.9-ibverbs
+ CHARMARGS='+p48 ++nodelist /tmp/197.cmbcluster.nodelist'
+ /home/mgx/testing/namd-ibverbs/2.9-ibverbs/charmrun +p48 ++nodelist
/tmp/197.cmbcluster.nodelist ++verbose ++mpiexec ++remote-shell
/shared/openmpi-1.6.1/gcc/bin/mpiexec
/home/mgx/testing/namd-ibverbs/2.9-ibverbs/namd2
Charmrun> charmrun started...
Charmrun> mpiexec started
Charmrun> node programs all started
Charmrun> error 0 attaching to node:
Timeout waiting for node-program to connect
mpiexec: killing job...

verbose error and outputs for the jobs are here:

http://pastebin.com/FpgY4NYi
http://pastebin.com/QXxELwY4

is my run script incorrect? the first error looks like more segfaulting,
but i cannot tell why the second run is failing.

--- michael
