Re: AW: namd 2.9 ibverbs issues

From: Michael Galloway (gallowaymd_at_ornl.gov)
Date: Mon Sep 10 2012 - 08:55:42 CDT

Good day,

I have looked through the mailing list a fair bit.

We do have IPoIB running; what is the syntax for running NAMD over that
interface? Do I simply specify a nodelist that contains the IPs/hostnames
of the IB fabric?
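
For illustration, something like this is what I have in mind (a sketch;
the hostnames, nodelist filename, and config file are placeholders for
our setup):

group main
host node01-ib
host node02-ib

charmrun +p24 ++nodelist nodelist.ib namd2 config.namd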

--- michael

On 09/10/2012 03:22 AM, Norman Geist wrote:
> Hi Michael,
>
> A search through the mailing list would have shown that the
> precompiled ibverbs binaries don't work on every InfiniBand setup. You can
> either compile your own or use IPoIB, which is easier. But as another of
> your requests on the list shows, compiling on your own seems to be
> problematic so far.
>
> If you have IPoIB installed (an interface called ib0 in ifconfig would
> indicate this), you can simply use that interface with the standard UDP
> or TCP versions. For proper performance, however, you should set the mode
> to connected and the MTU to 65520:
>
> (/sys/class/net/ib0/mode|mtu)
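>
> For example, a minimal sketch (assuming the interface is named ib0 and
> root access on each node; these are the standard IPoIB sysfs entries):
>
> echo connected > /sys/class/net/ib0/mode
> echo 65520 > /sys/class/net/ib0/mtu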
>
> This will also use the InfiniBand hardware and is compatible with the
> precompiled binaries. Personally, I have never observed a performance
> advantage of ibverbs over IPoIB (on 4 nodes, each with 2x 6-core Xeon
> + 2 Tesla C2050).
>
> Norman Geist.
>
>
>> -----Original Message-----
>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
>> Behalf Of Michael Galloway
>> Sent: Saturday, September 8, 2012 19:29
>> To: namd-l
>> Subject: namd-l: namd 2.9 ibverbs issues
>>
>> OK, I'm trying to get the ibverbs binary to run, and I'm struggling a
>> bit. I have a script (copied and modified from
>> http://www.isgtw.org/feed-item/open-science-grid-work-log-namd-pbs-and-infiniband-nersc-dirac).
>> My script is:
>>
>>
>> #!/bin/bash
>>
>> set -x
>> set -e
>>
>> # build a node list file based on the PBS
>> # environment in a form suitable for NAMD/charmrun
>>
>> nodefile=/tmp/$PBS_JOBID.nodelist
>> echo group main > $nodefile
>> nodes=$( cat $PBS_NODEFILE )
>> for node in $nodes; do
>>     echo host $node >> $nodefile
>> done
>>
>> # find the cluster's mpiexec
>> MPIEXEC=$(which mpiexec)
>> NAMD_HOME=/home/mgx/testing/namd-ibverbs/2.9-ibverbs
>>
>> # Tell charmrun to use all the available nodes, the nodelist built
>> # above, and the cluster's MPI.
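>> # (++mpiexec makes charmrun launch the node programs through the
>> # cluster's mpiexec, which is supplied as the ++remote-shell argument)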
>> CHARMARGS="+p48 ++nodelist $nodefile"
>> ${NAMD_HOME}/charmrun \
>>     ${CHARMARGS} ++verbose ++mpiexec ++remote-shell \
>>     ${MPIEXEC} ${NAMD_HOME}/namd2
>>
>> I run via Torque/Maui with:
>>
>> qsub -l walltime=06:00:00 -l nodes=2:ppn=12 dirac.s
>>
>> The job shows up in the queue, then fails with one of these outcomes:
>>
>> + MPIEXEC=/shared/openmpi-1.6.1/gcc/bin/mpiexec
>> + NAMD_HOME=/home/mgx/testing/namd-ibverbs/2.9-ibverbs
>> + CHARMARGS='+p48 ++nodelist /tmp/196.cmbcluster.nodelist'
>> + /home/mgx/testing/namd-ibverbs/2.9-ibverbs/charmrun +p48 ++nodelist
>> /tmp/196.cmbcluster.nodelist ++verbose ++mpiexec ++remote-shell
>> /shared/openmpi-1.6.1/gcc/bin/mpiexec /home/mgx/testing/namd-ibverbs/2.9-ibverbs/namd2
>> Charmrun> charmrun started...
>> Charmrun> mpiexec started
>> Charmrun> node programs all started
>> Charmrun> node programs all connected
>> Charmrun> started all node programs in 3.777 seconds.
>> Charmrun: error on request socket--
>> Socket closed before recv.
>> mpiexec: killing job...
>>
>> or
>>
>> + MPIEXEC=/shared/openmpi-1.6.1/gcc/bin/mpiexec
>> + NAMD_HOME=/home/mgx/testing/namd-ibverbs/2.9-ibverbs
>> + CHARMARGS='+p48 ++nodelist /tmp/197.cmbcluster.nodelist'
>> + /home/mgx/testing/namd-ibverbs/2.9-ibverbs/charmrun +p48 ++nodelist
>> /tmp/197.cmbcluster.nodelist ++verbose ++mpiexec ++remote-shell
>> /shared/openmpi-1.6.1/gcc/bin/mpiexec /home/mgx/testing/namd-ibverbs/2.9-ibverbs/namd2
>> Charmrun> charmrun started...
>> Charmrun> mpiexec started
>> Charmrun> node programs all started
>> Charmrun> error 0 attaching to node:
>> Timeout waiting for node-program to connect
>> mpiexec: killing job...
>>
>> Verbose errors and output for both jobs are here:
>>
>> http://pastebin.com/FpgY4NYi
>> http://pastebin.com/QXxELwY4
>>
>> Is my run script incorrect? The first error looks like more
>> segfaulting, but I cannot tell why the second run is failing.
>>
>> --- michael
>>
>>
>

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:22:03 CST