Re: namd 2.9 ibverbs issues

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Mon Sep 10 2012 - 02:22:52 CDT

Hi Michael,

A search through the mailing list would have shown that the precompiled
ibverbs binaries don't work on every InfiniBand setup. You can either
compile NAMD yourself or use IPoIB, which is easier to set up. But as
another of your requests on the list shows, compiling it yourself seems to
be a problem so far.

If you have IPoIB installed (an interface called ib0 in the ifconfig output
would indicate this), you can simply use that interface with the standard
UDP or TCP versions. For proper performance, though, you should set the
mode to connected and the MTU to 65520.

(see /sys/class/net/ib0/mode and /sys/class/net/ib0/mtu)
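
As a rough sketch of how these two settings could be checked, assuming the
interface really is called ib0 and exposes the sysfs files named above (the
function name ipoib_status is only an illustration, not part of NAMD or of
any InfiniBand tool):

```shell
#!/bin/bash
# Hypothetical helper: report whether an IPoIB interface is in connected
# mode with an MTU of 65520. The sysfs directory is a parameter so that
# another interface can be checked; ib0 is the name used in this mail.
ipoib_status() {
    local dev_dir="${1:-/sys/class/net/ib0}"
    local mode mtu

    if [ ! -r "$dev_dir/mode" ] || [ ! -r "$dev_dir/mtu" ]; then
        echo "missing"
        return 1
    fi

    mode=$(cat "$dev_dir/mode")
    mtu=$(cat "$dev_dir/mtu")

    if [ "$mode" = "connected" ] && [ "$mtu" = "65520" ]; then
        echo "ok"
    else
        echo "needs-tuning mode=$mode mtu=$mtu"
        return 2
    fi
}

# Changing the settings needs root and has to be done on every node:
#   echo connected > /sys/class/net/ib0/mode
#   echo 65520     > /sys/class/net/ib0/mtu
```

Calling ipoib_status without arguments on a compute node prints "ok" when
both values are already set as recommended. The values written this way do
not survive a reboot unless the distribution's network configuration sets
them as well.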

This also uses the InfiniBand fabric and is compatible with the precompiled
binaries. Personally, I have never observed a performance advantage of
ibverbs over IPoIB (4x (2x Xeon 6-Core + 2 Tesla C2050)).
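
One way to point the standard UDP or TCP build at ib0 is to put the IPoIB
host names into the charmrun nodelist. The following is only a sketch under
assumptions: it supposes that every node has a second name that resolves to
its ib0 address, formed here by appending a made-up "-ib" suffix, and the
function name make_ipoib_nodelist is hypothetical. The "group main" and
"host" lines follow the nodelist format used in the script quoted below.

```shell
#!/bin/bash
# Hypothetical helper: turn a PBS node file (one host name per line) into a
# charmrun nodelist whose entries use the IPoIB names of the nodes.
# Assumption: "<hostname><suffix>" resolves to the ib0 address of the node;
# the default suffix "-ib" is an example, not a standard.
make_ipoib_nodelist() {
    local pbs_nodefile="$1"
    local suffix="${2:--ib}"
    local node

    echo "group main"
    # The extra test keeps a last line that lacks a trailing newline.
    while IFS= read -r node || [ -n "$node" ]; do
        if [ -n "$node" ]; then
            echo "host ${node}${suffix}"
        fi
    done < "$pbs_nodefile"
}

# Possible use inside a job script:
#   make_ipoib_nodelist "$PBS_NODEFILE" > /tmp/$PBS_JOBID.nodelist
```

If the cluster has no separate names for the IPoIB addresses, the ib0 IP
addresses themselves could be written into the host lines instead.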

Norman Geist.

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Michael Galloway
> Sent: Saturday, 8 September 2012 19:29
> To: namd-l
> Subject: namd-l: namd 2.9 ibverbs issues
>
> ok, i'm trying to get the ibverbs binary to run, and i'm struggling a
> bit. i have a script i'm using (copied and modified from here:
> http://www.isgtw.org/feed-item/open-science-grid-work-log-namd-pbs-and-infiniband-nersc-dirac)
> my script is:
>
>
> #!/bin/bash
>
> set -x
> set -e
>
> # build a node list file based on the PBS
> # environment in a form suitable for NAMD/charmrun
>
> nodefile=/tmp/$PBS_JOBID.nodelist
> echo group main > $nodefile
> nodes=$( cat $PBS_NODEFILE )
> for node in $nodes; do
> echo host $node >> $nodefile
> done
>
> # find the cluster's mpiexec
> MPIEXEC=$(which mpiexec)
> NAMD_HOME=/home/mgx/testing/namd-ibverbs/2.9-ibverbs
>
> # Tell charmrun to use all the available nodes, the nodelist built
> # above, and the cluster's MPI.
> CHARMARGS="+p48 ++nodelist $nodefile"
> ${NAMD_HOME}/charmrun \
> ${CHARMARGS} ++verbose ++mpiexec ++remote-shell \
> ${MPIEXEC} ${NAMD_HOME}/namd2
>
> i run via torque/maui with:
>
> qsub -l walltime=06:00:00 -l nodes=2:ppn=12 dirac.s
>
> the job shows in the queue, then fails with one of these outcomes:
>
> + MPIEXEC=/shared/openmpi-1.6.1/gcc/bin/mpiexec
> + NAMD_HOME=/home/mgx/testing/namd-ibverbs/2.9-ibverbs
> + CHARMARGS='+p48 ++nodelist /tmp/196.cmbcluster.nodelist'
> + /home/mgx/testing/namd-ibverbs/2.9-ibverbs/charmrun +p48 ++nodelist
> /tmp/196.cmbcluster.nodelist ++verbose ++mpiexec ++remote-shell
> /shared/openmpi-1.6.1/gcc/bin/mpiexec
> /home/mgx/testing/namd-ibverbs/2.9-ibverbs/namd2
> Charmrun> charmrun started...
> Charmrun> mpiexec started
> Charmrun> node programs all started
> Charmrun> node programs all connected
> Charmrun> started all node programs in 3.777 seconds.
> Charmrun: error on request socket--
> Socket closed before recv.
> mpiexec: killing job...
>
> or
>
> + MPIEXEC=/shared/openmpi-1.6.1/gcc/bin/mpiexec
> + NAMD_HOME=/home/mgx/testing/namd-ibverbs/2.9-ibverbs
> + CHARMARGS='+p48 ++nodelist /tmp/197.cmbcluster.nodelist'
> + /home/mgx/testing/namd-ibverbs/2.9-ibverbs/charmrun +p48 ++nodelist
> /tmp/197.cmbcluster.nodelist ++verbose ++mpiexec ++remote-shell
> /shared/openmpi-1.6.1/gcc/bin/mpiexec
> /home/mgx/testing/namd-ibverbs/2.9-ibverbs/namd2
> Charmrun> charmrun started...
> Charmrun> mpiexec started
> Charmrun> node programs all started
> Charmrun> error 0 attaching to node:
> Timeout waiting for node-program to connect
> mpiexec: killing job...
>
> verbose error and outputs for the jobs are here:
>
> http://pastebin.com/FpgY4NYi
> http://pastebin.com/QXxELwY4
>
> is my run script incorrect? the first error looks like more
> segfaulting, but i cannot tell why the second run is failing.
>
> --- michael
>
>
