AW: AW: namd 2.9 ibverbs issues

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Mon Sep 10 2012 - 10:52:43 CDT

> -----Original Message-----
> From: Michael Galloway [mailto:gallowaymd_at_ornl.gov]
> Sent: Monday, September 10, 2012 15:56
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: AW: namd-l: namd 2.9 ibverbs issues
>
> good day,

Hi,

>
> I have looked through the mailing list a fair bit.
>
> We do have IPoIB running; what is the syntax for running NAMD over that
> interface? Simply specify a nodelist that contains the IPs/hostnames of
> the IB fabric?

Yes.
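For instance, a minimal sketch (node1-ib and node2-ib are placeholder hostnames that are assumed to resolve to the ib0 addresses; paths are placeholders too):

```shell
# Write a nodelist whose hosts resolve to the IPoIB (ib0) addresses.
# node1-ib / node2-ib are placeholder hostnames.
cat > /tmp/namd.nodelist <<'EOF'
group main
host node1-ib
host node2-ib
EOF

# Then launch the standard (TCP/UDP) NAMD build over that list, e.g.:
# charmrun +p24 ++nodelist /tmp/namd.nodelist /path/to/namd2 run.conf
```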

Norman.

>
> --- michael
>
> On 09/10/2012 03:22 AM, Norman Geist wrote:
> > Hi Michael,
> >
> > A search through the mailing list would have shown that the
> > precompiled ibverbs binaries don't work on every InfiniBand setup. You
> > can either compile your own build or use IPoIB, which is easier. But as
> > another of your requests on the list shows, compiling on your own seems
> > to be a problem so far.
> >
> > If you have IPoIB installed (an ib0 interface shown by ifconfig would
> > indicate that), you can simply use this interface with the standard UDP
> > or TCP versions. But then, for proper performance, you should set the
> > mode to connected and the MTU to 65520.
> >
> > (/sys/class/net/ib0/mode|mtu)
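> > Concretely, that amounts to something like the following (a sketch;
> > run as root on every node, and the interface name ib0 is an assumption):

```shell
# Switch IPoIB from datagram to connected mode, then raise the MTU.
# ib0 is an assumed interface name; adjust to your fabric.
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520

# Verify the settings took effect:
cat /sys/class/net/ib0/mode   # expect: connected
cat /sys/class/net/ib0/mtu    # expect: 65520
```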
> >
> > This will also use the InfiniBand hardware and is compatible with the
> > precompiled binaries. Personally, I have never observed a performance
> > advantage of ibverbs over IPoIB (on 4x (2x Xeon 6-Core + 2 Tesla C2050)).
> >
> > Norman Geist.
> >
> >
> >> -----Original Message-----
> >> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> >> Behalf Of Michael Galloway
> >> Sent: Saturday, September 8, 2012 19:29
> >> To: namd-l
> >> Subject: namd-l: namd 2.9 ibverbs issues
> >>
> >> ok, I'm trying to get the ibverbs binary to run, and I'm struggling a
> >> bit. I have a script I'm using (copied and modified from here:
> >> http://www.isgtw.org/feed-item/open-science-grid-work-log-namd-pbs-and-infiniband-nersc-dirac)
> >> my script is:
> >>
> >>
> >> #!/bin/bash
> >>
> >> set -x
> >> set -e
> >>
> >> # build a node list file based on the PBS
> >> # environment in a form suitable for NAMD/charmrun
> >>
> >> nodefile=/tmp/$PBS_JOBID.nodelist
> >> echo group main > $nodefile
> >> nodes=$( cat $PBS_NODEFILE )
> >> for node in $nodes; do
> >>     echo host $node >> $nodefile
> >> done
> >>
> >> # find the cluster's mpiexec
> >> MPIEXEC=$(which mpiexec)
> >> NAMD_HOME=/home/mgx/testing/namd-ibverbs/2.9-ibverbs
> >>
> >> # Tell charmrun to use all the available nodes, the nodelist built
> >> # above, and the cluster's MPI.
> >> CHARMARGS="+p48 ++nodelist $nodefile"
> >> ${NAMD_HOME}/charmrun \
> >>     ${CHARMARGS} ++verbose ++mpiexec ++remote-shell \
> >>     ${MPIEXEC} ${NAMD_HOME}/namd2
> >>
> >> I run it via Torque/Maui with:
> >>
> >> qsub -l walltime=06:00:00 -l nodes=2:ppn=12 dirac.s
> >>
> >> The job shows up in the queue, then fails with one of these outcomes:
> >>
> >> + MPIEXEC=/shared/openmpi-1.6.1/gcc/bin/mpiexec
> >> + NAMD_HOME=/home/mgx/testing/namd-ibverbs/2.9-ibverbs
> >> + CHARMARGS='+p48 ++nodelist /tmp/196.cmbcluster.nodelist'
> >> + /home/mgx/testing/namd-ibverbs/2.9-ibverbs/charmrun +p48 ++nodelist
> >> /tmp/196.cmbcluster.nodelist ++verbose ++mpiexec ++remote-shell
> >> /shared/openmpi-1.6.1/gcc/bin/mpiexec
> >> /home/mgx/testing/namd-ibverbs/2.9-ibverbs/namd2
> >> Charmrun> charmrun started...
> >> Charmrun> mpiexec started
> >> Charmrun> node programs all started
> >> Charmrun> node programs all connected
> >> Charmrun> started all node programs in 3.777 seconds.
> >> Charmrun: error on request socket--
> >> Socket closed before recv.
> >> mpiexec: killing job...
> >>
> >> or
> >>
> >> + MPIEXEC=/shared/openmpi-1.6.1/gcc/bin/mpiexec
> >> + NAMD_HOME=/home/mgx/testing/namd-ibverbs/2.9-ibverbs
> >> + CHARMARGS='+p48 ++nodelist /tmp/197.cmbcluster.nodelist'
> >> + /home/mgx/testing/namd-ibverbs/2.9-ibverbs/charmrun +p48 ++nodelist
> >> /tmp/197.cmbcluster.nodelist ++verbose ++mpiexec ++remote-shell
> >> /shared/openmpi-1.6.1/gcc/bin/mpiexec
> >> /home/mgx/testing/namd-ibverbs/2.9-ibverbs/namd2
> >> Charmrun> charmrun started...
> >> Charmrun> mpiexec started
> >> Charmrun> node programs all started
> >> Charmrun> error 0 attaching to node:
> >> Timeout waiting for node-program to connect
> >> mpiexec: killing job...
> >>
> >> verbose error and outputs for the jobs are here:
> >>
> >> http://pastebin.com/FpgY4NYi
> >> http://pastebin.com/QXxELwY4
> >>
> >> is my run script incorrect? The first error looks like more
> >> segfaulting, but I cannot tell why the second run is failing.
> >>
> >> --- michael
> >>
> >>
> >

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:22:03 CST