Re: Problem running NAMD 2.8 with ibverbs

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Mon Jan 09 2012 - 00:32:06 CST

Hi Moritz,

I would answer you in German, since I'm at the University of Greifswald,
but we should keep the mailing list usable for an international audience.

Please send me the output of:

>ifconfig -a
>cat /sys/class/net/ib0/mode
>cat /sys/class/net/ib0/mtu
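
If it is easier to collect this on several nodes, the two sysfs reads above can be wrapped in a small helper. This is only a sketch: the function name ib_report is made up for illustration, and it assumes the IPoIB interface is called ib0 unless another sysfs directory is passed as the first argument.

```shell
# Print the IPoIB mode and MTU of one interface.
# ib_report is a hypothetical helper name, not part of any package.
# Usage: ib_report [sysfs_dir]   (default: /sys/class/net/ib0)
ib_report() {
    dir="${1:-/sys/class/net/ib0}"
    if [ ! -d "$dir" ]; then
        # No such interface: the IPoIB driver is probably not loaded.
        echo "interface directory $dir not found (IPoIB driver loaded?)"
        return 1
    fi
    for f in mode mtu; do
        if [ -r "$dir/$f" ]; then
            echo "$f: $(cat "$dir/$f")"
        else
            echo "$f: unavailable"
        fi
    done
}
```

Running ib_report without arguments on each node listed in the nodelist shows quickly whether ib0 exists everywhere and whether mode and MTU agree between the nodes.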

What InfiniBand hardware do you have (HCA, switch)?
How did you install it? (Did you use OFED?)

Have you thought of installing the IPoIB driver? It is needed to resolve
IP addresses to the right InfiniBand addresses.
Have you tried the IB tools like ping etc.? Did they work?
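
Before trying the IB tools, it may help to confirm they are installed on every node. The sketch below only checks whether commands are on the PATH; the function name missing_tools is made up for illustration, and the tool names in the usage note (ibstat, ibv_devinfo, ibping) are the usual OFED diagnostics, which I am assuming your installation provides under those names.

```shell
# Report which of the given commands are missing from PATH.
# missing_tools is a hypothetical helper name.
# Prints one "missing: NAME" line per absent command and
# returns 1 if at least one command is missing, 0 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return $rc
}
```

For example, "missing_tools ibstat ibv_devinfo ibping" prints nothing if all three are available. If a tool is reported missing, the diagnostics package of your InfiniBand stack is probably not installed on that node.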

Let me know
Norman Geist.

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Moritz Schlarb
> Sent: Friday, 6 January 2012 16:42
> To: namd-l_at_ks.uiuc.edu
> Subject: Re: namd-l: Problem running NAMD 2.8 with ibverbs
>
> Hello again,
>
> I just want to add that it works with the precompiled version from the
> homepage (NAMD 2.8 for Linux-x86_64-ibverbs).
>
> On 06.01.2012 16:29, Moritz Schlarb wrote:
> > Hello everyone,
> >
> > I'm currently working on deploying NAMD to the Linux cluster at the
> > Johannes Gutenberg University Mainz, Germany.
> >
> > I successfully compiled NAMD with the MVAPICH2 MPI and now I wanted to
> > compare its speed to a version of NAMD using ibverbs.
> >
> > I compiled charm++ using the following command line:
> > $ ./build charm++ net-linux-x86_64 ibverbs --no-build-shared --with-production
> >
> > and running the megatest works (nodelist with two nodes):
> > $ ./charmrun ++remote-shell ssh +p2 ./pgm
> > [...]
> > test 53: completed (5.47 sec)
> > All tests completed, exiting
> >
> > Then I configured NAMD with the following line:
> > $ ./config Linux-x86_64-g++ --charm-arch net-linux-x86_64-ibverbs
> > which compiles cleanly.
> >
> > The resulting namd2 executable works fine when I run it locally:
> > $ ./namd2 src/alanin
> > [...]
> > WallClock: 0.035015 CPUTime: 0.010000 Memory: 31.363281 MB
> > Program finished.
> > $ charmrun ++local +p2 namd2 src/alanin
> > [...]
> > WallClock: 1.304384 CPUTime: 1.270000 Memory: 70.000000 MB
> >
> > But when I want to run it on remote nodes (using the same nodelist as
> > above), I get a timeout:
> > $ ./charmrun ++remote-shell ssh +p2 ++verbose namd2 src/alanin
> > [...]
> > Charmrun> Waiting for 0-th client to connect.
> > Charmrun> error 0 attaching to node:
> > Timeout waiting for node-program to connect
> >
> > When I watch htop on the remote node, I see some shells spawning
> > and exiting.
> >
> > Following an answer from the mailing list, I already tried using
> > ++useip and ++usehostname on the charmrun command line and specified the
> > InfiniBand IP addresses in the nodelist, but none of that worked.
> >
> > I've attached the complete run log and uploaded the tarred namd
> > directories (namd_full.tgz is the whole NAMD_2.8._Source dir, namd.tgz
> > is only the Linux-x86_64-g++ dir) here:
> > https://fileshare.zdv.uni-mainz.de/36d1bd67-45b2-45d4-9c5b-b600d1d28126.repository
> >
> >
> > Thanks in advance,
> > Moritz
> >
