Problem running NAMD 2.8 with ibverbs

From: Moritz Schlarb (schlarbm_at_uni-mainz.de)
Date: Fri Jan 06 2012 - 09:29:18 CST

Hello everyone,

I'm currently working on deploying NAMD to the Linux cluster at the
Johannes Gutenberg University Mainz, Germany.

I successfully compiled NAMD with the MVAPICH2 MPI and now I wanted to
compare its speed to a version of NAMD using ibverbs.

I compiled charm++ using the following command line:
$ ./build charm++ net-linux-x86_64 ibverbs --no-build-shared --with-production

and running the megatest works (nodelist with two nodes):
$ ./charmrun ++remote-shell ssh +p2 ./pgm
[...]
test 53: completed (5.47 sec)
All tests completed, exiting
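
For readers following along, a two-node nodelist of the kind mentioned
above might look roughly like the sketch below. The hostnames node01 and
node02 and the file name nodelist.example are placeholders, not the
actual cluster nodes or file used here:

```shell
# Write a hypothetical two-node Charm++ nodelist.
# node01/node02 are placeholder hostnames; substitute real ones.
cat > nodelist.example <<'EOF'
group main
host node01
host node02
EOF

# Show the file so it can be checked before handing it to charmrun.
cat nodelist.example
```

charmrun can be pointed at such a file with ++nodelist, e.g.
$ ./charmrun ++nodelist nodelist.example ++remote-shell ssh +p2 ./pgm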

Then I configured NAMD with the following line:
$ ./config Linux-x86_64-g++ --charm-arch net-linux-x86_64-ibverbs
which compiled cleanly.

The resulting namd2 executable works fine when I run it locally:
$ ./namd2 src/alanin
[...]
WallClock: 0.035015 CPUTime: 0.010000 Memory: 31.363281 MB
Program finished.
$ charmrun ++local +p2 namd2 src/alanin
[...]
WallClock: 1.304384 CPUTime: 1.270000 Memory: 70.000000 MB

But when I try to run it on remote nodes (using the same nodelist as
above), I get a timeout:
$ ./charmrun ++remote-shell ssh +p2 ++verbose namd2 src/alanin
[...]
Charmrun> Waiting for 0-th client to connect.
Charmrun> error 0 attaching to node:
Timeout waiting for node-program to connect

When I watch htop on the remote node, I see some shells spawning
and exiting.

Following an answer from the mailing list, I already tried using
++useip and ++usehostname in the charmrun command line and specified the
InfiniBand IP addresses in the nodelist, but neither worked.
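
As a rough sketch of those attempts: the addresses below are placeholders
from a documentation-only range, the file name nodelist.ib.example is made
up, and the exact combination of options and nodelist entries may have
differed. The script only prints the commands rather than running them:

```shell
# Hypothetical nodelist using InfiniBand (IPoIB) addresses.
# 192.0.2.x is a documentation-only range; substitute real addresses.
cat > nodelist.ib.example <<'EOF'
group main
host 192.0.2.11
host 192.0.2.12
EOF

# Print (not run) one charmrun invocation per option tried.
for opt in ++useip ++usehostname; do
    echo ./charmrun ++remote-shell ssh ++nodelist nodelist.ib.example \
        "$opt" +p2 ++verbose namd2 src/alanin
done
```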

I've attached the complete run log and uploaded the tarred namd
directories (namd_full.tgz is the whole NAMD_2.8._Source dir, namd.tgz
is only the Linux-x86_64-g++ dir) here:
https://fileshare.zdv.uni-mainz.de/36d1bd67-45b2-45d4-9c5b-b600d1d28126.repository

Thanks in advance,
        Moritz

-- 
Moritz Schlarb
High Performance Computing
University Mainz, Germany
