Re: [SOLVED] Charmrun> error x attaching to node

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Tue Apr 30 2013 - 07:09:22 CDT

What I forgot to mention:

This solution applies to the case where

Charmrun> error 0 attaching to node

is followed by

Timeout waiting for node-program to connect

but not by

Socket closed before recv.

That last message usually points to an ibverbs problem with charmrun, which
can be solved by using an MPI like OpenMPI instead.
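
To make the distinction above concrete, here is a minimal sketch that sorts a
captured charmrun log into the two cases. The function name and the returned
labels are made up for illustration; only the three message strings come from
the errors quoted above. It checks only whether the messages are present, not
their order.

```python
import re

# Message fragments as quoted in this post; the error number may vary.
ATTACH_ERROR = re.compile(r"Charmrun> error \d+ attaching to node")
TIMEOUT = "Timeout waiting for node-program to connect"
SOCKET_CLOSED = "Socket closed before recv"


def classify_charmrun_log(log_text: str) -> str:
    """Return a rough, hypothetical label for a charmrun failure log."""
    if not ATTACH_ERROR.search(log_text):
        # The attach error is absent, so this post does not apply.
        return "no-attach-error"
    if SOCKET_CLOSED in log_text:
        # Usually an ibverbs problem with charmrun.
        return "likely-ibverbs"
    if TIMEOUT in log_text:
        # The case the hostname/loopback fix is meant for.
        return "likely-loopback-hostname"
    return "undetermined"
```

You would feed it the text that charmrun printed, for example the contents of
a job log file.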

 

Norman Geist.

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
Of Norman Geist
Sent: Tuesday, April 30, 2013 12:09
To: Namd Mailing List
Subject: namd-l: [SOLVED] Charmrun> error x attaching to node

 

Hello NAMD users,

as a hint for everyone who runs into the following error while running NAMD
in parallel across multiple nodes:

Charmrun> error 0 attaching to node

(with this or any other error number): since there seems to be no solution
to find out there so far, and the problem can drive one nuts, I decided to
describe what is most likely wrong with your network configuration. Very
likely the local name resolution in "/etc/hosts" on the compute nodes
contains an entry that resolves the compute node's own hostname to a
loopback address. This often looks like:

 

127.0.1.1 hostname

or

127.0.0.1 hostname
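
If you have many nodes to check, a small script can look for such entries.
The following is only a sketch: the function name is my own invention, and it
assumes the usual hosts file layout of an address followed by one or more
names, with "#" starting a comment.

```python
import ipaddress


def loopback_entries(hosts_text: str, hostname: str) -> list:
    """Return loopback addresses that hosts_text maps to hostname."""
    hits = []
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            # Not a parsable IP address; skip the line.
            continue
        if ip.is_loopback and hostname in names:
            hits.append(address)
    return hits
```

On a compute node you could pass it the contents of "/etc/hosts" and the
node's own hostname; an empty result means no such entry was found.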

 

You can check this by logging in to a compute node and pinging its own
hostname ("ping hostname"). If the reply comes from a 127.x.x.x address, your
local name resolution is not suitable for charmrun. For charmrun it is
important that the hostname resolves to an externally reachable IP address,
ideally on the network you want to use for the computation traffic.
Otherwise the node will not be able to connect to the other nodes, as it is
caught within the internal loopback network. This also matters when using
ibverbs, as charmrun needs to resolve the IPoIB IP address to the real
InfiniBand HCA.
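
The same check as the ping test can be scripted. This is a minimal sketch
using Python's standard socket and ipaddress modules; the function name and
the "resolver" parameter are my own additions so that the logic can be tried
without touching the real name resolution.

```python
import ipaddress
import socket


def resolves_to_loopback(hostname: str, resolver=socket.gethostbyname):
    """Resolve hostname and report whether the result is a loopback address.

    Returns a tuple (address, is_loopback).
    """
    address = resolver(hostname)
    return address, ipaddress.ip_address(address).is_loopback
```

On a compute node, resolves_to_loopback(socket.gethostname()) returning True
in the second position corresponds to ping answering from a 127.x.x.x
address.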

I hope this saves you from spending a lot of time googling around without
finding a solution.

 

Good luck

 

Norman Geist

 

PS: Another possible cause is that NAMD is not installed on a shared drive
and therefore has a different path on the compute nodes. Running charmrun
with ++verbose should point that out.
