Re: Timeout waiting for node-program to connect

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Thu Jul 22 2010 - 13:20:56 CDT

You should not need to share the working directory.

Try removing the "++pathfix ./" option from the nodelist file. If you do
need this option to point to the binary location, it takes two directories
(++pathfix dir1 dir2), and you will probably need to write out the full
paths. See http://charm.cs.uiuc.edu/manuals/html/install/4_2.html where it
says:

In a network environment, charmrun must be able to locate the directory of
the executable. If all workstations share a common file name space this is
trivial. If they don't, charmrun will attempt to find the executable in a
directory with the same path from the $HOME directory. Pathname resolution
is performed as follows:

    1. The system computes the absolute path of pgm.
    2. If the absolute path starts with the equivalent of $HOME or the
       current working directory, the beginning part of the path is
       replaced with the environment variable $HOME or the current working
       directory. However, if ++pathfix dir1 dir2 is specified in the nodes
       file (see above), the part of the path matching dir1 is replaced
       with dir2.
    3. The system tries to locate this program (with modified pathname and
       appended extension if specified) on all nodes.
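
As a rough illustration of steps 1 and 2 above, here is a minimal Python sketch. It is not charmrun's actual code: the function name resolve_remote_path is hypothetical, the ++pathfix match is assumed to be a prefix match on whole path components, the ++pathfix rule is assumed to take precedence, and the current-working-directory case is left out for brevity.

```python
import posixpath


def resolve_remote_path(pgm, home, cwd, pathfix=None):
    """Sketch of the documented pathname resolution (hypothetical helper)."""
    # Step 1: absolute path of the program; relative names are taken from cwd.
    path = posixpath.normpath(posixpath.join(cwd, pgm))

    def swap_prefix(p, old, new):
        # Replace a leading directory 'old' with 'new', or return None
        # when 'p' does not live under 'old'.
        old = old.rstrip("/")
        if p == old or p.startswith(old + "/"):
            return new.rstrip("/") + p[len(old):]
        return None

    # Step 2, ++pathfix dir1 dir2 case (assumption: prefix match only).
    if pathfix is not None:
        dir1, dir2 = pathfix
        fixed = swap_prefix(path, dir1, dir2)
        if fixed is not None:
            return fixed

    # Step 2, default case: a leading home directory becomes the literal
    # "$HOME", so that each node can expand it to its own home directory.
    fixed = swap_prefix(path, home, "$HOME")
    return fixed if fixed is not None else path


if __name__ == "__main__":
    print(resolve_remote_path(
        "/home/nicolas/PROGRAMMES/NAMD/namd2",
        home="/home/nicolas",
        cwd="/home/nicolas/CALCULS/NAMD_TEST",
    ))
```

The sketch makes the point about the nodelist concrete: ++pathfix needs both a directory to match and a directory to substitute, so a single "./" argument does not give it a usable pair.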

-Jim

On Wed, 21 Jul 2010, Nicolas Floquet wrote:

> Dear NAMD users, I am trying to distribute an MD calculation across the
> machines connected to our local network.
> I declare each machine in the nodelist file as follows ("machines" file):
>
> group main ++pathfix ./
> host (here is the IP address of machine 1)
> host (here is the IP address of machine 2)
> host (here is the IP address of machine 3)
>
> My launch script looks like this:
> #!/bin/csh
> nohup [namd dir]/charmrun ++verbose +p12 ++nodelist [workingdir]/machines
> [namd dir]/namd2 [workingdir]/equi1.conf > [workingdir]/output&
>
> Here is the result in the output file:
> Charmrun remote shell(192.168.212.68.0)> remote responding...
> Charmrun remote shell(192.168.212.68.0)> starting node-program...
> Charmrun remote shell(192.168.212.68.0)> rsh phase successful.
> Charmrun remote shell(192.168.212.68.3)> remote responding...
> Charmrun remote shell(192.168.212.68.3)> starting node-program...
> Charmrun remote shell(192.168.212.68.3)> rsh phase successful.
> Charmrun remote shell(192.168.212.68.6)> remote responding...
> Charmrun remote shell(192.168.212.68.6)> starting node-program...
> Charmrun remote shell(192.168.212.68.6)> rsh phase successful.
> Charmrun remote shell(192.168.212.68.9)> remote responding...
> Charmrun remote shell(192.168.212.68.9)> starting node-program...
> Charmrun remote shell(192.168.212.68.9)> rsh phase successful.
> Charmrun remote shell(192.168.212.100.5)> remote responding...
> Charmrun remote shell(192.168.212.100.5)> starting node-program...
> Charmrun remote shell(192.168.212.100.5)> rsh phase successful.
> Charmrun remote shell(192.168.212.100.2)> remote responding...
> Charmrun remote shell(192.168.212.100.2)> starting node-program...
> Charmrun remote shell(192.168.212.100.11)> remote responding...
> Charmrun remote shell(192.168.212.100.8)> remote responding...
> Charmrun remote shell(192.168.212.100.11)> starting node-program...
> Charmrun remote shell(192.168.212.100.8)> starting node-program...
> Charmrun remote shell(192.168.212.100.8)> rsh phase successful.
> Charmrun remote shell(192.168.212.97.7)> remote responding...
> Charmrun remote shell(192.168.212.100.11)> rsh phase successful.
> Charmrun remote shell(192.168.212.97.7)> starting node-program...
> Charmrun remote shell(192.168.212.100.2)> rsh phase successful.
> Charmrun remote shell(192.168.212.97.1)> remote responding...
> Charmrun remote shell(192.168.212.97.7)> rsh phase successful.
> Charmrun remote shell(192.168.212.97.1)> starting node-program...
> Charmrun remote shell(192.168.212.97.1)> rsh phase successful.
> Charmrun remote shell(192.168.212.97.10)> remote responding...
> Charmrun remote shell(192.168.212.97.4)> remote responding...
> Charmrun remote shell(192.168.212.97.4)> starting node-program...
> Charmrun remote shell(192.168.212.97.4)> rsh phase successful.
> Charmrun remote shell(192.168.212.97.10)> starting node-program...
> Charmrun remote shell(192.168.212.97.10)> rsh phase successful.
> Charmrun> adding client 0: "192.168.212.68", IP:192.168.212.68
> Charmrun> adding client 1: "192.168.212.97", IP:192.168.212.97
> Charmrun> adding client 2: "192.168.212.100", IP:192.168.212.100
> Charmrun> adding client 3: "192.168.212.68", IP:192.168.212.68
> Charmrun> adding client 4: "192.168.212.97", IP:192.168.212.97
> Charmrun> adding client 5: "192.168.212.100", IP:192.168.212.100
> Charmrun> adding client 6: "192.168.212.68", IP:192.168.212.68
> Charmrun> adding client 7: "192.168.212.97", IP:192.168.212.97
> Charmrun> adding client 8: "192.168.212.100", IP:192.168.212.100
> Charmrun> adding client 9: "192.168.212.68", IP:192.168.212.68
> Charmrun> adding client 10: "192.168.212.97", IP:192.168.212.97
> Charmrun> adding client 11: "192.168.212.100", IP:192.168.212.100
> Charmrun> Charmrun = 127.0.1.1, port = 33244
> Charmrun> Sending "0 127.0.1.1 33244 23431 0" to client 0.
> Charmrun> find the node program "/home/nicolas/PROGRAMMES/NAMD/namd2" at
> "/home/nicolas/CALCULS/NAMD_TEST" for 0.
> Charmrun> Starting ssh 192.168.212.68 -l nicolas /bin/sh -f
> Charmrun> Sending "1 127.0.1.1 33244 23431 0" to client 1.
> Charmrun> find the node program "/home/nicolas/PROGRAMMES/NAMD/namd2" at
> "/home/nicolas/CALCULS/NAMD_TEST" for 1.
> Charmrun> Starting ssh 192.168.212.97 -l nicolas /bin/sh -f
> Charmrun> Sending "2 127.0.1.1 33244 23431 0" to client 2.
> Charmrun> find the node program "/home/nicolas/PROGRAMMES/NAMD/namd2" at
> "/home/nicolas/CALCULS/NAMD_TEST" for 2.
> Charmrun> Starting ssh 192.168.212.100 -l nicolas /bin/sh -f
> Charmrun> Sending "3 127.0.1.1 33244 23431 0" to client 3.
> Charmrun> find the node program "/home/nicolas/PROGRAMMES/NAMD/namd2" at
> "/home/nicolas/CALCULS/NAMD_TEST" for 3.
> Charmrun> Starting ssh 192.168.212.68 -l nicolas /bin/sh -f
> Charmrun> Sending "4 127.0.1.1 33244 23431 0" to client 4.
> Charmrun> find the node program "/home/nicolas/PROGRAMMES/NAMD/namd2" at
> "/home/nicolas/CALCULS/NAMD_TEST" for 4.
> Charmrun> Starting ssh 192.168.212.97 -l nicolas /bin/sh -f
> Charmrun> Sending "5 127.0.1.1 33244 23431 0" to client 5.
> Charmrun> find the node program "/home/nicolas/PROGRAMMES/NAMD/namd2" at
> "/home/nicolas/CALCULS/NAMD_TEST" for 5.
> Charmrun> Starting ssh 192.168.212.100 -l nicolas /bin/sh -f
> Charmrun> Sending "6 127.0.1.1 33244 23431 0" to client 6.
> Charmrun> find the node program "/home/nicolas/PROGRAMMES/NAMD/namd2" at
> "/home/nicolas/CALCULS/NAMD_TEST" for 6.
> Charmrun> Starting ssh 192.168.212.68 -l nicolas /bin/sh -f
> Charmrun> Sending "7 127.0.1.1 33244 23431 0" to client 7.
> Charmrun> find the node program "/home/nicolas/PROGRAMMES/NAMD/namd2" at
> "/home/nicolas/CALCULS/NAMD_TEST" for 7.
> Charmrun> Starting ssh 192.168.212.97 -l nicolas /bin/sh -f
> Charmrun> Sending "8 127.0.1.1 33244 23431 0" to client 8.
> Charmrun> find the node program "/home/nicolas/PROGRAMMES/NAMD/namd2" at
> "/home/nicolas/CALCULS/NAMD_TEST" for 8.
> Charmrun> Starting ssh 192.168.212.100 -l nicolas /bin/sh -f
> Charmrun> Sending "9 127.0.1.1 33244 23431 0" to client 9.
> Charmrun> find the node program "/home/nicolas/PROGRAMMES/NAMD/namd2" at
> "/home/nicolas/CALCULS/NAMD_TEST" for 9.
> Charmrun> Starting ssh 192.168.212.68 -l nicolas /bin/sh -f
> Charmrun> Sending "10 127.0.1.1 33244 23431 0" to client 10.
> Charmrun> find the node program "/home/nicolas/PROGRAMMES/NAMD/namd2" at
> "/home/nicolas/CALCULS/NAMD_TEST" for 10.
> Charmrun> Starting ssh 192.168.212.97 -l nicolas /bin/sh -f
> Charmrun> Sending "11 127.0.1.1 33244 23431 0" to client 11.
> Charmrun> find the node program "/home/nicolas/PROGRAMMES/NAMD/namd2" at
> "/home/nicolas/CALCULS/NAMD_TEST" for 11.
> Charmrun> Starting ssh 192.168.212.100 -l nicolas /bin/sh -f
> Charmrun> Waiting for 0-th client to connect.
> Charmrun> Waiting for 1-th client to connect.
> Charmrun> Waiting for 2-th client to connect.
> Charmrun> Waiting for 3-th client to connect.
> Charmrun> client 0 connected (IP=192.168.212.68 data_port=39548)
> Charmrun> client 3 connected (IP=192.168.212.68 data_port=37785)
> Charmrun> client 6 connected (IP=192.168.212.68 data_port=54621)
> Charmrun> client 9 connected (IP=192.168.212.68 data_port=51703)
> Charmrun> Waiting for 4-th client to connect.
>
> The result is the same with ssh, after adding "setenv CONV_RSH ssh" to the
> launch script.
> Passwordless connections between the machines work fine with both ssh and
> rsh.
>
> Do we need to share the working directory or not?
>
> Thanks for any help!
>
> nicolas
>
> --
> ---------------------------
> Dr. Nicolas FLOQUET
> Chargé de Recherches CNRS
> Institut des Biomolécules Max Mousseron (IBMM UMR5247)
> Faculté de Pharmacie, 15 av. Charles Flahault
> B.P. 14491
> 34093 MONTPELLIER CEDEX 5
> ---------------------------
> Tél: 04 67 54 85 50
> Fax: 04 67 54 86 54
> ---------------------------
>
