Re: Using nodelist file causes namd to hang

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed Apr 09 2014 - 05:28:51 CDT

Please try the same command without ++local and see if it still works.

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Douglas Houston
> Sent: Wednesday, 9 April 2014 11:49
> To: ramya narasimhan
> Cc: Namd Mailing List
> Subject: Re: namd-l: Using nodelist file causes namd to hang
>
> The result is the same whichever order the nodes are present in the
> list.
>
> What exactly is Charmrun waiting for at the "Waiting for 0-th client
> to connect." stage? Presumably the 0th client is the first in
> nodelist, and that a process is supposed to start on that node, then
> "connect" to Charmrun on the host machine?

Charmrun has just spawned the namd processes and is now waiting for them to
connect back to it.
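
One way to test this step by hand is to check, from the compute node, whether
a TCP connection back to the Charmrun host can be opened at all. The following
is only a minimal sketch, not part of NAMD or Charm++: the function name
can_reach_charmrun is made up for illustration, and the host and port have to
be taken from the "Charmrun> Charmrun = ..., port = ..." line of the verbose
output while Charmrun is still waiting. That a blocked connection is the cause
here is an assumption, not something confirmed in this thread.

```python
import socket


def can_reach_charmrun(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only tests basic TCP
    reachability and does not speak the Charmrun protocol.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Run on itioc1 or itioc5 during the wait, a call such as
can_reach_charmrun("129.215.237.187", 58330) returning False would point to a
firewall or routing problem between the node and the Charmrun host rather than
to the remote shell.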

>
> Using the command top I see no evidence of anything new starting on
> the node, despite all the "starting node-program" and "rsh phase
> successful" messages that are output.
>
> Using "ps -u douglas" on the node shows a whole bunch of tcsh and sh
> shells and sleep processes starting then dying but nothing else.
>
> What does the line "Sending "0 129.215.237.187 57453 26737 0" to
> client 0" mean? How is this "sending" achieved? I see "port 57453" is
> mentioned in the output ...

This seems to be part of the parallel startup, where the spawned processes
are given the information they need about each other.
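
Judging only from the log in this thread, the first field of that string
matches the client number, and the second and third match the address and port
that Charmrun prints for itself ("Charmrun = 129.215.237.187, port = 58330" in
the first run). The sketch below splits the string on that assumption. The
names StartupInfo and parse_startup_string are made up for illustration, and
the last two fields are left uninterpreted because nothing in the thread
explains them.

```python
from typing import NamedTuple, Tuple


class StartupInfo(NamedTuple):
    client_index: int       # which client the string was sent to
    charmrun_host: str      # address Charmrun reported for itself
    charmrun_port: int      # port Charmrun reported for itself
    extra: Tuple[str, ...]  # remaining fields, not explained in this thread


def parse_startup_string(text: str) -> StartupInfo:
    """Split a string such as '0 129.215.237.187 57453 26737 0' into fields."""
    parts = text.split()
    if len(parts) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(parts)}")
    return StartupInfo(int(parts[0]), parts[1], int(parts[2]), tuple(parts[3:]))


info = parse_startup_string("0 129.215.237.187 57453 26737 0")
print(info.client_index, info.charmrun_host, info.charmrun_port)
```

If that reading is right, the host and port pair is what each namd process
would need in order to connect back to Charmrun.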

>
>
>
>
> Quoting ramya narasimhan <ramya_jln_at_yahoo.co.in> on Wed, 9 Apr 2014
> 11:51:52 +0800 (SGT):
>
> > Just change the order of the hostnames [IP of the system] in the
> > nodefile, so that the 0-th client will be itioc5 instead of itioc1,
> > to find out whether the problem is with the nodes.
> >
> >
> > Dr. Ramya.L.
> > On Tuesday, 8 April 2014 7:23 PM, Douglas Houston
> > <DouglasR.Houston_at_ed.ac.uk> wrote:
> >
> > Yes, with ping all the nodes resolve to full hostnames and IP
> > addresses. I tried putting IP addresses into nodelist instead of
> > hostnames but it still times out at "Waiting for 0-th client to
> > connect"
> >
> >
> > Quoting Norman Geist <norman.geist_at_uni-greifswald.de> on Tue, 8 Apr
> > 2014 14:30:15 +0200:
> >
> >> On all the nodes? Otherwise try a nodelist with IP addresses instead of
> >> hostnames. If that works, you have a problem with local DNS.
> >>
> >> Norman Geist.
> >>
> >>
> >>> -----Original Message-----
> >>> From: Douglas Houston [mailto:DouglasR.Houston_at_ed.ac.uk]
> >>> Sent: Tuesday, 8 April 2014 14:14
> >>> To: Norman Geist
> >>> Cc: Namd Mailing List
> >>> Subject: Re: Re: Re: namd-l: Using nodelist file causes namd to hang
> >>>
> >>> Thanks Norman. I had found that thread after my searches but it did
> >>> not seem to apply to my problem.
> >>>
> >>> "You can check this while doing a ping to the hostname, while you are
> >>> logged in at a compute node "ping hostname". If this returns an
> >>> 127.x.x.x address, your local DNS configuration is not suitable for
> >>> charmrun"
> >>>
> >>> My ping returns the full name and IP address of the node, not
> >>> 127.x.x.x.
> >>>
> >>>
> >>>
> >>> Quoting Norman Geist <norman.geist_at_uni-greifswald.de> on Tue, 8 Apr
> >>> 2014 13:22:41 +0200:
> >>>
> >>> > Now I remember that I already posted a solution for this some weeks
> >>> > ago; you could have found it by using google.de. Maybe this helps you.
> >>> >
> >>> > http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2012-2013/2645.html
> >>> >
> >>> > Norman Geist.
> >>> >
> >>> >
> >>> >> -----Original Message-----
> >>> >> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> >>> >> Behalf Of Douglas Houston
> >>> >> Sent: Tuesday, 8 April 2014 12:53
> >>> >> To: Norman Geist
> >>> >> Cc: Namd Mailing List
> >>> >> Subject: Re: Re: namd-l: Using nodelist file causes namd to hang
> >>> >>
> >>> >> Thanks for the tip Norman, but if I change my command to the following
> >>> >> it still hangs at the same point:
> >>> >>
> >>> >> /usr/people/douglas/programs/NAMD_2.9_Linux-x86/charmrun +p12
> >>> >> ++remote-shell ssh
> >>> >> /usr/people/douglas/programs/NAMD_2.9_Linux-x86/namd2 ++verbose
> >>> >> mdrun.conf
> >>> >>
> >>> >>
> >>> >>
> >>> >> Quoting Norman Geist <norman.geist_at_uni-greifswald.de> on Tue, 8 Apr
> >>> >> 2014 12:06:03 +0200:
> >>> >>
> >>> >> > Try the charmrun option "++remote-shell ssh".
> >>> >> >
> >>> >> > Norman Geist.
> >>> >> >
> >>> >> >> -----Original Message-----
> >>> >> >> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> >>> >> >> Behalf Of Douglas Houston
> >>> >> >> Sent: Tuesday, 8 April 2014 11:30
> >>> >> >> To: namd-l_at_ks.uiuc.edu
> >>> >> >> Subject: namd-l: Using nodelist file causes namd to hang
> >>> >> >>
> >>> >> >> I have two nodes connected via ethernet: itioc5 and itioc1
> >>> >> >>
> >>> >> >> I have the following in my nodelist file:
> >>> >> >>
> >>> >> >> group main
> >>> >> >> host itioc1
> >>> >> >> host itioc5
> >>> >> >>
> >>> >> >> I am using the following command:
> >>> >> >>
> >>> >> >> /usr/people/douglas/programs/NAMD_2.9_Linux-x86/charmrun +p12
> >>> >> >> /usr/people/douglas/programs/NAMD_2.9_Linux-x86/namd2 ++verbose
> >>> >> >> mdrun.conf
> >>> >> >>
> >>> >> >> I get the following output:
> >>> >> >>
> >>> >> >> Charmrun> charmrun started...
> >>> >> >> Charmrun> using ./nodelist as nodesfile
> >>> >> >> Charmrun> adding client 0: "itioc1", IP:129.215.137.21
> >>> >> >> Charmrun> adding client 1: "itioc5", IP:129.215.237.186
> >>> >> >> Charmrun> adding client 2: "itioc1", IP:129.215.137.21
> >>> >> >> Charmrun> adding client 3: "itioc5", IP:129.215.237.186
> >>> >> >> Charmrun> adding client 4: "itioc1", IP:129.215.137.21
> >>> >> >> Charmrun> adding client 5: "itioc5", IP:129.215.237.186
> >>> >> >> Charmrun> adding client 6: "itioc1", IP:129.215.137.21
> >>> >> >> Charmrun> adding client 7: "itioc5", IP:129.215.237.186
> >>> >> >> Charmrun> adding client 8: "itioc1", IP:129.215.137.21
> >>> >> >> Charmrun> adding client 9: "itioc5", IP:129.215.237.186
> >>> >> >> Charmrun> adding client 10: "itioc1", IP:129.215.137.21
> >>> >> >> Charmrun> adding client 11: "itioc5", IP:129.215.237.186
> >>> >> >> Charmrun> Charmrun = 129.215.237.187, port = 58330
> >>> >> >> start_nodes_rsh
> >>> >> >> Charmrun> Sending "0 129.215.237.187 58330 19205 0" to client 0.
> >>> >> >> Charmrun> find the node program "/usr/people/douglas/programs/NAMD_2.9_Linux-x86/namd2" at "/usr/people/douglas/projects/UPS/targets/SCF/2AST/MD/parallelise_itioc" for 0.
> >>> >> >> Charmrun> Starting ssh itioc1 -l douglas /bin/sh -f
> >>> >> >> Charmrun> remote shell (itioc1:0) started
> >>> >> >> Charmrun> Sending "1 129.215.237.187 58330 19205 0" to client 1.
> >>> >> >> Charmrun> find the node program "/usr/people/douglas/programs/NAMD_2.9_Linux-x86/namd2" at "/usr/people/douglas/projects/UPS/targets/SCF/2AST/MD/parallelise_itioc" for 1.
> >>> >> >> Charmrun> Starting ssh itioc5 -l douglas /bin/sh -f
> >>> >> >> Charmrun> remote shell (itioc5:1) started
> >>> >> >> Charmrun> Sending "2 129.215.237.187 58330 19205 0" to client 2.
> >>> >> >> Charmrun> find the node program "/usr/people/douglas/programs/NAMD_2.9_Linux-x86/namd2" at "/usr/people/douglas/projects/UPS/targets/SCF/2AST/MD/parallelise_itioc" for 2.
> >>> >> >> Charmrun> Starting ssh itioc1 -l douglas /bin/sh -f
> >>> >> >> Charmrun> remote shell (itioc1:2) started
> >>> >> >> Charmrun> Sending "3 129.215.237.187 58330 19205 0" to client 3.
> >>> >> >> Charmrun> find the node program "/usr/people/douglas/programs/NAMD_2.9_Linux-x86/namd2" at "/usr/people/douglas/projects/UPS/targets/SCF/2AST/MD/parallelise_itioc" for 3.
> >>> >> >> Charmrun> Starting ssh itioc5 -l douglas /bin/sh -f
> >>> >> >> Charmrun> remote shell (itioc5:3) started
> >>> >> >> Charmrun> Sending "4 129.215.237.187 58330 19205 0" to client 4.
> >>> >> >> Charmrun> find the node program "/usr/people/douglas/programs/NAMD_2.9_Linux-x86/namd2" at "/usr/people/douglas/projects/UPS/targets/SCF/2AST/MD/parallelise_itioc" for 4.
> >>> >> >> Charmrun> Starting ssh itioc1 -l douglas /bin/sh -f
> >>> >> >> Charmrun> remote shell (itioc1:4) started
> >>> >> >> Charmrun> Sending "5 129.215.237.187 58330 19205 0" to client 5.
> >>> >> >> Charmrun> find the node program "/usr/people/douglas/programs/NAMD_2.9_Linux-x86/namd2" at "/usr/people/douglas/projects/UPS/targets/SCF/2AST/MD/parallelise_itioc" for 5.
> >>> >> >> Charmrun> Starting ssh itioc5 -l douglas /bin/sh -f
> >>> >> >> Charmrun> remote shell (itioc5:5) started
> >>> >> >> Charmrun> Sending "6 129.215.237.187 58330 19205 0" to client 6.
> >>> >> >> Charmrun> find the node program "/usr/people/douglas/programs/NAMD_2.9_Linux-x86/namd2" at "/usr/people/douglas/projects/UPS/targets/SCF/2AST/MD/parallelise_itioc" for 6.
> >>> >> >> Charmrun> Starting ssh itioc1 -l douglas /bin/sh -f
> >>> >> >> Charmrun> remote shell (itioc1:6) started
> >>> >> >> Charmrun> Sending "7 129.215.237.187 58330 19205 0" to client 7.
> >>> >> >> Charmrun> find the node program "/usr/people/douglas/programs/NAMD_2.9_Linux-x86/namd2" at "/usr/people/douglas/projects/UPS/targets/SCF/2AST/MD/parallelise_itioc" for 7.
> >>> >> >> Charmrun> Starting ssh itioc5 -l douglas /bin/sh -f
> >>> >> >> Charmrun> remote shell (itioc5:7) started
> >>> >> >> Charmrun> Sending "8 129.215.237.187 58330 19205 0" to client 8.
> >>> >> >> Charmrun> find the node program "/usr/people/douglas/programs/NAMD_2.9_Linux-x86/namd2" at "/usr/people/douglas/projects/UPS/targets/SCF/2AST/MD/parallelise_itioc" for 8.
> >>> >> >> Charmrun> Starting ssh itioc1 -l douglas /bin/sh -f
> >>> >> >> Charmrun> remote shell (itioc1:8) started
> >>> >> >> Charmrun> Sending "9 129.215.237.187 58330 19205 0" to client 9.
> >>> >> >> Charmrun> find the node program "/usr/people/douglas/programs/NAMD_2.9_Linux-x86/namd2" at "/usr/people/douglas/projects/UPS/targets/SCF/2AST/MD/parallelise_itioc" for 9.
> >>> >> >> Charmrun> Starting ssh itioc5 -l douglas /bin/sh -f
> >>> >> >> Charmrun> remote shell (itioc5:9) started
> >>> >> >> Charmrun> Sending "10 129.215.237.187 58330 19205 0" to client 10.
> >>> >> >> Charmrun> find the node program "/usr/people/douglas/programs/NAMD_2.9_Linux-x86/namd2" at "/usr/people/douglas/projects/UPS/targets/SCF/2AST/MD/parallelise_itioc" for 10.
> >>> >> >> Charmrun> Starting ssh itioc1 -l douglas /bin/sh -f
> >>> >> >> Charmrun> remote shell (itioc1:10) started
> >>> >> >> Charmrun> Sending "11 129.215.237.187 58330 19205 0" to client 11.
> >>> >> >> Charmrun> find the node program "/usr/people/douglas/programs/NAMD_2.9_Linux-x86/namd2" at "/usr/people/douglas/projects/UPS/targets/SCF/2AST/MD/parallelise_itioc" for 11.
> >>> >> >> Charmrun> Starting ssh itioc5 -l douglas /bin/sh -f
> >>> >> >> Charmrun> remote shell (itioc5:11) started
> >>> >> >> Charmrun> node programs all started
> >>> >> >> Charmrun remote shell(itioc5.3)> remote responding...
> >>> >> >> Charmrun remote shell(itioc5.5)> remote responding...
> >>> >> >> Charmrun remote shell(itioc5.3)> starting node-program...
> >>> >> >> Charmrun remote shell(itioc5.5)> starting node-program...
> >>> >> >> Charmrun remote shell(itioc5.3)> rsh phase successful.
> >>> >> >> Charmrun remote shell(itioc5.5)> rsh phase successful.
> >>> >> >> Charmrun remote shell(itioc5.9)> remote responding...
> >>> >> >> Charmrun remote shell(itioc5.7)> remote responding...
> >>> >> >> Charmrun remote shell(itioc5.11)> remote responding...
> >>> >> >> Charmrun remote shell(itioc5.1)> remote responding...
> >>> >> >> Charmrun remote shell(itioc5.9)> starting node-program...
> >>> >> >> Charmrun remote shell(itioc5.7)> starting node-program...
> >>> >> >> Charmrun remote shell(itioc5.9)> rsh phase successful.
> >>> >> >> Charmrun remote shell(itioc5.7)> rsh phase successful.
> >>> >> >> Charmrun remote shell(itioc5.11)> starting node-program...
> >>> >> >> Charmrun remote shell(itioc5.1)> starting node-program...
> >>> >> >> Charmrun remote shell(itioc5.11)> rsh phase successful.
> >>> >> >> Charmrun remote shell(itioc5.1)> rsh phase successful.
> >>> >> >> Charmrun remote shell(itioc1.10)> remote responding...
> >>> >> >> Charmrun remote shell(itioc1.0)> remote responding...
> >>> >> >> Charmrun remote shell(itioc1.4)> remote responding...
> >>> >> >> Charmrun remote shell(itioc1.10)> starting node-program...
> >>> >> >> Charmrun remote shell(itioc1.10)> rsh phase successful.
> >>> >> >> Charmrun remote shell(itioc1.0)> starting node-program...
> >>> >> >> Charmrun remote shell(itioc1.0)> rsh phase successful.
> >>> >> >> Charmrun remote shell(itioc1.4)> starting node-program...
> >>> >> >> Charmrun remote shell(itioc1.4)> rsh phase successful.
> >>> >> >> Charmrun remote shell(itioc1.2)> remote responding...
> >>> >> >> Charmrun remote shell(itioc1.6)> remote responding...
> >>> >> >> Charmrun remote shell(itioc1.8)> remote responding...
> >>> >> >> Charmrun remote shell(itioc1.2)> starting node-program...
> >>> >> >> Charmrun remote shell(itioc1.2)> rsh phase successful.
> >>> >> >> Charmrun remote shell(itioc1.6)> starting node-program...
> >>> >> >> Charmrun remote shell(itioc1.6)> rsh phase successful.
> >>> >> >> Charmrun remote shell(itioc1.8)> starting node-program...
> >>> >> >> Charmrun remote shell(itioc1.8)> rsh phase successful.
> >>> >> >> Charmrun> Waiting for 0-th client to connect.
> >>> >> >> Charmrun> error 0 attaching to node:
> >>> >> >> Timeout waiting for node-program to connect
> >>> >> >>
> >>> >> >>
> >>> >> >> I'm not sure but I think the "Starting ssh itioc5 -l douglas /bin/sh
> >>> >> >> -f" lines have something to do with it. If I run the command "ssh
> >>> >> >> itioc5 -l douglas /bin/sh -f" it also hangs. If I run "ssh itioc5 -l
> >>> >> >> douglas" then it logs me in just fine (without asking for a password).
> >>> >> >> Similarly the command "ssh itioc5 -l douglas -f pwd" works fine, with
> >>> >> >> the expected directory name returned.
> >>> >> >>
> >>> >> >> What exactly is happening at the "Waiting for 0-th client to connect."
> >>> >> >> stage?
> >>> >> >>
> >>> >> >> Many thanks in advance for your thoughts.
> >>> >> >>
> >>> >> >> cheers,
> >>> >> >>
> >>> >> >> Doug
> >>> >> >>
> >>> >> >> _____________________________________________________
> >>> >> >> Dr. Douglas R. Houston
> >>> >> >> Lecturer
> >>> >> >> Institute of Structural and Molecular Biology
> >>> >> >> Room 3.23, Michael Swann Building
> >>> >> >> King's Buildings
> >>> >> >> University of Edinburgh
> >>> >> >> Edinburgh, EH9 3JR, UK
> >>> >> >> Tel. 0131 650 7358
> >>> >> >> http://tinyurl.com/douglasrhouston
> >>> >> >>
> >>> >> >> --
> >>> >> >> The University of Edinburgh is a charitable body, registered in
> >>> >> >> Scotland, with registration number SC005336.
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > ---
> >>> >> > This e-mail is free of viruses and malware because avast!
> >>> >> > Antivirus protection is active.
> >>> >> > http://www.avast.com
> >
> >>> >> >
> >>> >> >
> >>> >> >

This archive was generated by hypermail 2.1.6 : Thu Dec 31 2015 - 23:20:41 CST