Help: NAMD on cluster; nodes reject multiple connections

From: David Kelly (davidkelly_at_sgi.com)
Date: Thu Jul 01 2010 - 00:33:11 CDT

Hi,

 

I'm trying to run NAMD 2.7b2 on a GbE network of 3 servers running
Novell SLES 11 SP1 x64 Linux. Each server contains 2 x Intel X5570
quad-core CPUs, 24GB RAM, 2 x GbE ports, and a SATA disk. With
Hyperthreading, each server appears to have 16 cores (actually, only 8
physical cores).

 

My nodelist file is:

group main ++shell ssh ++ppn 8
host octane3-hn
host octane3-cn1
host octane3-cn3

 

I have also tried omitting "++shell ssh ++ppn 8" from the group line.

$CONV_RSH is set to "ssh".

 

I'm using charmrun with ssh rather than rsh. Passwordless login is set
up, so from the workstation/cluster head node (octane3-hn) a command
such as "ssh octane3-cn1 hostname" runs on the target server without
prompting for a password. This works for both "octane3-cn1" and
"octane3-cn3".
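
The passwordless-login check described above can be scripted. This is a minimal sketch, not part of the original setup: the function names are hypothetical, and it assumes an OpenSSH client, whose BatchMode option makes ssh fail rather than prompt for a password.

```python
import subprocess

# Compute nodes named in the post; adjust for your own nodelist.
HOSTS = ["octane3-cn1", "octane3-cn3"]

def ssh_check_command(host):
    # BatchMode=yes makes ssh exit with an error instead of prompting,
    # so a password prompt shows up as a failure rather than a hang.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
            host, "hostname"]

def check_hosts(hosts, run=subprocess.run):
    """Return {host: bool}; True means the remote command exited with 0."""
    results = {}
    for host in hosts:
        proc = run(ssh_check_command(host), capture_output=True, text=True)
        results[host] = proc.returncode == 0
    return results

# Example (run from the head node): print(check_hosts(HOSTS))
```

The `run` parameter exists only so the function can be exercised without a real network. A True result shows that ssh login works, which is a separate question from whether the started node programs can later connect back to charmrun.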

 

I'm trying to run 24 processes (= 3 servers x 8 physical cores each).
The ssh connections *SEEM* to be OK, but I can't get more than about 16
processes to connect. I get a message such as:

Charmrun> error 15 attaching to node:

Timeout waiting for node-program to connect
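
The 24 = 3 x 8 layout can be checked with a small sketch. It assumes clients are handed out round-robin over the hosts in the nodelist, which matches the "adding client" lines in the ++verbose output at the end of this message. It is an illustration with a hypothetical function name, not charmrun's actual code.

```python
from collections import Counter

def assign_round_robin(hosts, nprocs):
    """Map client index -> host by cycling through the host list."""
    return {i: hosts[i % len(hosts)] for i in range(nprocs)}

hosts = ["octane3-hn", "octane3-cn1", "octane3-cn3"]
placement = assign_round_robin(hosts, 24)

# Count how many clients land on each host.
per_host = Counter(placement.values())
print(per_host)
```

Under that assumption each host receives 8 clients, so +p24 does not oversubscribe the physical cores.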

 

I have tried increasing the number of sessions/connections allowed by
each server's sshd daemon, and I have tried charmrun's "++maxrsh"
argument, which apparently defaults to 16 and so looked suspicious. I
even raised the open-file limit with "ulimit -n 2048". Nothing seems to
make a difference.

 

The output from a "++verbose" charmrun is below. I've scanned the NAMD
Wiki troubleshooting info, known problems, etc. I've seen issues where
all connections are rejected, but nothing where only the connections
beyond some number "N" are rejected. I'd greatly appreciate advice as
to where I should be looking to overcome this problem.

 

Thanks and regards,

David (Kelly)

 

 

dk@octane3-hn:~/NAMD_2.7b2_Linux-x86_64-TCP> charmrun namd2 ++verbose ++maxrsh 24 +p24 Input_Files_SGI_test/afp.config

Charmrun> charmrun started...

Charmrun> using ./nodelist as nodesfile

Charmrun> adding client 0: "octane3-hn", IP:10.0.40.20

Charmrun> adding client 1: "octane3-cn1", IP:10.0.40.30

Charmrun> adding client 2: "octane3-cn3", IP:10.0.40.50

Charmrun> adding client 3: "octane3-hn", IP:10.0.40.20

Charmrun> adding client 4: "octane3-cn1", IP:10.0.40.30

Charmrun> adding client 5: "octane3-cn3", IP:10.0.40.50

Charmrun> adding client 6: "octane3-hn", IP:10.0.40.20

Charmrun> adding client 7: "octane3-cn1", IP:10.0.40.30

Charmrun> adding client 8: "octane3-cn3", IP:10.0.40.50

Charmrun> adding client 9: "octane3-hn", IP:10.0.40.20

Charmrun> adding client 10: "octane3-cn1", IP:10.0.40.30

Charmrun> adding client 11: "octane3-cn3", IP:10.0.40.50

Charmrun> adding client 12: "octane3-hn", IP:10.0.40.20

Charmrun> adding client 13: "octane3-cn1", IP:10.0.40.30

Charmrun> adding client 14: "octane3-cn3", IP:10.0.40.50

Charmrun> adding client 15: "octane3-hn", IP:10.0.40.20

Charmrun> adding client 16: "octane3-cn1", IP:10.0.40.30

Charmrun> adding client 17: "octane3-cn3", IP:10.0.40.50

Charmrun> adding client 18: "octane3-hn", IP:10.0.40.20

Charmrun> adding client 19: "octane3-cn1", IP:10.0.40.30

Charmrun> adding client 20: "octane3-cn3", IP:10.0.40.50

Charmrun> adding client 21: "octane3-hn", IP:10.0.40.20

Charmrun> adding client 22: "octane3-cn1", IP:10.0.40.30

Charmrun> adding client 23: "octane3-cn3", IP:10.0.40.50

Charmrun> Charmrun = 10.0.40.20, port = 41446

Charmrun> Sending "0 10.0.40.20 41446 17640 0" to client 0.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 0.

Charmrun> Starting ssh octane3-hn -l dk /bin/sh -f

Charmrun> remote shell (octane3-hn:0) started

Charmrun> Sending "1 10.0.40.20 41446 17640 0" to client 1.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 1.

Charmrun> Starting ssh octane3-cn1 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn1:1) started

Charmrun> Sending "2 10.0.40.20 41446 17640 0" to client 2.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 2.

Charmrun> Starting ssh octane3-cn3 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn3:2) started

Charmrun> Sending "3 10.0.40.20 41446 17640 0" to client 3.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 3.

Charmrun> Starting ssh octane3-hn -l dk /bin/sh -f

Charmrun> remote shell (octane3-hn:3) started

Charmrun> Sending "4 10.0.40.20 41446 17640 0" to client 4.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 4.

Charmrun> Starting ssh octane3-cn1 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn1:4) started

Charmrun> Sending "5 10.0.40.20 41446 17640 0" to client 5.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 5.

Charmrun> Starting ssh octane3-cn3 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn3:5) started

Charmrun> Sending "6 10.0.40.20 41446 17640 0" to client 6.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 6.

Charmrun> Starting ssh octane3-hn -l dk /bin/sh -f

Charmrun> remote shell (octane3-hn:6) started

Charmrun> Sending "7 10.0.40.20 41446 17640 0" to client 7.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 7.

Charmrun> Starting ssh octane3-cn1 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn1:7) started

Charmrun> Sending "8 10.0.40.20 41446 17640 0" to client 8.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 8.

Charmrun> Starting ssh octane3-cn3 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn3:8) started

Charmrun> Sending "9 10.0.40.20 41446 17640 0" to client 9.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 9.

Charmrun> Starting ssh octane3-hn -l dk /bin/sh -f

Charmrun> remote shell (octane3-hn:9) started

Charmrun> Sending "10 10.0.40.20 41446 17640 0" to client 10.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 10.

Charmrun> Starting ssh octane3-cn1 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn1:10) started

Charmrun> Sending "11 10.0.40.20 41446 17640 0" to client 11.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 11.

Charmrun> Starting ssh octane3-cn3 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn3:11) started

Charmrun> Sending "12 10.0.40.20 41446 17640 0" to client 12.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 12.

Charmrun> Starting ssh octane3-hn -l dk /bin/sh -f

Charmrun> remote shell (octane3-hn:12) started

Charmrun> Sending "13 10.0.40.20 41446 17640 0" to client 13.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 13.

Charmrun> Starting ssh octane3-cn1 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn1:13) started

Charmrun> Sending "14 10.0.40.20 41446 17640 0" to client 14.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 14.

Charmrun> Starting ssh octane3-cn3 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn3:14) started

Charmrun> Sending "15 10.0.40.20 41446 17640 0" to client 15.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 15.

Charmrun> Starting ssh octane3-hn -l dk /bin/sh -f

Charmrun> remote shell (octane3-hn:15) started

Charmrun> Sending "16 10.0.40.20 41446 17640 0" to client 16.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 16.

Charmrun> Starting ssh octane3-cn1 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn1:16) started

Charmrun> Sending "17 10.0.40.20 41446 17640 0" to client 17.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 17.

Charmrun> Starting ssh octane3-cn3 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn3:17) started

Charmrun> Sending "18 10.0.40.20 41446 17640 0" to client 18.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 18.

Charmrun> Starting ssh octane3-hn -l dk /bin/sh -f

Charmrun> remote shell (octane3-hn:18) started

Charmrun> Sending "19 10.0.40.20 41446 17640 0" to client 19.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 19.

Charmrun> Starting ssh octane3-cn1 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn1:19) started

Charmrun> Sending "20 10.0.40.20 41446 17640 0" to client 20.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 20.

Charmrun> Starting ssh octane3-cn3 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn3:20) started

Charmrun> Sending "21 10.0.40.20 41446 17640 0" to client 21.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 21.

Charmrun> Starting ssh octane3-hn -l dk /bin/sh -f

Charmrun> remote shell (octane3-hn:21) started

Charmrun> Sending "22 10.0.40.20 41446 17640 0" to client 22.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 22.

Charmrun> Starting ssh octane3-cn1 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn1:22) started

Charmrun> Sending "23 10.0.40.20 41446 17640 0" to client 23.

Charmrun> find the node program
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP/namd2" at
"/home/dk/NAMD_2.7b2_Linux-x86_64-TCP" for 23.

Charmrun> Starting ssh octane3-cn3 -l dk /bin/sh -f

Charmrun> remote shell (octane3-cn3:23) started

Charmrun> node programs all started

Charmrun remote shell(octane3-cn1.7)> remote responding...

Charmrun remote shell(octane3-cn1.10)> remote responding...

Charmrun remote shell(octane3-cn1.7)> starting node-program...

Charmrun remote shell(octane3-cn1.7)> rsh phase successful.

Charmrun remote shell(octane3-cn1.13)> remote responding...

Charmrun remote shell(octane3-cn1.10)> starting node-program...

Charmrun remote shell(octane3-cn1.10)> rsh phase successful.

Charmrun remote shell(octane3-cn1.13)> starting node-program...

Charmrun remote shell(octane3-cn1.13)> rsh phase successful.

Charmrun remote shell(octane3-cn1.4)> remote responding...

Charmrun remote shell(octane3-cn1.4)> starting node-program...

Charmrun remote shell(octane3-cn1.4)> rsh phase successful.

Charmrun remote shell(octane3-cn3.2)> remote responding...

Charmrun remote shell(octane3-cn3.2)> starting node-program...

Charmrun remote shell(octane3-cn3.2)> rsh phase successful.

Charmrun remote shell(octane3-cn1.1)> remote responding...

Charmrun remote shell(octane3-cn1.1)> starting node-program...

Charmrun remote shell(octane3-cn1.19)> remote responding...

Charmrun remote shell(octane3-cn1.1)> rsh phase successful.

Charmrun remote shell(octane3-cn1.16)> remote responding...

Charmrun remote shell(octane3-cn1.19)> starting node-program...

Charmrun remote shell(octane3-cn1.19)> rsh phase successful.

Charmrun remote shell(octane3-cn1.16)> starting node-program...

Charmrun remote shell(octane3-cn1.16)> rsh phase successful.

Charmrun remote shell(octane3-cn1.22)> remote responding...

Charmrun remote shell(octane3-cn1.22)> starting node-program...

Charmrun remote shell(octane3-cn1.22)> rsh phase successful.

Charmrun remote shell(octane3-hn.12)> remote responding...

Charmrun remote shell(octane3-hn.12)> starting node-program...

Charmrun remote shell(octane3-hn.12)> rsh phase successful.

Charmrun remote shell(octane3-cn3.14)> remote responding...

Charmrun remote shell(octane3-cn3.14)> starting node-program...

Charmrun remote shell(octane3-cn3.14)> rsh phase successful.

Charmrun remote shell(octane3-hn.9)> remote responding...

Charmrun remote shell(octane3-hn.9)> starting node-program...

Charmrun remote shell(octane3-hn.9)> rsh phase successful.

Charmrun remote shell(octane3-hn.6)> remote responding...

Charmrun remote shell(octane3-hn.0)> remote responding...

Charmrun remote shell(octane3-hn.6)> starting node-program...

Charmrun remote shell(octane3-hn.6)> rsh phase successful.

Charmrun remote shell(octane3-hn.0)> starting node-program...

Charmrun remote shell(octane3-hn.0)> rsh phase successful.

Charmrun remote shell(octane3-hn.21)> remote responding...

Charmrun remote shell(octane3-hn.21)> starting node-program...

Charmrun remote shell(octane3-hn.21)> rsh phase successful.

Charmrun remote shell(octane3-hn.15)> remote responding...

Charmrun remote shell(octane3-hn.15)> starting node-program...

Charmrun remote shell(octane3-hn.15)> rsh phase successful.

Charmrun remote shell(octane3-hn.3)> remote responding...

Charmrun remote shell(octane3-hn.3)> starting node-program...

Charmrun remote shell(octane3-hn.3)> rsh phase successful.

Charmrun remote shell(octane3-hn.18)> remote responding...

Charmrun remote shell(octane3-hn.18)> starting node-program...

Charmrun remote shell(octane3-hn.18)> rsh phase successful.

Charmrun remote shell(octane3-cn3.17)> remote responding...

Charmrun remote shell(octane3-cn3.20)> remote responding...

Charmrun remote shell(octane3-cn3.17)> starting node-program...

Charmrun remote shell(octane3-cn3.11)> remote responding...

Charmrun remote shell(octane3-cn3.23)> remote responding...

Charmrun remote shell(octane3-cn3.17)> rsh phase successful.

Charmrun remote shell(octane3-cn3.20)> starting node-program...

Charmrun remote shell(octane3-cn3.11)> starting node-program...

Charmrun remote shell(octane3-cn3.11)> rsh phase successful.

Charmrun remote shell(octane3-cn3.20)> rsh phase successful.

Charmrun remote shell(octane3-cn3.23)> starting node-program...

Charmrun remote shell(octane3-cn3.23)> rsh phase successful.

Charmrun remote shell(octane3-cn3.5)> remote responding...

Charmrun remote shell(octane3-cn3.5)> starting node-program...

Charmrun remote shell(octane3-cn3.5)> rsh phase successful.

Charmrun remote shell(octane3-cn3.8)> remote responding...

Charmrun remote shell(octane3-cn3.8)> starting node-program...

Charmrun remote shell(octane3-cn3.8)> rsh phase successful.

Charmrun> Waiting for 0-th client to connect.

Charmrun> Waiting for 1-th client to connect.

Charmrun> Waiting for 2-th client to connect.

Charmrun> Waiting for 3-th client to connect.

Charmrun> Waiting for 4-th client to connect.

Charmrun> Waiting for 5-th client to connect.

Charmrun> client 7 connected (IP=10.0.40.30 data_port=37316)

Charmrun> client 10 connected (IP=10.0.40.30 data_port=35760)

Charmrun> client 13 connected (IP=10.0.40.30 data_port=54795)

Charmrun> client 4 connected (IP=10.0.40.30 data_port=41389)

Charmrun> client 2 connected (IP=10.0.40.50 data_port=34996)

Charmrun> client 1 connected (IP=10.0.40.30 data_port=40891)

Charmrun> Waiting for 6-th client to connect.

Charmrun> Waiting for 7-th client to connect.

Charmrun> client 11 connected (IP=10.0.40.50 data_port=52794)

Charmrun> client 17 connected (IP=10.0.40.50 data_port=45846)

Charmrun> Waiting for 8-th client to connect.

Charmrun> client 5 connected (IP=10.0.40.50 data_port=43888)

Charmrun> Waiting for 9-th client to connect.

Charmrun> client 22 connected (IP=10.0.40.30 data_port=36375)

Charmrun> Waiting for 10-th client to connect.

Charmrun> client 19 connected (IP=10.0.40.30 data_port=58513)

Charmrun> Waiting for 11-th client to connect.

Charmrun> client 16 connected (IP=10.0.40.30 data_port=47003)

Charmrun> Waiting for 12-th client to connect.

Charmrun> client 8 connected (IP=10.0.40.50 data_port=40369)

Charmrun> Waiting for 13-th client to connect.

Charmrun> Waiting for 14-th client to connect.

Charmrun> client 23 connected (IP=10.0.40.50 data_port=49037)

Charmrun> client 20 connected (IP=10.0.40.50 data_port=54233)

Charmrun> Waiting for 15-th client to connect.

Charmrun> error 15 attaching to node:

Timeout waiting for node-program to connect
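
A log this long is easier to read once the "adding client" and "client N connected" lines are tallied per host. The following is a rough sketch: the function name and regular expressions are assumptions based only on the line formats visible above, not on any charmrun documentation.

```python
import re
from collections import defaultdict

# Patterns taken from the line shapes in the ++verbose output above.
ADDED = re.compile(r'adding client (\d+): "([^"]+)"')
CONNECTED = re.compile(r'client (\d+) connected')

def summarize(log_text):
    """Return {host: {"connected": [...], "missing": [...]}}."""
    host_of = {int(n): host for n, host in ADDED.findall(log_text)}
    connected = {int(n) for n in CONNECTED.findall(log_text)}
    summary = defaultdict(lambda: {"connected": [], "missing": []})
    for client, host in sorted(host_of.items()):
        key = "connected" if client in connected else "missing"
        summary[host][key].append(client)
    return dict(summary)

# A short excerpt of the output above, used here as sample input.
SAMPLE = '''\
Charmrun> adding client 0: "octane3-hn", IP:10.0.40.20
Charmrun> adding client 1: "octane3-cn1", IP:10.0.40.30
Charmrun> adding client 2: "octane3-cn3", IP:10.0.40.50
Charmrun> client 2 connected (IP=10.0.40.50 data_port=34996)
Charmrun> client 1 connected (IP=10.0.40.30 data_port=40891)
'''

for host, state in summarize(SAMPLE).items():
    print(host, state)
```

Applied to the full output above, the "connected" lines cover all eight clients on octane3-cn1 and seven of the eight on octane3-cn3 (client 14 is absent), while none of the eight clients on octane3-hn appear. Grouping by host this way may be a more useful starting point than the total count of about 16.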
