IBverbs charmrun problem

From: Seren Soner (seren.soner_at_gmail.com)
Date: Tue Mar 08 2011 - 05:41:33 CST

Dear All,

I have recently been having problems with charmrun on our InfiniBand
system. Everything worked perfectly until a few days ago. I don't know what
has changed, but there seems to be a problem in the initialization step.

I have been using the script that I wrote for SGE submission. The
execution line is as follows:

charmrun ++verbose ++nodelist charmlist ++p $NSLOTS ++remote-shell ssh \
    $NAMD_PATH/namd2 conf > log

Here's the output:

Charmrun> charmrun started...
Charmrun> using charmlist as nodesfile
Charmrun> remote shell (compute-0-15.local:0) started
Charmrun> remote shell (compute-0-15.local:1) started
Charmrun> remote shell (compute-0-15.local:2) started
Charmrun> remote shell (compute-0-15.local:3) started
Charmrun> remote shell (compute-0-15.local:4) started
Charmrun> remote shell (compute-0-15.local:5) started
Charmrun> remote shell (compute-0-15.local:6) started
Charmrun> remote shell (compute-0-15.local:7) started
Charmrun> remote shell (compute-0-15.local:8) started
Charmrun> remote shell (compute-0-15.local:9) started
Charmrun> remote shell (compute-0-15.local:10) started
Charmrun> remote shell (compute-0-15.local:11) started
Charmrun> remote shell (compute-0-13.local:12) started
Charmrun> remote shell (compute-0-13.local:13) started
Charmrun> remote shell (compute-0-13.local:14) started
Charmrun> remote shell (compute-0-13.local:15) started
Charmrun> remote shell (compute-0-13.local:16) started
Charmrun> remote shell (compute-0-13.local:17) started
Charmrun> remote shell (compute-0-13.local:18) started
Charmrun> remote shell (compute-0-13.local:19) started
Charmrun> remote shell (compute-0-13.local:20) started
Charmrun> remote shell (compute-0-13.local:21) started
Charmrun> remote shell (compute-0-13.local:22) started
Charmrun> node programs all started
Charmrun> error 93620 attaching to node:
Socket closed before recv.

And here's the logfile:

Charmrun> adding client 0: "compute-0-15.local", IP:10.1.255.239
Charmrun> adding client 1: "compute-0-15.local", IP:10.1.255.239
Charmrun> adding client 2: "compute-0-15.local", IP:10.1.255.239
Charmrun> adding client 3: "compute-0-15.local", IP:10.1.255.239
Charmrun> adding client 4: "compute-0-15.local", IP:10.1.255.239
Charmrun> adding client 5: "compute-0-15.local", IP:10.1.255.239
Charmrun> adding client 6: "compute-0-15.local", IP:10.1.255.239
Charmrun> adding client 7: "compute-0-15.local", IP:10.1.255.239
Charmrun> adding client 8: "compute-0-15.local", IP:10.1.255.239
Charmrun> adding client 9: "compute-0-15.local", IP:10.1.255.239
Charmrun> adding client 10: "compute-0-15.local", IP:10.1.255.239
Charmrun> adding client 11: "compute-0-15.local", IP:10.1.255.239
Charmrun> adding client 12: "compute-0-13.local", IP:10.1.255.241
Charmrun> adding client 13: "compute-0-13.local", IP:10.1.255.241
Charmrun> adding client 14: "compute-0-13.local", IP:10.1.255.241
Charmrun> adding client 15: "compute-0-13.local", IP:10.1.255.241
Charmrun> adding client 16: "compute-0-13.local", IP:10.1.255.241
Charmrun> adding client 17: "compute-0-13.local", IP:10.1.255.241
Charmrun> adding client 18: "compute-0-13.local", IP:10.1.255.241
Charmrun> adding client 19: "compute-0-13.local", IP:10.1.255.241
Charmrun> adding client 20: "compute-0-13.local", IP:10.1.255.241
Charmrun> adding client 21: "compute-0-13.local", IP:10.1.255.241
Charmrun> adding client 22: "compute-0-13.local", IP:10.1.255.241
Charmrun> Charmrun = 10.1.255.239, port = 54856
Charmrun> IBVERBS version of charmrun
Charmrun> Sending "0 10.1.255.239 54856 13499 0" to client 0.
Charmrun> find the node program
"/share/apps/NAMD_2.7_Linux-x86_64-ibverbs/namd2
" at "/home/seren/namd_deneme/1" for 0.
Charmrun> Starting ssh compute-0-15.local -l seren /bin/sh -f
Charmrun> Sending "1 10.1.255.239 54856 13499 0" to client 1.
Charmrun> find the node program
"/share/apps/NAMD_2.7_Linux-x86_64-ibverbs/namd2
" at "/home/seren/namd_deneme/1" for 1.
Charmrun> Starting ssh compute-0-15.local -l seren /bin/sh -f
Charmrun> Sending "2 10.1.255.239 54856 13499 0" to client 2.
Charmrun> find the node program
"/share/apps/NAMD_2.7_Linux-x86_64-ibverbs/namd2
" at "/home/seren/namd_deneme/1" for 2.
Charmrun> Starting ssh compute-0-15.local -l seren /bin/sh -f
Charmrun> Sending "3 10.1.255.239 54856 13499 0" to client 3.
Charmrun> find the node program
"/share/apps/NAMD_2.7_Linux-x86_64-ibverbs/namd2
" at "/home/seren/namd_deneme/1" for 3.
Charmrun> Starting ssh compute-0-15.local -l seren /bin/sh -f
Charmrun> Sending "4 10.1.255.239 54856 13499 0" to client 4.
Charmrun> find the node program
"/share/apps/NAMD_2.7_Linux-x86_64-ibverbs/namd2
" at "/home/seren/namd_deneme/1" for 4.
Charmrun> Starting ssh compute-0-15.local -l seren /bin/sh -f
Charmrun> Sending "5 10.1.255.239 54856 13499 0" to client 5.
Charmrun> find the node program
"/share/apps/NAMD_2.7_Linux-x86_64-ibverbs/namd2
" at "/home/seren/namd_deneme/1" for 5.
Charmrun> Starting ssh compute-0-15.local -l seren /bin/sh -f
Charmrun> Sending "6 10.1.255.239 54856 13499 0" to client 6.
Charmrun> find the node program
"/share/apps/NAMD_2.7_Linux-x86_64-ibverbs/namd2
" at "/home/seren/namd_deneme/1" for 6.
Charmrun> Starting ssh compute-0-15.local -l seren /bin/sh -f
Charmrun> Sending "7 10.1.255.239 54856 13499 0" to client 7.
Charmrun> find the node program
"/share/apps/NAMD_2.7_Linux-x86_64-ibverbs/namd2
" at "/home/seren/namd_deneme/1" for 7.
Charmrun> Starting ssh compute-0-15.local -l seren /bin/sh -f
Charmrun> Sending "8 10.1.255.239 54856 13499 0" to client 8.
Charmrun> find the node program
"/share/apps/NAMD_2.7_Linux-x86_64-ibverbs/namd2
" at "/home/seren/namd_deneme/1" for 8.
Charmrun> Starting ssh compute-0-15.local -l seren /bin/sh -f
Charmrun> Sending "9 10.1.255.239 54856 13499 0" to client 9.
Charmrun> find the node program
"/share/apps/NAMD_2.7_Linux-x86_64-ibverbs/namd2
" at "/home/seren/namd_deneme/1" for 9.
Charmrun> Starting ssh compute-0-15.local -l seren /bin/sh -f
Charmrun> Sending "10 10.1.255.239 54856 13499 0" to client 10
Charmrun remote shell(compute-0-15.local.1)> remote responding...
Charmrun remote shell(compute-0-15.local.0)> remote responding...
Charmrun remote shell(compute-0-15.local.2)> remote responding...
Charmrun remote shell(compute-0-15.local.7)> remote responding...
Charmrun remote shell(compute-0-15.local.3)> remote responding...
Charmrun remote shell(compute-0-15.local.4)> remote responding...
Charmrun remote shell(compute-0-15.local.1)> starting node-program...
Charmrun remote shell(compute-0-15.local.1)> rsh phase successful.
Charmrun remote shell(compute-0-13.local.17)> remote responding...
Charmrun remote shell(compute-0-15.local.2)> starting node-program...
Charmrun remote shell(compute-0-15.local.2)> rsh phase successful.
Charmrun remote shell(compute-0-13.local.13)> remote responding...
Charmrun remote shell(compute-0-13.local.18)> remote responding...
Charmrun remote shell(compute-0-13.local.12)> remote responding...
Charmrun remote shell(compute-0-15.local.0)> starting node-program...
Charmrun remote shell(compute-0-15.local.0)> rsh phase successful.
Charmrun remote shell(compute-0-15.local.3)> starting node-program...
Charmrun remote shell(compute-0-15.local.3)> rsh phase successful.
Charmrun remote shell(compute-0-15.local.4)> starting node-program...
Charmrun remote shell(compute-0-15.local.4)> rsh phase successful.
Charmrun remote shell(compute-0-13.local.20)> remote responding...
Charmrun remote shell(compute-0-15.local.5)> remote responding...
Charmrun remote shell(compute-0-13.local.19)> remote responding...
Charmrun remote shell(compute-0-15.local.7)> starting node-program...
Charmrun remote shell(compute-0-15.local.7)> rsh phase successful.
Charmrun remote shell(compute-0-13.local.20)> starting node-program...
Charmrun remote shell(compute-0-13.local.20)> rsh phase successful.
Charmrun remote shell(compute-0-13.local.17)> starting node-program...
Charmrun remote shell(compute-0-13.local.17)> rsh phase successful.
Charmrun remote shell(compute-0-13.local.13)> starting node-program...
Charmrun remote shell(compute-0-13.local.13)> rsh phase successful.
Charmrun remote shell(compute-0-13.local.18)> starting node-program...
Charmrun remote shell(compute-0-13.local.18)> rsh phase successful.
Charmrun remote shell(compute-0-13.local.12)> starting node-program...
Charmrun remote shell(compute-0-13.local.12)> rsh phase successful.
Charmrun remote shell(compute-0-15.local.5)> starting node-program...
Charmrun remote shell(compute-0-15.local.5)> rsh phase successful.
Charmrun remote shell(compute-0-15.local.11)> remote responding...
Charmrun remote shell(compute-0-13.local.19)> starting node-program...
Charmrun remote shell(compute-0-13.local.19)> rsh phase successful.
Charmrun remote shell(compute-0-15.local.11)> starting node-program...
Charmrun remote shell(compute-0-15.local.11)> rsh phase successful.
Charmrun remote shell(compute-0-13.local.14)> remote responding...
Charmrun remote shell(compute-0-15.local.9)> remote responding...
Charmrun remote shell(compute-0-15.local.6)> remote responding...
Charmrun remote shell(compute-0-15.local.8)> remote responding...
Charmrun remote shell(compute-0-13.local.14)> starting node-program...
Charmrun remote shell(compute-0-13.local.14)> rsh phase successful.
Charmrun remote shell(compute-0-15.local.9)> starting node-program...
Charmrun remote shell(compute-0-15.local.9)> rsh phase successful.
Charmrun remote shell(compute-0-13.local.15)> remote responding...
Charmrun remote shell(compute-0-15.local.10)> remote responding...
Charmrun remote shell(compute-0-15.local.6)> starting node-program...
Charmrun remote shell(compute-0-15.local.6)> rsh phase successful.
Charmrun remote shell(compute-0-15.local.8)> starting node-program...
Charmrun remote shell(compute-0-15.local.8)> rsh phase successful.
Charmrun remote shell(compute-0-15.local.10)> starting node-program...
Charmrun remote shell(compute-0-15.local.10)> rsh phase successful.
Charmrun remote shell(compute-0-13.local.15)> starting node-program...
Charmrun remote shell(compute-0-13.local.15)> rsh phase successful.
Charmrun remote shell(compute-0-13.local.21)> remote responding...
Charmrun remote shell(compute-0-13.local.22)> remote responding...
Charmrun remote shell(compute-0-13.local.16)> remote responding...
Charmrun remote shell(compute-0-13.local.21)> starting node-program...
Charmrun remote shell(compute-0-13.local.21)> rsh phase successful.
Charmrun remote shell(compute-0-13.local.22)> starting node-program...
Charmrun remote shell(compute-0-13.local.22)> rsh phase successful.
Charmrun remote shell(compute-0-13.local.16)> starting node-program...
Charmrun remote shell(compute-0-13.local.16)> rsh phase successful.
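
Because the remote-shell messages arrive out of order, it is hard to see by
eye whether every rank got through the rsh phase. A minimal sketch in Python
that scans a log for the "rsh phase successful" lines and lists the ranks that
never printed one. It assumes the line format shown in the log above; the
function name and the sample text are made up for illustration and are not
part of charmrun.

```python
import re

# Matches lines such as:
#   Charmrun remote shell(compute-0-13.local.16)> rsh phase successful.
# The trailing number inside the parentheses is taken to be the rank.
RSH_OK = re.compile(
    r"remote shell\((?P<host>[^)]+)\.(?P<rank>\d+)\)> rsh phase successful"
)


def ranks_missing_rsh_success(log_text, expected_ranks):
    """Return the sorted ranks in range(expected_ranks) with no success line."""
    seen = {int(m.group("rank")) for m in RSH_OK.finditer(log_text)}
    return sorted(set(range(expected_ranks)) - seen)


# Illustrative sample only (not taken from a real run).
SAMPLE = """\
Charmrun remote shell(compute-0-15.local.1)> starting node-program...
Charmrun remote shell(compute-0-15.local.1)> rsh phase successful.
Charmrun remote shell(compute-0-15.local.0)> rsh phase successful.
Charmrun remote shell(compute-0-13.local.3)> remote responding...
"""

print(ranks_missing_rsh_success(SAMPLE, 4))  # prints [2, 3]
```

Run against the logfile above with 23 expected ranks (0 to 22), this should
return an empty list, since each rank has a success line. If so, the failure
would come after the remote-shell phase, which fits the "error attaching to
node" message appearing only after "node programs all started".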
