From: Darko Stefanovski (stefanov_at_usc.edu)
Date: Thu May 05 2011 - 11:53:40 CDT
Hi All,
  
We are having some difficulties running NAMD on two identical machines with three CUDA-capable GTX 580 cards (linked by a three-way SLI bridge). Initially, we had trouble getting the run-time libraries to load via LD_LIBRARY_PATH, but that problem seems to have been resolved by using the runscript from the NAMD website. However, charmrun is still exiting with error code 1 on host2. We execute the command from host1. Does anybody have any ideas about how we should proceed?
  
  Best wishes,
  Darko Stefanovski
  
  P.S. Below you will find the log.
  
  Charmrun remote shell(host1_.4)>  remote responding...
  Charmrun remote shell(host1_.4)>  starting node-program...
  Charmrun remote shell(host1_.4)>  rsh phase successful.
  Charmrun remote shell(host1_.6)>  remote responding...
  Charmrun remote shell(host1_.6)>  starting node-program...
  Charmrun remote shell(host1_.6)>  rsh phase successful.
  Charmrun remote shell(host1_.2)>  remote responding...
  Charmrun remote shell(host1_.2)>  starting node-program...
  Charmrun remote shell(host1_.2)>  rsh phase successful.
  Charmrun remote shell(host1_.0)>  remote responding...
  Charmrun remote shell(host1_.0)>  starting node-program...
  Charmrun remote shell(host1_.0)>  rsh phase successful.
  Charmrun remote shell(host2_.7)>  remote responding...
  Charmrun remote shell(host2_.7)>  starting node-program...
  Charmrun remote shell(host2_.7)>  rsh phase successful.
  Charmrun remote shell(host2_.5)>  remote responding...
  Charmrun remote shell(host2_.5)>  starting node-program...
  Charmrun remote shell(host2_.5)>  rsh phase successful.
  Charmrun remote shell(host2_.3)>  remote responding...
  Charmrun remote shell(host2_.3)>  starting node-program...
  Charmrun remote shell(host2_.3)>  rsh phase successful.
  Charmrun remote shell(host2_.1)>  remote responding...
  Charmrun remote shell(host2_.1)>  starting node-program...
  Charmrun remote shell(host2_.1)>  rsh phase successful.
  Charmrun remote shell(host2_.7)>  Exiting with error code 1
  Charmrun remote shell(host2_.5)>  Exiting with error code 1
  Charmrun remote shell(host2_.3)>  Exiting with error code 1
  Charmrun remote shell(host2_.1)>  Exiting with error code 1
  Charmrun>  adding client 0: "host1_", IP:128.0.0.1
  Charmrun>  adding client 1: "host2_", IP:128.0.0.2
  Charmrun>  adding client 2: "host1_", IP:128.0.0.1
  Charmrun>  adding client 3: "host2_", IP:128.0.0.2
  Charmrun>  adding client 4: "host1_", IP:128.0.0.1
  Charmrun>  adding client 5: "host2_", IP:128.0.0.2
  Charmrun>  adding client 6: "host1_", IP:128.0.0.1
  Charmrun>  adding client 7: "host2_", IP:128.0.0.2
  Charmrun>  Charmrun = 128.0.0.1, port = 44971
  Charmrun>  Sending "0 128.0.0.1 44971 19009 0" to client 0.
  Charmrun>  find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 0.
  Charmrun>  Starting rsh host1_ -l robert /bin/sh -f
  Charmrun>  Sending "1 128.0.0.1 44971 19009 0" to client 1.
  Charmrun>  find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 1.
  Charmrun>  Starting rsh host2_ -l robert /bin/sh -f
  Charmrun>  Sending "2 128.0.0.1 44971 19009 0" to client 2.
  Charmrun>  find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 2.
  Charmrun>  Starting rsh host1_ -l robert /bin/sh -f
  Charmrun>  Sending "3 128.0.0.1 44971 19009 0" to client 3.
  Charmrun>  find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 3.
  Charmrun>  Starting rsh host2_ -l robert /bin/sh -f
  Charmrun>  Sending "4 128.0.0.1 44971 19009 0" to client 4.
  Charmrun>  find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 4.
  Charmrun>  Starting rsh host1_ -l robert /bin/sh -f
  Charmrun>  Sending "5 128.0.0.1 44971 19009 0" to client 5.
  Charmrun>  find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 5.
  Charmrun>  Starting rsh host2_ -l robert /bin/sh -f
  Charmrun>  Sending "6 128.0.0.1 44971 19009 0" to client 6.
  Charmrun>  find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 6.
  Charmrun>  Starting rsh host1_ -l robert /bin/sh -f
  Charmrun>  Sending "7 128.0.0.1 44971 19009 0" to client 7.
  Charmrun>  find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 7.
  Charmrun>  Starting rsh host2_ -l robert /bin/sh -f
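
In a log like the one above, the failing processes can be picked out by tallying the "Exiting with error code" lines per host. The following is only a minimal sketch for reading such output, not part of NAMD or charmrun. The line pattern is an assumption inferred from the lines in this log rather than a documented charmrun format, and the names EXIT_RE, failed_ranks and SAMPLE are hypothetical.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the log in this message:
#   Charmrun remote shell(host2_.7)>  Exiting with error code 1
EXIT_RE = re.compile(
    r"Charmrun remote shell\((?P<host>[^)]*)\.(?P<rank>\d+)\)>\s+"
    r"Exiting with error code (?P<code>\d+)"
)


def failed_ranks(log_text):
    """Return {host: [(rank, exit_code), ...]} for every non-zero exit line."""
    failures = defaultdict(list)
    for line in log_text.splitlines():
        match = EXIT_RE.search(line)
        if match and int(match.group("code")) != 0:
            failures[match.group("host")].append(
                (int(match.group("rank")), int(match.group("code")))
            )
    # Sort the ranks so the summary is stable regardless of log order.
    return {host: sorted(ranks) for host, ranks in failures.items()}


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
Charmrun remote shell(host1_.0)>  rsh phase successful.
Charmrun remote shell(host2_.7)>  Exiting with error code 1
Charmrun remote shell(host2_.5)>  Exiting with error code 1
"""

if __name__ == "__main__":
    for host, ranks in failed_ranks(SAMPLE).items():
        print(host, ranks)
```

Run against the full log, this kind of tally shows whether only the ranks started on host2_ fail, which would point at the environment of the remote shell on that machine rather than at the job itself.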