Running NAMD in parallel on two machines with three CUDA GPUs

From: Darko Stefanovski (stefanov_at_usc.edu)
Date: Thu May 05 2011 - 11:53:40 CDT

Hi All,
  
We are having difficulty running NAMD on two identical machines with three GTX 580 CUDA GPUs (linked by a three-way SLI bridge). Initially we had trouble getting the run-time libraries to load via LD_LIBRARY_PATH, but that problem seems to have been resolved by using the runscript from the NAMD website. However, charmrun still exits with error code 1 on host2. We execute the command from host1. Does anybody have any ideas on how we should proceed?
  
  Best wishes,
  Darko Stefanovski
  
  P.S. The log is below:
  
  Charmrun remote shell(host1_.4)> remote responding...
  Charmrun remote shell(host1_.4)> starting node-program...
  Charmrun remote shell(host1_.4)> rsh phase successful.
  Charmrun remote shell(host1_.6)> remote responding...
  Charmrun remote shell(host1_.6)> starting node-program...
  Charmrun remote shell(host1_.6)> rsh phase successful.
  Charmrun remote shell(host1_.2)> remote responding...
  Charmrun remote shell(host1_.2)> starting node-program...
  Charmrun remote shell(host1_.2)> rsh phase successful.
  Charmrun remote shell(host1_.0)> remote responding...
  Charmrun remote shell(host1_.0)> starting node-program...
  Charmrun remote shell(host1_.0)> rsh phase successful.
  Charmrun remote shell(host2_.7)> remote responding...
  Charmrun remote shell(host2_.7)> starting node-program...
  Charmrun remote shell(host2_.7)> rsh phase successful.
  Charmrun remote shell(host2_.5)> remote responding...
  Charmrun remote shell(host2_.5)> starting node-program...
  Charmrun remote shell(host2_.5)> rsh phase successful.
  Charmrun remote shell(host2_.3)> remote responding...
  Charmrun remote shell(host2_.3)> starting node-program...
  Charmrun remote shell(host2_.3)> rsh phase successful.
  Charmrun remote shell(host2_.1)> remote responding...
  Charmrun remote shell(host2_.1)> starting node-program...
  Charmrun remote shell(host2_.1)> rsh phase successful.
  Charmrun remote shell(host2_.7)> Exiting with error code 1
  Charmrun remote shell(host2_.5)> Exiting with error code 1
  Charmrun remote shell(host2_.3)> Exiting with error code 1
  Charmrun remote shell(host2_.1)> Exiting with error code 1
  Charmrun> adding client 0: "host1_", IP:128.0.0.1
  Charmrun> adding client 1: "host2_", IP:128.0.0.2
  Charmrun> adding client 2: "host1_", IP:128.0.0.1
  Charmrun> adding client 3: "host2_", IP:128.0.0.2
  Charmrun> adding client 4: "host1_", IP:128.0.0.1
  Charmrun> adding client 5: "host2_", IP:128.0.0.2
  Charmrun> adding client 6: "host1_", IP:128.0.0.1
  Charmrun> adding client 7: "host2_", IP:128.0.0.2
  Charmrun> Charmrun = 128.0.0.1, port = 44971
  Charmrun> Sending "0 128.0.0.1 44971 19009 0" to client 0.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 0.
  Charmrun> Starting rsh host1_ -l robert /bin/sh -f
  Charmrun> Sending "1 128.0.0.1 44971 19009 0" to client 1.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 1.
  Charmrun> Starting rsh host2_ -l robert /bin/sh -f
  Charmrun> Sending "2 128.0.0.1 44971 19009 0" to client 2.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 2.
  Charmrun> Starting rsh host1_ -l robert /bin/sh -f
  Charmrun> Sending "3 128.0.0.1 44971 19009 0" to client 3.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 3.
  Charmrun> Starting rsh host2_ -l robert /bin/sh -f
  Charmrun> Sending "4 128.0.0.1 44971 19009 0" to client 4.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 4.
  Charmrun> Starting rsh host1_ -l robert /bin/sh -f
  Charmrun> Sending "5 128.0.0.1 44971 19009 0" to client 5.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 5.
  Charmrun> Starting rsh host2_ -l robert /bin/sh -f
  Charmrun> Sending "6 128.0.0.1 44971 19009 0" to client 6.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 6.
  Charmrun> Starting rsh host1_ -l robert /bin/sh -f
  Charmrun> Sending "7 128.0.0.1 44971 19009 0" to client 7.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 7.
  Charmrun> Starting rsh host2_ -l robert /bin/sh -f
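
For a log like the one above, a small script can tally which hosts reported a non-zero exit. What follows is only a minimal sketch in Python and is not part of NAMD or Charm++. The function name `failed_hosts` and the regular expression are assumptions, based solely on the "Exiting with error code" line format that appears in this log.

```python
import re
from collections import defaultdict

# Pattern assumed from the log above, e.g.
# "Charmrun remote shell(host2_.7)> Exiting with error code 1"
EXIT_LINE = re.compile(
    r"Charmrun remote shell\((?P<host>[^)]*)\.(?P<rank>\d+)\)> "
    r"Exiting with error code (?P<code>\d+)"
)


def failed_hosts(log_text):
    """Map each host name to a sorted list of (rank, exit_code) pairs."""
    failures = defaultdict(list)
    for line in log_text.splitlines():
        match = EXIT_LINE.search(line)
        if match:
            failures[match.group("host")].append(
                (int(match.group("rank")), int(match.group("code")))
            )
    return {host: sorted(pairs) for host, pairs in failures.items()}


if __name__ == "__main__":
    # Three lines in the same format as the log in this message.
    sample = """\
  Charmrun remote shell(host1_.0)> rsh phase successful.
  Charmrun remote shell(host2_.7)> Exiting with error code 1
  Charmrun remote shell(host2_.1)> Exiting with error code 1
"""
    print(failed_hosts(sample))
```

Applied to the full log, this would list only host2_ (ranks 1, 3, 5 and 7), which matches the observation that every process launched on host1 got through the rsh phase without reporting an exit code.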
