Re: Running NAMD parallel on two machines with 3 CUDAs

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri May 06 2011 - 00:36:27 CDT

Hi Darko,

I had similar problems at the beginning. In general, charmrun should pass
the library environment configuration through to all nodes. For me it
worked once I set the charmrun parameter ++local, so that charmrun knows it
should also run on the machine the script was started from and passes
LD_LIBRARY_PATH through to the other nodes from there.
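
To check whether LD_LIBRARY_PATH actually arrives on a node, a small helper
like the following can be run there. It is only a sketch: the function name
find_library is made up for illustration, and the library name libcudart.so
is an assumption about which CUDA runtime file the NAMD binary needs.

```python
import os


def find_library(name, search_path=None):
    """List the directories of an LD_LIBRARY_PATH-style string that
    contain at least one file whose name starts with `name`."""
    if search_path is None:
        # Fall back to the environment of the current process.
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in search_path.split(":"):
        # Skip empty entries and directories that do not exist on this node.
        if not directory or not os.path.isdir(directory):
            continue
        if any(entry.startswith(name) for entry in os.listdir(directory)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # "libcudart.so" is an assumed library name, not taken from the thread.
    found = find_library("libcudart.so")
    print(found if found else "libcudart.so not found on LD_LIBRARY_PATH")
```

Running it once on each host, through the same remote shell that charmrun
uses, shows whether the two environments differ.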

Best regards

Norman Geist.

-----Original Message-----
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
Of Darko Stefanovski
Sent: Thursday, May 5, 2011 18:54
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: Running NAMD parallel on two machines with 3 CUDAs

Hi All,
  
We are having some difficulties running NAMD on two identical machines with
three GTX 580 CUDA cards (linked by a three-way SLI bridge). Initially, we
had some issues getting the run-time libraries to work using
LD_LIBRARY_PATH, but that problem seems to have been resolved by using the
runscript from the NAMD website. Still, charmrun is exiting with error code
1 on host2. We execute the command from host1. Does anybody have any ideas
on how we should proceed?
  
  Best wishes,
  Darko Stefanovski
  
  P.S. Below you will find the log:
  
  Charmrun remote shell(host1_.4)> remote responding...
  Charmrun remote shell(host1_.4)> starting node-program...
  Charmrun remote shell(host1_.4)> rsh phase successful.
  Charmrun remote shell(host1_.6)> remote responding...
  Charmrun remote shell(host1_.6)> starting node-program...
  Charmrun remote shell(host1_.6)> rsh phase successful.
  Charmrun remote shell(host1_.2)> remote responding...
  Charmrun remote shell(host1_.2)> starting node-program...
  Charmrun remote shell(host1_.2)> rsh phase successful.
  Charmrun remote shell(host1_.0)> remote responding...
  Charmrun remote shell(host1_.0)> starting node-program...
  Charmrun remote shell(host1_.0)> rsh phase successful.
  Charmrun remote shell(host2_.7)> remote responding...
  Charmrun remote shell(host2_.7)> starting node-program...
  Charmrun remote shell(host2_.7)> rsh phase successful.
  Charmrun remote shell(host2_.5)> remote responding...
  Charmrun remote shell(host2_.5)> starting node-program...
  Charmrun remote shell(host2_.5)> rsh phase successful.
  Charmrun remote shell(host2_.3)> remote responding...
  Charmrun remote shell(host2_.3)> starting node-program...
  Charmrun remote shell(host2_.3)> rsh phase successful.
  Charmrun remote shell(host2_.1)> remote responding...
  Charmrun remote shell(host2_.1)> starting node-program...
  Charmrun remote shell(host2_.1)> rsh phase successful.
  Charmrun remote shell(host2_.7)> Exiting with error code 1
  Charmrun remote shell(host2_.5)> Exiting with error code 1
  Charmrun remote shell(host2_.3)> Exiting with error code 1
  Charmrun remote shell(host2_.1)> Exiting with error code 1
  Charmrun> adding client 0: "host1_", IP:128.0.0.1
  Charmrun> adding client 1: "host2_", IP:128.0.0.2
  Charmrun> adding client 2: "host1_", IP:128.0.0.1
  Charmrun> adding client 3: "host2_", IP:128.0.0.2
  Charmrun> adding client 4: "host1_", IP:128.0.0.1
  Charmrun> adding client 5: "host2_", IP:128.0.0.2
  Charmrun> adding client 6: "host1_", IP:128.0.0.1
  Charmrun> adding client 7: "host2_", IP:128.0.0.2
  Charmrun> Charmrun = 128.0.0.1, port = 44971
  Charmrun> Sending "0 128.0.0.1 44971 19009 0" to client 0.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 0.
  Charmrun> Starting rsh host1_ -l robert /bin/sh -f
  Charmrun> Sending "1 128.0.0.1 44971 19009 0" to client 1.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 1.
  Charmrun> Starting rsh host2_ -l robert /bin/sh -f
  Charmrun> Sending "2 128.0.0.1 44971 19009 0" to client 2.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 2.
  Charmrun> Starting rsh host1_ -l robert /bin/sh -f
  Charmrun> Sending "3 128.0.0.1 44971 19009 0" to client 3.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 3.
  Charmrun> Starting rsh host2_ -l robert /bin/sh -f
  Charmrun> Sending "4 128.0.0.1 44971 19009 0" to client 4.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 4.
  Charmrun> Starting rsh host1_ -l robert /bin/sh -f
  Charmrun> Sending "5 128.0.0.1 44971 19009 0" to client 5.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 5.
  Charmrun> Starting rsh host2_ -l robert /bin/sh -f
  Charmrun> Sending "6 128.0.0.1 44971 19009 0" to client 6.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 6.
  Charmrun> Starting rsh host1_ -l robert /bin/sh -f
  Charmrun> Sending "7 128.0.0.1 44971 19009 0" to client 7.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 7.
  Charmrun> Starting rsh host2_ -l robert /bin/sh -f
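
In a log like the one above, the failing processes can be picked out
automatically. The following is a minimal sketch that assumes the line
format seen in this log ("Charmrun remote shell(HOST.RANK)> MESSAGE"); the
function name failed_ranks is made up for illustration and is not part of
charmrun or NAMD.

```python
import re

# Matches e.g. "Charmrun remote shell(host2_.7)> Exiting with error code 1".
# The greedy host group lets host names that contain dots still parse.
SHELL_LINE = re.compile(
    r"Charmrun remote shell\((?P<host>.+)\.(?P<rank>\d+)\)> (?P<msg>.*)"
)
EXIT_MSG = re.compile(r"Exiting with error code (\d+)")


def failed_ranks(log_text):
    """Return {host: {rank: exit_code}} for every remote shell that
    reported an 'Exiting with error code N' message."""
    failures = {}
    for line in log_text.splitlines():
        shell = SHELL_LINE.search(line)
        if shell is None:
            continue
        exit_match = EXIT_MSG.search(shell.group("msg"))
        if exit_match is None:
            continue
        host = shell.group("host")
        rank = int(shell.group("rank"))
        failures.setdefault(host, {})[rank] = int(exit_match.group(1))
    return failures


if __name__ == "__main__":
    import sys
    # Usage: python this_script.py < charmrun.log
    for host, ranks in failed_ranks(sys.stdin.read()).items():
        print(host, sorted(ranks.items()))
```

Applied to this log, it would report only the processes started on host2_,
which matches the observation that host1_ starts cleanly.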
