Re: Running NAMD parallel on two machines with 3 CUDAs

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri May 06 2011 - 00:41:11 CDT

Hi Darko,

It's me again. Another question came to mind while rereading your post.
As far as I know, you should not use SLI when running CUDA, because SLI only
ties the GPUs together in ways that matter for high frame throughput, not
for calculations. I think you have to deactivate SLI if you want to use the
GPUs for computing, but correct me if I'm wrong. Do the jobs run fine on
one machine using multiple GPUs?

Best regards

Norman Geist.

-----Original Message-----
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
Of Darko Stefanovski
Sent: Thursday, May 5, 2011 18:54
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: Running NAMD parallel on two machines with 3 CUDAs

Hi All,
  
We are having some difficulties running NAMD on two identical machines with
three GTX 580 CUDA cards (linked by a three-way SLI bridge). Initially, we had
some issues getting the run-time libraries to work using
LD_LIBRARY_PATH, but that problem seems to have been resolved by using the
runscript from the NAMD website. Still, charmrun is exiting with error code
1 on host2. We execute the command from host1. We were wondering if
anybody has any ideas on how we should proceed?
  
  Best wishes,
  Darko Stefanovski
  
  P.S. Below you will find the log:
  
  Charmrun remote shell(host1_.4)> remote responding...
  Charmrun remote shell(host1_.4)> starting node-program...
  Charmrun remote shell(host1_.4)> rsh phase successful.
  Charmrun remote shell(host1_.6)> remote responding...
  Charmrun remote shell(host1_.6)> starting node-program...
  Charmrun remote shell(host1_.6)> rsh phase successful.
  Charmrun remote shell(host1_.2)> remote responding...
  Charmrun remote shell(host1_.2)> starting node-program...
  Charmrun remote shell(host1_.2)> rsh phase successful.
  Charmrun remote shell(host1_.0)> remote responding...
  Charmrun remote shell(host1_.0)> starting node-program...
  Charmrun remote shell(host1_.0)> rsh phase successful.
  Charmrun remote shell(host2_.7)> remote responding...
  Charmrun remote shell(host2_.7)> starting node-program...
  Charmrun remote shell(host2_.7)> rsh phase successful.
  Charmrun remote shell(host2_.5)> remote responding...
  Charmrun remote shell(host2_.5)> starting node-program...
  Charmrun remote shell(host2_.5)> rsh phase successful.
  Charmrun remote shell(host2_.3)> remote responding...
  Charmrun remote shell(host2_.3)> starting node-program...
  Charmrun remote shell(host2_.3)> rsh phase successful.
  Charmrun remote shell(host2_.1)> remote responding...
  Charmrun remote shell(host2_.1)> starting node-program...
  Charmrun remote shell(host2_.1)> rsh phase successful.
  Charmrun remote shell(host2_.7)> Exiting with error code 1
  Charmrun remote shell(host2_.5)> Exiting with error code 1
  Charmrun remote shell(host2_.3)> Exiting with error code 1
  Charmrun remote shell(host2_.1)> Exiting with error code 1
  Charmrun> adding client 0: "host1_", IP:128.0.0.1
  Charmrun> adding client 1: "host2_", IP:128.0.0.2
  Charmrun> adding client 2: "host1_", IP:128.0.0.1
  Charmrun> adding client 3: "host2_", IP:128.0.0.2
  Charmrun> adding client 4: "host1_", IP:128.0.0.1
  Charmrun> adding client 5: "host2_", IP:128.0.0.2
  Charmrun> adding client 6: "host1_", IP:128.0.0.1
  Charmrun> adding client 7: "host2_", IP:128.0.0.2
  Charmrun> Charmrun = 128.0.0.1, port = 44971
  Charmrun> Sending "0 128.0.0.1 44971 19009 0" to client 0.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 0.
  Charmrun> Starting rsh host1_ -l robert /bin/sh -f
  Charmrun> Sending "1 128.0.0.1 44971 19009 0" to client 1.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 1.
  Charmrun> Starting rsh host2_ -l robert /bin/sh -f
  Charmrun> Sending "2 128.0.0.1 44971 19009 0" to client 2.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 2.
  Charmrun> Starting rsh host1_ -l robert /bin/sh -f
  Charmrun> Sending "3 128.0.0.1 44971 19009 0" to client 3.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 3.
  Charmrun> Starting rsh host2_ -l robert /bin/sh -f
  Charmrun> Sending "4 128.0.0.1 44971 19009 0" to client 4.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 4.
  Charmrun> Starting rsh host1_ -l robert /bin/sh -f
  Charmrun> Sending "5 128.0.0.1 44971 19009 0" to client 5.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 5.
  Charmrun> Starting rsh host2_ -l robert /bin/sh -f
  Charmrun> Sending "6 128.0.0.1 44971 19009 0" to client 6.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 6.
  Charmrun> Starting rsh host1_ -l robert /bin/sh -f
  Charmrun> Sending "7 128.0.0.1 44971 19009 0" to client 7.
  Charmrun> find the node program "/home/robert/Documents/namd/host2" at
  "/home/robert/Documents/namd" for 7.
  Charmrun> Starting rsh host2_ -l robert /bin/sh -f
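
A log like the one above is easier to read when the failing ranks are grouped by host. The following is only a minimal sketch, assuming the "Exiting with error code" lines always have the form shown in this log. The function name failed_ranks and the SAMPLE excerpt are made up for illustration and are not part of NAMD or charmrun.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Charmrun remote shell(host2_.7)> Exiting with error code 1
# Assumption: the text in parentheses is "<host>.<rank>", as in the log above.
EXIT_RE = re.compile(
    r"Charmrun remote shell\((?P<host>.+)\.(?P<rank>\d+)\)> "
    r"Exiting with error code (?P<code>\d+)"
)


def failed_ranks(log_text):
    """Return {host: sorted list of ranks} for every rank that reported an error exit."""
    failures = defaultdict(list)
    for match in EXIT_RE.finditer(log_text):
        failures[match["host"]].append(int(match["rank"]))
    return {host: sorted(ranks) for host, ranks in failures.items()}


# Short excerpt copied from the log in this post (hypothetical variable name).
SAMPLE = """\
  Charmrun remote shell(host1_.0)> rsh phase successful.
  Charmrun remote shell(host2_.1)> rsh phase successful.
  Charmrun remote shell(host2_.7)> Exiting with error code 1
  Charmrun remote shell(host2_.5)> Exiting with error code 1
  Charmrun remote shell(host2_.3)> Exiting with error code 1
  Charmrun remote shell(host2_.1)> Exiting with error code 1
"""

if __name__ == "__main__":
    for host, ranks in failed_ranks(SAMPLE).items():
        print(f"{host}: ranks {ranks} exited with an error")
```

For the log in this post, such a summary would list only host2_, which matches the observation that every process started on host1 survives the rsh phase while every process on host2 exits.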
