Re: Re: Running NAMD parallel on two machines with 3 CUDAs

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri May 06 2011 - 01:36:20 CDT

Hi,

Nice to hear that you could fix the library issue.

Regarding the GPU issue, I just can't believe that Ubuntu can only use two
GPUs, because I know of machines with as many as 8 GPUs in use, like the
"fluidyna typhoon". I think what they meant in the SLI forum is that
three-way SLI only works under Windows Vista and 7, but you don't need SLI
to use the GTX cards for computing. In my opinion the problem is your
xorg.conf file. I could imagine that when the three cards run in SLI, the
NVIDIA driver presents only one bundled GPU to the X server; without SLI, X
suddenly sees three GPUs and is not configured for that. Another common
problem when adding PCI devices is that the PCI device IDs change while the
old IDs are still configured in xorg.conf, so the wrong device is targeted.
Do you get NO image on screen, or only text mode? Also try connecting the
monitor to each card in turn, in case the device ID has changed. If you have
text mode, just try the nvidia-xconfig command that ships with the driver.
The tool should write a new xorg.conf for you, so your X server will start
up after the next reboot, "startx", or "service kdm restart".
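
For illustration, an xorg.conf that declares each card explicitly might look
roughly like the sketch below. The BusID values are only placeholders; the
real ones can be read with "lspci | grep -i nvidia" (note that lspci prints
bus numbers in hex, while xorg.conf expects decimal):

  Section "Device"
      Identifier "GPU0"
      Driver     "nvidia"
      BusID      "PCI:3:0:0"
  EndSection

  Section "Device"
      Identifier "GPU1"
      Driver     "nvidia"
      BusID      "PCI:4:0:0"
  EndSection

  Section "Device"
      Identifier "GPU2"
      Driver     "nvidia"
      BusID      "PCI:5:0:0"
  EndSection

  Section "Screen"
      Identifier "Screen0"
      Device     "GPU0"
  EndSection

Or simply let the driver regenerate the file and restart X:

  sudo nvidia-xconfig
  sudo service kdm restart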

Best regards

Norman Geist.

-----Original Message-----
From: Darko Stefanovski [mailto:stefanov_at_usc.edu]
Sent: Friday, May 6, 2011 08:19
To: Norman Geist
Subject: Re: Re: namd-l: Running NAMD parallel on two machines with 3 CUDAs

Hi Norman,

Thank you very much for your reply. I finally resolved the problem with the
libcudart.so.2 library by copying it to the /usr/lib directory instead of
using the ++runscript option in NAMD. In regards to the GPUs: I wrongly
called the GPUs CUDAs, and I apologize for that. We have two computers with
3 GPUs per computer (NVIDIA GTX 580). Initially, I ran the system with 2
GPUs and everything worked fine. However, when I added the third GPU I could
not get any output on the monitor. Finally, I found some SLI forums that
suggested we use a three-way SLI bridge. As soon as I installed the bridge I
got my display working. However, under Ubuntu 10.10 the NVIDIA Server
Control Panel actually shows only two GPUs, and according to the SLI forum,
only Vista and Windows 7 can use 3 GPUs. This is as far as I got setting up
our systems. To answer your question: we use a maximum of 2 GPUs per system
and make sure that we have 2 cores per GPU.
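
For reference, a launch line along the following lines should enforce that,
as far as I understand the options of the CUDA build; the nodelist path,
process count, and config file name here are placeholders, not our exact
setup:

  ./charmrun ++nodelist ~/nodelist +p8 ./namd2 +idlepoll +devices 0,1 myrun.namd

Here +devices 0,1 keeps NAMD on the first two GPUs of each host, +p8 spreads
four processes over the two hosts (2 cores per GPU), and +idlepoll is
recommended for the CUDA binaries.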

Best wishes,
Darko

On May 5, 2011, at 10:41 PM, Norman Geist wrote:

> Hi Darko,
>
> it's me again. Another question came to mind while reading your post
> again. As far as I know you may not use SLI when running CUDA, because SLI
> ties up resources in the GPU that are meant for high frame throughput, not
> for calculations. I think you have to deactivate SLI if you want to use the
> GPUs for computing, but correct me if I'm wrong: do the jobs run fine on
> one machine using multiple GPUs?
>
> Best regards
>
> Norman Geist.
>
>
> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
> Of Darko Stefanovski
> Sent: Thursday, May 5, 2011 18:54
> To: namd-l_at_ks.uiuc.edu
> Subject: namd-l: Running NAMD parallel on two machines with 3 CUDAs
>
> Hi All,
>
> We are having some difficulties running NAMD on two identical machines
> with three GTX 580 CUDAs (linked by a three-way SLI bridge). Initially, we
> had some issues with getting the run-time libraries to work using
> LD_LIBRARY_PATH, but it seems that the problem was resolved using the
> runscript from the NAMD website. Still, charmrun is exiting with error code
> 1 on host2. We execute the command from host1. We were wondering if
> anybody has ideas on how we should proceed?
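>
> In case it helps, the nodelist file we pass to charmrun is essentially the
> standard two-host form (hostnames anonymized the same way as in the log
> below):
>
>   group main
>   host host1
>   host host2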
>
> Best wishes,
> Darko Stefanovski
>
> P.S. Below you will find the log:
>
> Charmrun remote shell(host1_.4)> remote responding...
> Charmrun remote shell(host1_.4)> starting node-program...
> Charmrun remote shell(host1_.4)> rsh phase successful.
> Charmrun remote shell(host1_.6)> remote responding...
> Charmrun remote shell(host1_.6)> starting node-program...
> Charmrun remote shell(host1_.6)> rsh phase successful.
> Charmrun remote shell(host1_.2)> remote responding...
> Charmrun remote shell(host1_.2)> starting node-program...
> Charmrun remote shell(host1_.2)> rsh phase successful.
> Charmrun remote shell(host1_.0)> remote responding...
> Charmrun remote shell(host1_.0)> starting node-program...
> Charmrun remote shell(host1_.0)> rsh phase successful.
> Charmrun remote shell(host2_.7)> remote responding...
> Charmrun remote shell(host2_.7)> starting node-program...
> Charmrun remote shell(host2_.7)> rsh phase successful.
> Charmrun remote shell(host2_.5)> remote responding...
> Charmrun remote shell(host2_.5)> starting node-program...
> Charmrun remote shell(host2_.5)> rsh phase successful.
> Charmrun remote shell(host2_.3)> remote responding...
> Charmrun remote shell(host2_.3)> starting node-program...
> Charmrun remote shell(host2_.3)> rsh phase successful.
> Charmrun remote shell(host2_.1)> remote responding...
> Charmrun remote shell(host2_.1)> starting node-program...
> Charmrun remote shell(host2_.1)> rsh phase successful.
> Charmrun remote shell(host2_.7)> Exiting with error code 1
> Charmrun remote shell(host2_.5)> Exiting with error code 1
> Charmrun remote shell(host2_.3)> Exiting with error code 1
> Charmrun remote shell(host2_.1)> Exiting with error code 1
> Charmrun> adding client 0: "host1_", IP:128.0.0.1
> Charmrun> adding client 1: "host2_", IP:128.0.0.2
> Charmrun> adding client 2: "host1_", IP:128.0.0.1
> Charmrun> adding client 3: "host2_", IP:128.0.0.2
> Charmrun> adding client 4: "host1_", IP:128.0.0.1
> Charmrun> adding client 5: "host2_", IP:128.0.0.2
> Charmrun> adding client 6: "host1_", IP:128.0.0.1
> Charmrun> adding client 7: "host2_", IP:128.0.0.2
> Charmrun> Charmrun = 128.0.0.1, port = 44971
> Charmrun> Sending "0 128.0.0.1 44971 19009 0" to client 0.
> Charmrun> find the node program "/home/robert/Documents/namd/host2" at
> "/home/robert/Documents/namd" for 0.
> Charmrun> Starting rsh host1_ -l robert /bin/sh -f
> Charmrun> Sending "1 128.0.0.1 44971 19009 0" to client 1.
> Charmrun> find the node program "/home/robert/Documents/namd/host2" at
> "/home/robert/Documents/namd" for 1.
> Charmrun> Starting rsh host2_ -l robert /bin/sh -f
> Charmrun> Sending "2 128.0.0.1 44971 19009 0" to client 2.
> Charmrun> find the node program "/home/robert/Documents/namd/host2" at
> "/home/robert/Documents/namd" for 2.
> Charmrun> Starting rsh host1_ -l robert /bin/sh -f
> Charmrun> Sending "3 128.0.0.1 44971 19009 0" to client 3.
> Charmrun> find the node program "/home/robert/Documents/namd/host2" at
> "/home/robert/Documents/namd" for 3.
> Charmrun> Starting rsh host2_ -l robert /bin/sh -f
> Charmrun> Sending "4 128.0.0.1 44971 19009 0" to client 4.
> Charmrun> find the node program "/home/robert/Documents/namd/host2" at
> "/home/robert/Documents/namd" for 4.
> Charmrun> Starting rsh host1_ -l robert /bin/sh -f
> Charmrun> Sending "5 128.0.0.1 44971 19009 0" to client 5.
> Charmrun> find the node program "/home/robert/Documents/namd/host2" at
> "/home/robert/Documents/namd" for 5.
> Charmrun> Starting rsh host2_ -l robert /bin/sh -f
> Charmrun> Sending "6 128.0.0.1 44971 19009 0" to client 6.
> Charmrun> find the node program "/home/robert/Documents/namd/host2" at
> "/home/robert/Documents/namd" for 6.
> Charmrun> Starting rsh host1_ -l robert /bin/sh -f
> Charmrun> Sending "7 128.0.0.1 44971 19009 0" to client 7.
> Charmrun> find the node program "/home/robert/Documents/namd/host2" at
> "/home/robert/Documents/namd" for 7.
> Charmrun> Starting rsh host2_ -l robert /bin/sh -f
>
