Re: how to run NAMD-CUDA on multiple nodes

From: Thomas Evangelidis (tevang3_at_gmail.com)
Date: Tue Nov 27 2012 - 10:13:07 CST

Hi Norman,

Thanks for your reply! Below are all the commands starting with "ib" that
are in the PATH:

ib_acme        ibdiagui         ibis
ib_read_bw     ibtopodiff       ibv_srq_pingpong
ib_write_lat
ib_clock_test  ibdmchk          IBMgtSim
ib_read_lat    ibv_asyncwatch   ibv_uc_pingpong
ibdev2netdev   ibdmsh           ibmsquit
ib_send_bw     ibv_devices      ibv_ud_pingpong
ibdiagnet      ibdmtr           ibmssh
ib_send_lat    ibv_devinfo      ib_write_bw
ibdiagpath     ibdump           ibnlparse
ibsim          ibv_rc_pingpong  ib_write_bw_postlist

And the output of /sbin/ifconfig:

ib0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:172.31.103.1 Bcast:172.31.255.255 Mask:255.255.0.0
          inet6 addr: fe80::202:c903:10:56af/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:4096 Metric:1
          RX packets:606075818 errors:0 dropped:0 overruns:0 frame:0
          TX packets:622801559 errors:0 dropped:116 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:103873275302 (96.7 GiB) TX bytes:164334746947 (153.0 GiB)

ib0:0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:10.20.100.1 Bcast:10.20.100.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:4096 Metric:1

ib0:1 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:10.20.100.4 Bcast:10.20.100.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:4096 Metric:1

The /sys/class/net/ib0/ folder also exists. I am able to run the ibverbs
NAMD version I compiled on multiple nodes without the GPUs, with great speed
gains. There are no "ib_ping", "ibhosts" or "ibnodes" commands on either the
login server or the nodes. I have also written a runscript.sh script for
bash, but it yields the same error.
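
For readers following along, a bash wrapper for charmrun's ++runscript
option could look roughly like the sketch below. It is not the exact
runscript.sh used here, which is not shown; the paths are copied from the
runscript.csh quoted further down, and the guard against an unset
LD_LIBRARY_PATH is an added assumption. Note that the quoted csh script uses
sh-style NAME=value assignments, which csh itself does not accept (csh uses
"set NAME = value"), so the sh/bash form is the more direct way to write it.

```shell
#!/bin/bash
# Sketch of a wrapper for charmrun's ++runscript option (assumed layout).
# Paths are copied from the quoted runscript.csh; adjust to your install.
CHARM_ARCH="net-linux-x86_64-ibverbs-ifort-smp-icc"
NAMD_BIN="/gpfs/home/lspro220u2/Opt/NAMD_CVS-2012-09-22_Source/charm++_${CHARM_ARCH}/Linux-x86_64-icc"

# Prepend the NAMD directory; skip the trailing ':' if the variable is unset.
export LD_LIBRARY_PATH="${NAMD_BIN}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Run the command line charmrun passed in, keeping argument boundaries intact.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

Using "$@" rather than $* keeps arguments that contain spaces in one piece.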

How do you find out whether IPoIB is properly set up? If you can't work out
what may be wrong, please give me some instructions about what to ask the
cluster administrators.
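
The checks discussed in this thread can be collected into a small read-only
script, sketched below. It only looks at things already mentioned here: the
/sys/class/net/ib0/ folder and whether a few of the ib* commands are on the
PATH. The IFACE and PEER variables are hypothetical names chosen for this
example, and the operstate file is assumed to be present as on a typical
Linux system. A clean run does not prove IPoIB is configured correctly; it
just gathers facts to pass on to the administrators.

```shell
#!/bin/sh
# Read-only IPoIB sanity checks; prints findings and changes nothing.

# Print the kernel's view of an interface, or "absent" if it does not exist.
iface_state() {
    if [ -d "/sys/class/net/$1" ]; then
        cat "/sys/class/net/$1/operstate" 2>/dev/null || echo "unknown"
    else
        echo "absent"
    fi
}

# Report whether a diagnostic command is available on the PATH.
have_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

IFACE="${IFACE:-ib0}"   # interface name; ib0 is what ifconfig reported
echo "$IFACE state: $(iface_state "$IFACE")"

for c in ibv_devinfo ibv_rc_pingpong ibhosts ibnodes; do
    have_cmd "$c"
done

# Optional: set PEER to another node's IPoIB address to test reachability.
if [ -n "${PEER:-}" ]; then
    ping -c 3 -I "$IFACE" "$PEER"
fi
```

Run it on two compute nodes, for example with PEER set to the other node's
ib0 address, and compare the output.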

many thanks,
Thomas

On 27 November 2012 16:57, Norman Geist <norman.geist_at_uni-greifswald.de> wrote:

> Hi,
>
> seems to be a different problem. Try /sbin/ifconfig.
>
> If there is really no ifconfig, check whether the folder /sys/class/net/ib0/
> exists. This will also show whether you already have IPoIB installed and
> loaded.
>
> The other thing is that your nodes seem unable to reach each other over
> the InfiniBand with ibverbs. Do other people use the InfiniBand
> successfully? Usually there are some commands starting with ib, like
> ib_ping for example. You should use them to check the connectivity of the
> nodes. The output of ibhosts and ibnodes is also useful for checking the
> connectivity and setup.
>
> The runscript option itself is OK, but will your job also start in a csh
> environment? (I don't know right now whether this makes a difference.)
>
> Let me know.
>
> Norman Geist.
>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
> *On Behalf Of* Thomas Evangelidis
> *Sent:* Tuesday, 27 November 2012 10:20
> *To:* Norman Geist
> *Cc:* Namd Mailing List
> *Subject:* Re: namd-l: how to run NAMD-CUDA on multiple nodes
>
> Hi Norman,
>
> I've read all the posts about it on the mailing list. There is no ifconfig
> command on the cluster nodes to see whether IPoIB is installed. I remember
> I did some other checks but didn't find IPoIB on the cluster. I don't use a
> precompiled binary; I compiled charm myself using the
> net-linux-x86_64-ibverbs-ifort-smp-icc flags. Can I compile the IPoIB
> driver on my own? Would I then be able to run NAMD-CUDA on multiple nodes?
>
> thanks,
> Thomas
>
> On 27 November 2012 10:09, Norman Geist <norman.geist_at_uni-greifswald.de>
> wrote:
>
> Hi Thomas,
>
> this problem has been posted a lot. The error you see is due to the
> incompatibility of the precompiled ibverbs binaries with your IB
> installation. There are two ways to solve this:
>
> 1. Use a non-ibverbs binary with IPoIB.
>
> 2. Compile NAMD with ibverbs on your own.
>
> Good luck
>
> Norman Geist.
>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
> *On Behalf Of* Thomas Evangelidis
> *Sent:* Monday, 26 November 2012 14:57
> *To:* namd-l
> *Cc:* Norman Geist
> *Subject:* namd-l: how to run NAMD-CUDA on multiple nodes
>
> Greetings,
>
> Although I can run the ibverbs binary with CUDA on a single node, on
> multiple nodes I get:
>
> Charmrun> error 0 attaching to node:
> Timeout waiting for node-program to connect
> Charmrun> IBVERBS version of charmrun
>
> I use this command line in my pbs script for ibverbs binary with CUDA:
>
> $NAMD_BIN/charmrun ++runscript ./runscript.csh ++verbose ++remote-shell
> ssh ++nodelist $nodefile +p24 $NAMD_BIN/namd2 +setcpuaffinity +idlepoll
> prod.amber.GB.aMD.namd
>
> runscript.csh contents are:
>
> #!/bin/csh
> CHARM_ARCH="net-linux-x86_64-ibverbs-ifort-smp-icc"
>
> NAMD_BIN="/gpfs/home/lspro220u2/Opt/NAMD_CVS-2012-09-22_Source/charm++_$CHARM_ARCH/Linux-x86_64-icc"
> setenv LD_LIBRARY_PATH "$NAMD_BIN:$LD_LIBRARY_PATH"
> $*
>
> Is this the way to run NAMD-ibverbs-cuda on multiple nodes? If not, could
> you please give me the right command line?
>
> thanks,
> Thomas
>
> --
> ======================================================================
> Thomas Evangelidis
> PhD student
> University of Athens
> Faculty of Pharmacy
> Department of Pharmaceutical Chemistry
> Panepistimioupoli-Zografou
> 157 71 Athens
> GREECE
> email: tevang_at_pharm.uoa.gr
>           tevang3_at_gmail.com
> website: https://sites.google.com/site/thomasevangelidishomepage/

-- 
======================================================================
Thomas Evangelidis
PhD student
University of Athens
Faculty of Pharmacy
Department of Pharmaceutical Chemistry
Panepistimioupoli-Zografou
157 71 Athens
GREECE
email: tevang_at_pharm.uoa.gr
          tevang3_at_gmail.com
website: https://sites.google.com/site/thomasevangelidishomepage/

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:22:47 CST