From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed Nov 28 2012 - 03:45:53 CST
Ok, all this sounds good so far.
As a quick solution you can stay with the non-ibverbs version of NAMD CUDA,
even though a non-TCP build is usually faster. To get the expected speedup
you should stay in connected mode, but increase the MTU to 65520 to prevent
the IP packets from being fragmented:
$> echo "65520" > /sys/class/net/ib0/mtu
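The one-liner above can be wrapped with a small guard so the MTU is only raised when the interface really is in connected mode (a hedged sketch; the interface name ib0 is an assumption, and the scratch-directory fallback exists only so the logic can be tried on a machine without InfiniBand):

```shell
# Raise the IPoIB MTU to 65520, but only when the interface is in
# connected mode (datagram mode caps the MTU much lower). When no
# writable ib0 interface exists (e.g. trying this out on a workstation),
# fall back to a scratch copy so the logic can still be exercised.
SYSFS=/sys/class/net/ib0
if [ ! -d "$SYSFS" ] || [ ! -w "$SYSFS/mtu" ]; then
    SYSFS=$(mktemp -d)
    echo connected > "$SYSFS/mode"
    echo 4096      > "$SYSFS/mtu"
fi
if [ "$(cat "$SYSFS/mode")" = "connected" ]; then
    echo 65520 > "$SYSFS/mtu"
fi
cat "$SYSFS/mode" "$SYSFS/mtu"
```

Note the change does not survive a reboot; for a permanent setting ask the administrators to configure it in the interface scripts on every node.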
This should temporarily give nice scaling. It is possible that ibverbs
would provide better scaling than IPoIB, but to find out you will have to
get it to work first. For that you should recompile the NAMD CUDA version
with ibverbs. Remember that you may not use any CUDA features of charm++,
and that you need the OFED driver installed for the charm++ InfiniBand
support to work. Otherwise you will have to use MPI (e.g. OpenMPI), as I
understand it.
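The rebuild described above can be sketched roughly as follows; the directory and version names (charm-6.4.0, NAMD_2.9_Source) are assumptions, so treat the notes.txt shipped with your NAMD source as the authoritative recipe:

```shell
# Build charm++ with the ibverbs network layer, then point the NAMD
# configure step at that charm arch and enable CUDA.
cd charm-6.4.0
./build charm++ net-linux-x86_64 ibverbs --with-production
cd ../NAMD_2.9_Source
./config Linux-x86_64-g++ --charm-arch net-linux-x86_64-ibverbs --with-cuda
cd Linux-x86_64-g++ && make
```

The resulting namd2 is launched with charmrun and a nodelist file, the same way as the non-CUDA ibverbs binary.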
Good luck.
Norman Geist.
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
Of Thomas Evangelidis
Sent: Wednesday, 28 November 2012 08:45
To: Norman Geist
Cc: Namd Mailing List
Subject: Re: namd-l: how to run NAMD-CUDA on multiple nodes
Hi Norman,
$ cat /sys/class/net/ib0/m*
connected
4096
Ok, you can run ibverbs binaries without GPU on the same nodes and network?
Yes
Basically your setup looks fine regarding IPoIB. So you could also try to
run a non-ibverbs CUDA binary and use IP traffic instead. What's the output
of:
cat /sys/class/net/ib0/m*
$ cat /sys/class/net/ib0/m*
connected
4096
I can currently run NAMD-CUDA using a net-linux-x86_64-ifort-smp-icc binary
I compiled, but not the ibverbs binary. However, I get no performance gain
when I run it on multiple nodes with GPUs; the speed stays almost the same.
What happens if you start the run without the runscript? Do you get the
library not found message or something else?
I get that message about libcudart.so.4 not found.
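That message means the dynamic linker on the compute nodes cannot locate the CUDA runtime; exporting LD_LIBRARY_PATH before launching namd2 (which is what a runscript normally does) is the usual fix. A minimal sketch, assuming CUDA is installed under /usr/local/cuda on every node:

```shell
# Make the CUDA runtime visible to the dynamic linker. The install path
# below is an assumption -- use wherever libcudart.so.4 actually lives
# on your nodes (the NAMD release binaries also ship a copy next to
# namd2, which can be put on this path instead).
CUDA_LIB=/usr/local/cuda/lib64
export LD_LIBRARY_PATH="${CUDA_LIB}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```

The export has to take effect in the remote shells charmrun spawns, not just in your login shell, which is why putting it in the runscript (or in your shell rc file on the nodes) matters.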
thanks,
Thomas
Norman Geist.
> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Thomas Evangelidis
> Sent: Tuesday, 27 November 2012 17:13
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: namd-l: how to run NAMD-CUDA on multiple nodes
>
> Hi Norman,
>
> Thanks for your reply! Below are all the commands starting with "ib"
> that are in the PATH:
>
> ib_acme ibdiagui ibis
> ib_read_bw ibtopodiff ibv_srq_pingpong
> ib_write_lat
> ib_clock_test ibdmchk IBMgtSim
> ib_read_lat ibv_asyncwatch ibv_uc_pingpong
> ibdev2netdev ibdmsh ibmsquit
> ib_send_bw ibv_devices ibv_ud_pingpong
> ibdiagnet ibdmtr ibmssh
> ib_send_lat ibv_devinfo ib_write_bw
> ibdiagpath ibdump ibnlparse
> ibsim ibv_rc_pingpong ib_write_bw_postlist
>
> And the output of /sbin/ifconfig:
>
> ib0 Link encap:InfiniBand HWaddr
> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
> inet addr:172.31.103.1 Bcast:172.31.255.255
> Mask:255.255.0.0
> inet6 addr: fe80::202:c903:10:56af/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:4096 Metric:1
> RX packets:606075818 errors:0 dropped:0 overruns:0 frame:0
> TX packets:622801559 errors:0 dropped:116 overruns:0
> carrier:0
> collisions:0 txqueuelen:256
> RX bytes:103873275302 (96.7 GiB) TX bytes:164334746947
> (153.0
> GiB)
>
> ib0:0 Link encap:InfiniBand HWaddr
> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
> inet addr:10.20.100.1 Bcast:10.20.100.255
> Mask:255.255.255.0
> UP BROADCAST RUNNING MULTICAST MTU:4096 Metric:1
>
> ib0:1 Link encap:InfiniBand HWaddr
> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
> inet addr:10.20.100.4 Bcast:10.20.100.255
> Mask:255.255.255.0
> UP BROADCAST RUNNING MULTICAST MTU:4096 Metric:1
>
> There also exists a /sys/class/net/ib0/ folder. I am able to run the
> ibverbs NAMD version I compiled on multiple nodes without the GPUs, with
> great speed gains. There are no "ib_ping", "ibhosts" or "ibnodes"
> commands on the login server or on the nodes. I have also written a
> runscript.sh script for bash, but it yields the same error.
>
> How do you find out if IPoIB is properly set up? If you can't work out
> what may be wrong, please give me some instructions about what to ask
> the cluster administrators.
>
> many thanks,
> Thomas
>
>
>
>
> On 27 November 2012 16:57, Norman Geist <norman.geist_at_uni-
> greifswald.de>wrote:
>
> > Hi,
> >
> > seems to be a different problem. Try /sbin/ifconfig.
> >
> > If there is really no ifconfig, check if the folder /sys/class/net/ib0/
> > exists. This will also show whether you already have IPoIB installed and
> > loaded.
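The check described above can be scripted, for example (a sketch; the interface name ib0 is an assumption):

```shell
# Quick IPoIB sanity check: is there an ib0 interface, and if so, what
# mode and MTU is it using? Prints a note instead of failing on
# machines without InfiniBand.
if [ -d /sys/class/net/ib0 ]; then
    MSG=$(cat /sys/class/net/ib0/mode /sys/class/net/ib0/mtu)
else
    MSG="no IPoIB interface (ib0) found"
fi
echo "$MSG"
```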
--
======================================================================
Thomas Evangelidis
PhD student
University of Athens
Faculty of Pharmacy
Department of Pharmaceutical Chemistry
Panepistimioupoli-Zografou
157 71 Athens
GREECE
email: tevang_at_pharm.uoa.gr
       tevang3_at_gmail.com
website: https://sites.google.com/site/thomasevangelidishomepage/
This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:22:18 CST