Re: Asking help on results of our GPU benchmark

From: 周文昌 (wenchangyu2006_at_gmail.com)
Date: Thu Dec 18 2014 - 13:24:59 CST

Hi Norman,

Thanks for your time. We use ibverbs directly (I did mention ibverbs in the
4th paragraph).
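
For what it's worth, the HCA can also be checked directly even though no IPoIB
interface is configured; a quick sketch, assuming the usual libibverbs-utils /
infiniband-diags tools are installed:

ibv_devinfo | grep -E 'hca_id|port:|state|active_mtu'
ibstat | grep -E 'State|Rate'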

If I do /sbin/ifconfig -a, the output is following:

eth0 Link encap:Ethernet HWaddr 0C:C4:7A:0F:63:F0
          inet addr:10.1.3.1 Bcast:10.1.255.255 Mask:255.255.0.0
          inet6 addr: fe80::ec4:7aff:fe0f:63f0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:69053282 errors:0 dropped:0 overruns:0 frame:0
          TX packets:96176428 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:14237728569 (13.2 GiB) TX bytes:137024484424 (127.6 GiB)
          Memory:dfa20000-dfa3ffff

eth1 Link encap:Ethernet HWaddr 0C:C4:7A:0F:63:F1
          BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
          Memory:dfa00000-dfa1ffff

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:316277 errors:0 dropped:0 overruns:0 frame:0
          TX packets:316277 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:82376375 (78.5 MiB) TX bytes:82376375 (78.5 MiB)

Our numbers are below; I took the WallClock time at the end of each run
(100,000 steps) rather than the "Benchmark time" lines in the NAMD output.

           Number of nodes   WallClock time   ns/day
CPU only                 1           4372.5      0.4
CPU+GPU                  1           1220.4      1.4
CPU+GPU                  4            332.6      5.2
CPU+GPU                  8            208.2      8.3
CPU+GPU                 16            135.2     12.8
CPU+GPU                 32            106.3     16.3
CPU+GPU                 48             97.5     17.7
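
For reference, converting a total WallClock reading into ns/day only needs the
step count and the timestep from the config file; a minimal sketch (the dt_fs
and wall values below are placeholders, not numbers from the table above):

# ns/day = steps * dt_fs (in fs) * 1e-6, scaled to 86400 s per day
awk -v steps=100000 -v dt_fs=2.0 -v wall=1000 \
    'BEGIN { printf "%.2f ns/day\n", steps * dt_fs * 1e-6 * 86400 / wall }'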

Wenchang

2014-12-18 3:39 GMT-05:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
>
> Hi,
>
>
>
> given the fact that you didn’t use the word “ibverbs” in your post, I
> suppose that you run your network traffic across IPoIB (ib0), is that
> right?
>
> If so, could you please give me the output of:
>
>
>
> cat /sys/class/net/ib0/m*
>
>
>
> I suppose it will output something like:
>
>
>
> datagram
>
> 2044
>
>
>
> But it should be:
>
>
>
> connected
>
> 65520
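>
> For reference, switching an IPoIB interface into connected mode is normally a
> sysfs write plus a larger MTU; a rough sketch, assuming the in-kernel ipoib
> driver (run as root; some kernels want the interface brought down first):
>
> echo connected > /sys/class/net/ib0/mode
> /sbin/ifconfig ib0 mtu 65520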
>
>
>
> Also please give the output of:
>
>
>
> /sbin/ifconfig -a
>
>
>
> Additionally, could we please see your benchmark data (time/step or
> days/ns) for the 1, 2, 4, 8, and 16 node cases?
>
>
>
> Norman Geist.
>
>
>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] *On
> behalf of* 周文昌
> *Sent:* Wednesday, December 17, 2014 22:13
> *To:* namd-l_at_ks.uiuc.edu
> *Subject:* namd-l: Asking help on results of our GPU benchmark
>
>
>
> Dear all,
>
>
>
> We are asking for help here concerning our GPU benchmark results; we would
> greatly appreciate your reading this (sorry for such a long letter) if you
> have experience using GPUs.
>
>
> We are running NAMD on a cluster that consists of 48 nodes (dual E5-2630v2
> processors - 12 cores per node, 32 GB of RAM, and a single Tesla K20x GPU
> per node). The nodes are interconnected by a non-blocking FDR InfiniBand
> fat-tree topology. We are testing the scalability of NAMD, and are running
> into some issues.
>
>
>
> It seems that for a system of ~ 370K atoms, we are unable to scale beyond
> 16 nodes. We've tried both custom-compiling NAMD and using pre-built
> binaries (running version 2.10 in both cases). We get the best performance
> when custom compiling Charm++ and NAMD using Intel MPI version 5
> (charm-arch mpi-linux-x86_64-smp). We then run with one MPI process per
> node (-np X -ppn 1, where X is the number of nodes) and 12 threads (++ppn
> 12). However, as mentioned, we are unable to scale beyond 16 nodes.
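>
> As an illustration, a launch of that kind for the 16-node case might look like
> the following (a sketch only; the binary path and config name are placeholders):
>
> # one MPI rank per node, 12 threads per rank, with the mpi-linux-x86_64-smp build
> mpirun -np 16 -ppn 1 ./namd2 ++ppn 12 +setcpuaffinity system_370k.namd > run_16node.log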
>
>
>
> We've also tried building Charm++ without an underlying MPI library (charm
> architectures net-linux-x86_64-icc-ibverbs and
> net-linux-x86_64-icc-ibverbs-smp). However, with these builds, performance
> is slower than with the mpi-linux-x86_64 builds. When we run with "+p X
> ++ppn 12" it seems like the CPU time is considerably less than wall time,
> indicating that a lot of time is spent waiting for communication. We
> understand that the SMP version funnels everything through a single
> communication thread, but it is weird that this so dramatically limits the
> scalability of the non-MPI builds of Charm++. We get somewhat
> better results from the non-SMP versions (+p 12*X), but they are still not as
> fast as mpi-linux-x86_64-smp when we scale to multiple nodes.
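>
> For comparison, a hypothetical verbs-smp launch via charmrun (nodelist path and
> config name are placeholders; here +p is the total number of worker threads and
> ++ppn the workers per process, leaving one core per node for the communication
> thread):
>
> # ./nodelist contains "group main" followed by one "host <name>" line per node
> ./charmrun ./namd2 +p176 ++ppn 11 ++nodelist ./nodelist system_370k.namd > run_16node_verbs.log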
>
>
>
> We should note that for non-CUDA (CPU only) NAMD, running with
> net-linux-x86_64-icc-ibverbs builds is substantially faster than the
> mpi-linux-x86_64 compiled versions. So it is a bit strange to us that for
> the CUDA case the situation is reversed so dramatically. We feel that we
> may not understand the optimal way to run on our new cluster. Does anyone
> have experience running on a distributed cluster where each node has a
> single GPU (as opposed to multiple GPUs per node)? Are there any
> performance tuning and optimization hints that you can share?
>
>
>
> We've tried several different sizes of systems (with 370K atoms being the
> biggest, down to 70K atoms) and we are just not seeing scalability like we
> see from the CPU-only version.
>
>
>
> Thanks!
>
> Wenchang
>
