Re: Asking help on results of our GPU benchmark

From: 周文昌 (wenchangyu2006_at_gmail.com)
Date: Tue Dec 23 2014 - 14:05:30 CST

Hi Norman,

Thanks a lot for your help. I do not think there is a network issue,
though, since CPU-only NAMD scales up to 48 nodes. If there were a fabric
problem, the CPU-only runs would not scale either, and apparently they do.
It is possible that, because a GPU is so much faster than a CPU, some
unforeseen scaling issue in the IB fabric is creeping in. It is also
possible that Intel MPI is doing the wrong thing and trying to send
messages over the gigabit network in addition to the IB fabric.
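
One way to test the gigabit-network suspicion is to compare the per-interface
byte counters before and after a benchmark run. The following is only a
minimal sketch, assuming a Linux node that exposes /proc/net/dev in the usual
layout (received bytes in the first column after the interface name,
transmitted bytes in the ninth). The function names are made up for this
example and are not part of NAMD or any MPI.

```python
def parse_proc_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, fields = line.split(":", 1)
        values = fields.split()
        # Column 0 is received bytes, column 8 is transmitted bytes.
        counters[name.strip()] = (int(values[0]), int(values[8]))
    return counters


def snapshot(path="/proc/net/dev"):
    """Read the current counters (assumes a Linux host)."""
    with open(path) as handle:
        return parse_proc_net_dev(handle.read())


def traffic_delta(before, after):
    """Bytes received and sent per interface between two snapshots."""
    return {
        name: (after[name][0] - rx, after[name][1] - tx)
        for name, (rx, tx) in before.items()
        if name in after
    }


# Hypothetical usage on one compute node:
#   before = snapshot()
#   ... run the NAMD benchmark ...
#   after = snapshot()
#   print(traffic_delta(before, after))
```

If eth0 shows a large delta during a run that is supposed to use only the IB
fabric, that would point to traffic leaking onto the gigabit network.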

Our staff here will run some InfiniBand diagnostic tools to check; I will
let you know when they have some numbers.

Thanks,

Wenchang

2014-12-23 3:34 GMT-05:00 Norman Geist <norman.geist_at_uni-greifswald.de>:

>
>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] *On
> behalf of* 周文昌
> *Sent:* Monday, 22 December 2014 19:11
> *To:* namd-l_at_ks.uiuc.edu; Norman Geist
> *Subject:* Re: namd-l: Asking help on results of our GPU benchmark
>
>
>
> From my test results, +idlepoll and PME offload do not improve the
> performance. However, when I create more patches, I get 20% better
> performance on 8 nodes, though there is no change on 16 nodes.
>
> You are probably right; I need to ask our staff to check the network. What
> else needs to be checked, other than bandwidth?
>
> The really important thing is latency, but as it is usually inversely
> proportional to the bandwidth, checking the bandwidth should point out
> what’s wrong. Could you also please describe the topology of your fat-tree,
> i.e. how many leaves there are and how many nodes per leaf? As a quick
> check of the network you could try using, for instance, 4 nodes on the
> same leaf vs. 4 nodes split up over different leaves. This should in
> practice give the same performance; if it does not, something is not
> properly set up or cabled.
>
> You might also want to enable IPoIB and use a standard network build of
> NAMD to exclude problems with ibverbs and RDMA. Also, you should really make
> sure that only the HPC network is used. That means monitoring the
> transferred data on the other networks (eth0, eth1) during your benchmark to
> check that there’s no computational traffic on them. (The easiest way is to
> run ifconfig repeatedly and look at the transferred data counts.)
>
> Norman Geist
>
> Thanks,
>
> Wenchang
>
>
>
> 2014-12-22 2:16 GMT-05:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] *On
> behalf of* 周文昌
> *Sent:* Friday, 19 December 2014 23:07
> *To:* namd-l_at_ks.uiuc.edu; Norman Geist
> *Subject:* Re: namd-l: Asking help on results of our GPU benchmark
>
>
>
> Hi Norman,
>
> Thanks for your suggestions. Among the things you suggested, twoaway[xyz]
> works for my system with 116K atoms: I got 20% better performance (but no
> difference with CPU only).
>
> Sure, this is only supposed to bring improvement when using GPUs.
>
> I also doubled the number on a system with 315K atoms. Could you explain
> how NAMD distributes patches to the GPU cores, and why there is no
> difference when using CPU only? I only have tests on 8 nodes so far and
> will continue to test on 16 and 32 nodes.
>
> I’m not sure, but I think that each patch uses the GPU to compute its
> non-bonded interactions individually.
>
> I really think that you need to look for the problem in your network. NAMD
> is known to scale very well, and on your network topology it should be
> able to do so. Use the ib_* tools that are usually present to measure your
> bandwidth. What did +idlepoll do?
>
>
>
> Thanks,
>
> Wenchang
>
>
>
> 2014-12-19 1:30 GMT-05:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
>
> *From:* 周文昌 [mailto:wenchangyu2006_at_gmail.com]
> *Sent:* Thursday, 18 December 2014 20:25
> *To:* namd-l_at_ks.uiuc.edu; Norman Geist
> *Subject:* Re: namd-l: Asking help on results of our GPU benchmark
>
>
>
> Hi Norman,
>
> Thanks for your time. We use ibverbs directly (I did mention ibverbs in
> the 4th paragraph).
>
> Ok, now I’ve seen it ^^
>
> If I run /sbin/ifconfig -a, the output is the following:
>
>
> eth0      Link encap:Ethernet HWaddr 0C:C4:7A:0F:63:F0
>           inet addr:10.1.3.1 Bcast:10.1.255.255 Mask:255.255.0.0
>           inet6 addr: fe80::ec4:7aff:fe0f:63f0/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>           RX packets:69053282 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:96176428 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:14237728569 (13.2 GiB) TX bytes:137024484424 (127.6 GiB)
>           Memory:dfa20000-dfa3ffff
>
> eth1      Link encap:Ethernet HWaddr 0C:C4:7A:0F:63:F1
>           BROADCAST MULTICAST MTU:1500 Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
>           Memory:dfa00000-dfa1ffff
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1 Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING MTU:65536 Metric:1
>           RX packets:316277 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:316277 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:82376375 (78.5 MiB) TX bytes:82376375 (78.5 MiB)
>
> Our numbers are below. I took the WallClock time at the end of each run
> (100,000 steps) instead of the "Benchmark time" in the NAMD output.
>
>
>
> Configuration   Number of nodes   WallClock time   ns/day
> CPU only                      1           4372.5      0.4
> CPU+GPU                       1           1220.4      1.4
> CPU+GPU                       4            332.6      5.2
> CPU+GPU                       8            208.2      8.3
> CPU+GPU                      16            135.2     12.8
> CPU+GPU                      32            106.3     16.3
> CPU+GPU                      48             97.5     17.7
>
>
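
To put the CPU+GPU numbers above into perspective, speedup and parallel
efficiency relative to the single-node GPU run can be computed directly from
the WallClock times. This is only an illustrative sketch: the times are
copied from the table, the names are invented for this example, and
efficiency is taken as speedup divided by the node count.

```python
# WallClock times of the CPU+GPU runs, copied from the benchmark table.
WALLCLOCK = {1: 1220.4, 4: 332.6, 8: 208.2, 16: 135.2, 32: 106.3, 48: 97.5}


def scaling(wallclock, base_nodes=1):
    """Return (nodes, speedup, efficiency) rows relative to base_nodes."""
    base = wallclock[base_nodes]
    rows = []
    for nodes in sorted(wallclock):
        speedup = base / wallclock[nodes]
        efficiency = speedup / (nodes / base_nodes)
        rows.append((nodes, speedup, efficiency))
    return rows


for nodes, speedup, efficiency in scaling(WALLCLOCK):
    print(f"{nodes:3d} nodes: speedup {speedup:5.2f}, efficiency {efficiency:6.1%}")
```

With these numbers the efficiency falls from roughly 92% on 4 nodes to
roughly 26% on 48 nodes, which is the flattening discussed in this thread.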
>
> That doesn’t look that bad, although it feels like it should be better. Have
> you benchmarked your InfiniBand bandwidth already? FDR should do better
> here, especially in a fat-tree topology. Some things you can try to
> generally improve the scaling of NAMD:
>
>
>
> 1. Generally add “+idlepoll” to namd2
>
> 2. When using GPUs, try adding “twoawayx yes” to the script; if that
> helps, try “twoawayy yes” in addition; if that helps, try “twoawayz yes”
> in addition. (This helps create more patches and so might improve the
> scalability of your system.)
>
> 3. Try turning the new PME reciprocal sum offload off/on with
> “pmeoffload no/yes” in the script.
>
>
>
> Sometimes, depending on your MPI, it might be necessary to exclude slow
> networks from the computation. As you can see, eth0 has a lot of traffic,
> so make sure that you do not use mixed networks during some of your tests.
> I only know how this would be done using Open MPI:
>
>
>
> mpirun ... --mca btl ^tcp … # this excludes all TCP networks
>
> mpirun … --mca btl openib # this includes only ibverbs
>
>
>
> Also, your CPUs support HT; do you have it enabled? (It is better to
> disable it, to prevent processes from sharing the same physical core.)
>
>
>
> Please report back on what the above changes do for you.
>
>
>
> Norman Geist.
>
>
>
>
>
> Wenchang
>
>
>
> 2014-12-18 3:39 GMT-05:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
>
> Hi,
>
>
>
> given the fact that you didn’t use the word “ibverbs” in your post, I
> suppose that you run your network traffic across IPoIB (ib0), is that
> right?
>
> If so, could you please give me the output of:
>
>
>
> cat /sys/class/net/ib0/m*
>
>
>
> I suppose it will output something like:
>
>
>
> datagram
>
> 2044
>
>
>
> But it should be:
>
>
>
> connected
>
> 65520
>
>
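
The IPoIB check described above (mode and MTU of ib0) can also be scripted.
The sketch below is an assumption-laden illustration: it takes the sysfs
files to be /sys/class/net/ib0/mode and /sys/class/net/ib0/mtu, which is what
the m* glob above would match, uses the expected values quoted in this
message, and the function names are hypothetical.

```python
from pathlib import Path


def check_ipoib(mode, mtu):
    """Return a list of warnings for the given IPoIB mode and MTU strings."""
    warnings = []
    if mode.strip() != "connected":
        warnings.append(f"mode is '{mode.strip()}', expected 'connected'")
    if int(mtu) != 65520:
        warnings.append(f"MTU is {int(mtu)}, expected 65520")
    return warnings


def read_ipoib(interface="ib0", root="/sys/class/net"):
    """Read the mode and MTU files for an IPoIB interface (assumed paths)."""
    base = Path(root) / interface
    return (base / "mode").read_text(), (base / "mtu").read_text()


# Hypothetical usage on a node that has an ib0 interface:
#   for warning in check_ipoib(*read_ipoib()):
#       print(warning)
```

An empty list means the interface already runs in connected mode with the
large MTU; this only matters when traffic actually goes over IPoIB rather
than over ibverbs directly.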
>
> Also please give the output of:
>
>
>
> /sbin/ifconfig -a
>
>
>
> Additionally, could we please see your benchmark data (time/step or
> days/ns) for the 1, 2, 4, 8 and 16 node cases?
>
>
>
> Norman Geist.
>
>
>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] *On
> behalf of* 周文昌
> *Sent:* Wednesday, 17 December 2014 22:13
> *To:* namd-l_at_ks.uiuc.edu
> *Subject:* namd-l: Asking help on results of our GPU benchmark
>
>
>
> Dear all,
>
>
>
> We are asking for help with our GPU benchmark results. If you have
> experience using GPUs, we would greatly appreciate your reading this
> (sorry for such a long letter).
>
>
> We are running NAMD on a cluster that consists of 48 nodes (dual E5-2630v2
> processors - 12 cores per node, 32 GB of RAM, and a single Tesla K20x GPU
> per node). The nodes are interconnected by a non-blocking FDR InfiniBand
> fat-tree topology. We are testing the scalability of NAMD, and are running
> into some issues.
>
>
>
> It seems that for a system of ~ 370K atoms, we are unable to scale beyond
> 16 nodes. We've tried both custom-compiling NAMD and using pre-built
> binaries (running version 2.10 in both cases). We get the best performance
> when custom compiling Charm++ and NAMD using Intel MPI version 5
> (charm-arch mpi-linux-x86_64-smp). We then run with one MPI process per
> node (-np X -ppn 1, where X is the number of nodes) and 12 threads (++ppn
> 12). However, as mentioned, we are unable to scale beyond 16 nodes.
>
>
>
> We've also tried building Charm++ without an underlying MPI library (charm
> architectures net-linux-x86_64-icc-ibverbs and
> net-linux-x86_64-icc-ibverbs-smp). However, with these builds, performance
> is slower than with the mpi-linux-x86_64 builds. When we run with "+p X
> ++ppn 12" it seems like the CPU time is considerably less than wall time,
> indicating that a lot of time is spent waiting for communication. We
> understand that the SMP version funnels everything through a single
> communication thread, but it is weird that this so dramatically limits the
> scalability of the non-MPI built versions of Charm++. We get somewhat
> better results from the non-SMP versions (+p 12*X), but they are still not
> as fast as the mpi-linux-x86_64-smp build when we scale to multiple nodes.
>
>
>
> We should note that for non-CUDA (CPU only) NAMD, running with
> net-linux-x86_64-icc-ibverbs builds is substantially faster than the
> mpi-linux-x86_64 compiled versions. So it is a bit strange to us that for
> the CUDA case the situation is reversed so dramatically. We feel that we
> may not understand the optimal way to run on our new cluster. Does anyone
> have experience running on a distributed cluster where each node has a
> single GPU (as opposed to multiple GPUs per node)? Are there any
> performance tuning and optimization hints that you can share?
>
>
>
> We've tried several different sizes of systems (with 370K atoms being the
> biggest, down to 70K atoms) and we are just not seeing scalability like we
> see from the CPU-only version.
>
>
>
> Thanks!
>
> Wenchang
>
>
>

This archive was generated by hypermail 2.1.6 : Wed Dec 31 2014 - 23:23:09 CST