Re: Asking help on results of our GPU benchmark

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Mon Jan 05 2015 - 10:36:24 CST

the fundamental problem in this discussion is the assumption that "speedup"
and "scaling" are independent properties. they are not. also, there is a
mixup between network throughput and network latency. both have an impact,
but it differs between run scenarios.

in general a code scales out (i.e. the wallclock time doesn't decrease any
further when adding more processes) when the communication overhead becomes
significant compared to the time it takes to compute the individual
(parallel) tasks (or "work units"). this refers to what is usually called
_strong scaling_ (i.e. parallel scaling for a fixed-size problem).
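
as a rough sketch of why this happens: the time per step on n nodes behaves
approximately like

  t(n) ~ t_work/n + t_comm

where t_comm is nearly constant for a fixed-size problem, because it is
dominated by latency rather than bandwidth. once t_work/n drops below
t_comm, adding nodes no longer reduces the wallclock time. GPU acceleration
shrinks t_work (by the 3-5x mentioned further down in this thread), so this
crossover is reached at correspondingly fewer nodes. this is only a
back-of-the-envelope model, not anything specific to NAMD.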

the following factors can have an impact:
- the more processors are used, the fewer work units per processor are
available, thus the impact of communication overhead increases
- with GPU acceleration the impact of communication overhead increases
- GPUs require *many* more work units to be utilized efficiently, thus the
more nodes are used, the smaller the GPU acceleration becomes once the
number of work units per GPU drops below a critical number
- CPUs and GPUs on multi-core nodes compete for limited memory and bus
bandwidth, thus the more processes per node, the more communication overhead
- parallel performance in strong scaling is almost exclusively determined
by communication latency and *much* less by bandwidth. this is even more
true for classical MD, where the amount of data per message is small
- since there is only one network link per node, the effective communication
latency grows significantly with multiple processes per node
- GPU acceleration for PME is less efficient than for non-bonded
interactions; for bonded interactions CPUs are often faster than GPUs
(there are not enough concurrent work units)
- simple classical force fields like CHARMM are in general not very
demanding in terms of computation and thus don't generate many work units
unless you have a very large number of atoms
- the choice of real space cutoff has a significant impact since a longer
cutoff creates more work units, thus the crossover point where GPUs become
less efficient changes.

if you put this all together, the conclusions are:
- GPUs are most efficiently used in classical MD codes on fewer nodes
- it is quite possible that a CPU-only calculation scales to more nodes
than a GPU accelerated calculation
- it is also quite possible that a CPU-only calculation runs faster than
a GPU accelerated calculation, each at its strong-scaling scale-out point,
since the CPU-only calculation has less overhead.
- because so many different factors have an impact, there are no simple "do
this not that" rules.

On Mon, Jan 5, 2015 at 1:37 AM, Norman Geist <norman.geist_at_uni-greifswald.de
> wrote:

>
>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] *On
> Behalf Of *周文昌
> *Sent:* Tuesday, December 23, 2014 21:06
> *To:* namd-l_at_ks.uiuc.edu; Norman Geist
> *Subject:* Re: namd-l: Asking help on results of our GPU benchmark
>
>
>
> Hi Norman,
>
> Thanks a lot for your help, though I do not think there is a network issue
> since NAMD with CPU only scales up to 48 nodes. We could identify the
>
> This conclusion is wrong. In practice the GPUs speed up the subscribing
> CPUs by 3 to 5 times, compared to the CPU-only case. This raises the
> computing power of the network endpoints and therefore increases the
> requirements on the network significantly. You CAN'T compare the CPU-only
> case against the GPU case in that way.
>
> fabric problem if it does not scale. But apparently there isn't. It is
> possible that because using a GPU is so much faster than a CPU, there's
> some unforeseen scaling issue in the IB fabric that is creeping in. It's
> also possible that Intel MPI is doing the wrong thing and trying to send
> messages over the gigabit network in addition to the IB fabric.
>
> Infiniband is a high performance network and shouldn't have such
> "unforeseen scaling issues". Making sure that only the Infiniband is used
> can usually be checked with options to mpirun. As already said, I only know
> how it would be done using OpenMPI, but I would bet there are similar
> options for the MPI you are using.
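>
> For Intel MPI, a minimal sketch could look like the following (variable
> names as I remember them from the Intel MPI reference manual; please verify
> them for your version before relying on this):
>
> export I_MPI_FABRICS=shm:dapl   # or shm:ofa; restrict Intel MPI to the InfiniBand fabric
> export I_MPI_DEBUG=5            # print at startup which fabric was actually selected
> mpirun ...                      # then launch namd2 as usual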
>
> Our staff here will run some InfiniBand diagnostic tools to check; I will
> let you know when they have some numbers.
>
> Good luck
>
> Norman Geist
>
> Thanks,
>
> Wenchang
>
>
>
> 2014-12-23 3:34 GMT-05:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
>
>
>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] *On
> Behalf Of *周文昌
> *Sent:* Monday, December 22, 2014 19:11
> *To:* namd-l_at_ks.uiuc.edu; Norman Geist
> *Subject:* Re: namd-l: Asking help on results of our GPU benchmark
>
>
>
> From my test results, +idlepoll and PME offload do not improve the
> performance. However, when I create more patches, I get about 20% better
> performance on 8 nodes, though no change when I run on 16 nodes.
>
> Probably you're right, I need to ask our staff to check the network. What
> other things need to be checked, other than bandwidth?
>
> The really important thing is latency, but as it is usually inversely
> proportional to the bandwidth, checking the bandwidth should point out
> what's wrong. Could you also please describe the topology of your fat-tree,
> i.e. how many leaf switches and how many nodes per leaf? As a quick check of
> the network you could try using e.g. 4 nodes on the same leaf vs. 4 nodes
> split up over different leafs. In practice this should give the same
> performance; if it does not, something is not properly set up or cabled.
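>
> To find out which leaf switch a given node hangs off of, something along
> these lines usually works (a sketch, assuming the standard infiniband-diags
> tools are installed; node001 is a placeholder hostname):
>
> ibnetdiscover -p | grep node001   # one line per link; shows which switch port node001 is cabled to
>
> Then pick 4 hosts listed under the same switch and 4 hosts spread over
> different switches for the comparison runs.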
>
> You might also want to enable IPoIB and use a standard network build of
> NAMD to exclude problems with ibverbs and RDMA. Also, you should really make
> sure that only the hpc network is used, which means monitoring the
> transferred data on the other networks (eth0, eth1) during your benchmark to
> check that there's no computational traffic on them. (The easiest way is to
> run ifconfig frequently and have a look at the transferred data counts.)
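>
> A simple sketch of how to watch those counters during a run (interface
> names taken from your ifconfig output quoted below):
>
> while true; do
>     date
>     /sbin/ifconfig eth0 | grep bytes   # RX/TX byte counts should stay nearly flat
>     /sbin/ifconfig eth1 | grep bytes
>     sleep 30
> done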
>
> Norman Geist
>
> Thanks,
>
> Wenchang
>
>
>
> 2014-12-22 2:16 GMT-05:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] *On
> Behalf Of *周文昌
> *Sent:* Friday, December 19, 2014 23:07
> *To:* namd-l_at_ks.uiuc.edu; Norman Geist
> *Subject:* Re: namd-l: Asking help on results of our GPU benchmark
>
>
>
> Hi Norman,
>
> Thanks for your suggestions. Among those things you suggested, the
> twoaway[xyz] works for my system with 116K atoms; I got 20% better (but no
>
> Sure, this is only supposed to bring an improvement when using GPUs.
>
> difference with CPU only). I also doubled the number on a system with 315K
> atoms. Could you explain how NAMD assigns patches to GPU cores, and why
> there is no difference using CPU only? I only have the tests on 8 nodes so
> far, and will continue to test on 16 and 32 nodes.
>
> I’m not sure but I think that each patch uses the GPU to compute its
> non-bonded stuff individually.
>
> I really think that you need to look for the problem on your network. NAMD
> is known to scale quite excellently, and on your network topology it should
> be able to do so. Use the ib_* tools that are usually present to measure
> your bandwidth. What did +idlepoll do?
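>
> A quick point-to-point test with the perftest tools could look like this
> (a sketch; tool names as shipped with common OFED installations, hostnames
> are placeholders):
>
> nodeA$ ib_write_bw            # bandwidth test, server side
> nodeB$ ib_write_bw nodeA      # bandwidth test, client side
>
> nodeA$ ib_write_lat           # latency test, server side
> nodeB$ ib_write_lat nodeA     # latency test, client side
>
> As a rough expectation, FDR links should deliver on the order of 6 GB/s and
> latencies around a microsecond. Compare pairs of nodes on the same leaf
> against pairs on different leafs; the numbers should be nearly identical.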
>
>
>
> Thanks,
>
> Wenchang
>
>
>
> 2014-12-19 1:30 GMT-05:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
>
> *From:* 周文昌 [mailto:wenchangyu2006_at_gmail.com]
> *Sent:* Thursday, December 18, 2014 20:25
> *To:* namd-l_at_ks.uiuc.edu; Norman Geist
> *Subject:* Re: namd-l: Asking help on results of our GPU benchmark
>
>
>
> Hi Norman,
>
> Thanks for your time. We use ibverbs directly (I did mention ibverbs in
> the 4th paragraph).
>
> Ok, now I’ve seen it ^^
>
> If I do /sbin/ifconfig -a, the output is the following:
>
>
> eth0 Link encap:Ethernet HWaddr 0C:C4:7A:0F:63:F0
> inet addr:10.1.3.1 Bcast:10.1.255.255 Mask:255.255.0.0
> inet6 addr: fe80::ec4:7aff:fe0f:63f0/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:69053282 errors:0 dropped:0 overruns:0 frame:0
> TX packets:96176428 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:14237728569 (13.2 GiB) TX bytes:137024484424 (127.6
> GiB)
> Memory:dfa20000-dfa3ffff
>
> eth1 Link encap:Ethernet HWaddr 0C:C4:7A:0F:63:F1
> BROADCAST MULTICAST MTU:1500 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
> Memory:dfa00000-dfa1ffff
>
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> inet6 addr: ::1/128 Scope:Host
> UP LOOPBACK RUNNING MTU:65536 Metric:1
> RX packets:316277 errors:0 dropped:0 overruns:0 frame:0
> TX packets:316277 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:82376375 (78.5 MiB) TX bytes:82376375 (78.5 MiB)
>
> Our numbers are below, I took the WallClock time at the end of each run
> (100,000 steps), instead of from "Benchmark time" in the NAMD output
>
>
>
> Run        Number of nodes   WallClock time   ns/day
> CPU only                 1           4372.5      0.4
> CPU+GPU                  1           1220.4      1.4
> CPU+GPU                  4            332.6      5.2
> CPU+GPU                  8            208.2      8.3
> CPU+GPU                 16            135.2     12.8
> CPU+GPU                 32            106.3     16.3
> CPU+GPU                 48             97.5     17.7
>
>
>
> Doesn't look that bad, although it feels like it should be better. Have
> you benchmarked your infiniband bandwidth already? FDR should do better
> here, especially in a fat-tree topology. Some things you can try to
> generally improve the scaling of namd:
>
>
>
> 1. Generally add “+idlepoll” to namd2
>
> 2. When using GPUs try adding "twoawayx yes" to the config script; if
> that helps, try "twoawayy yes" in addition, and if that helps, try
> "twoawayz yes" as well. (This helps creating more patches and so might
> improve the scalability of your system.)
>
> 3. Try turning the new pme reciprocal sum offload off/on by
> "pmeoffload no/yes" in the config script. (A short sketch of these
> settings follows below.)
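>
> Put together, a minimal sketch of such a test run (keywords as named
> above; the launch line assumes the Intel MPI smp build from your original
> post below and a placeholder config file name):
>
> # added to the NAMD config script:
> twoAwayX    yes
> #twoAwayY   yes       ;# enable only if twoAwayX alone already helped
> #PMEoffload yes       ;# toggle off/on and compare
>
> # launch: 8 nodes, 1 process per node, 12 threads each
> mpirun -np 8 -ppn 1 namd2 ++ppn 12 +idlepoll your_config.namd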
>
>
>
> Sometimes, depending on your MPI, it might be necessary to exclude slow
> networks from the computation. As you can see, eth0 has a lot of traffic, so
> make sure that you do not use mixed networks during any of your tests. I
> only know how it would be done using openmpi:
>
>
>
> mpirun ... --mca btl ^tcp ...              # this excludes all tcp transports
>
> mpirun ... --mca btl openib,self,sm ...    # this allows only ibverbs (plus local shared memory and self)
>
>
>
> Also, your CPUs support HT; do you have it enabled? (It should better be
> disabled, to prevent processes from sharing the same physical core.)
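>
> A quick way to check whether HT is active (a sketch; the output wording may
> differ slightly between distributions):
>
> lscpu | grep -i 'thread(s) per core'   # "2" means HT is enabled, "1" means it is off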
>
>
>
> Please report back on what above changes will do for you.
>
>
>
> Norman Geist.
>
>
>
>
>
> Wenchang
>
>
>
> 2014-12-18 3:39 GMT-05:00 Norman Geist <norman.geist_at_uni-greifswald.de>:
>
> Hi,
>
>
>
> given the fact that you didn’t use the word “ibverbs” in your post, I
> suppose that you run your network traffic across IPoIB (ib0), is that
> right?
>
> If so, could you please give me the output of:
>
>
>
> cat /sys/class/net/ib0/m*
>
>
>
> I suppose it will output something like:
>
>
>
> datagram
>
> 2044
>
>
>
> But it should be:
>
>
>
> connected
>
> 65520
>
>
>
> Also please give the output of:
>
>
>
> /sbin/ifconfig -a
>
>
>
> Additionally, could we please see your benchmark data (time/step or
> days/ns) for the 1, 2, 4, 8, 16 node cases?
>
>
>
> Norman Geist.
>
>
>
> *From:* owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] *On
> Behalf Of *周文昌
> *Sent:* Wednesday, December 17, 2014 22:13
> *To:* namd-l_at_ks.uiuc.edu
> *Subject:* namd-l: Asking help on results of our GPU benchmark
>
>
>
> Dear all,
>
>
>
> We are asking for help here concerning our GPU benchmark results; we would
> greatly appreciate your reading this (sorry for such a long letter) if you
> have experience with using GPUs.
>
>
> We are running NAMD on a cluster that consists of 48 nodes (dual E5-2630v2
> processors - 12 cores per node, 32 GB of RAM, and a single Tesla K20x GPU
> per node). The nodes are interconnected by a non-blocking FDR InfiniBand
> fat-tree topology. We are testing the scalability of NAMD, and are running
> into some issues.
>
>
>
> It seems that for a system of ~ 370K atoms, we are unable to scale beyond
> 16 nodes. We've tried both custom-compiling NAMD and using pre-built
> binaries (running version 2.10 in both cases). We get the best performance
> when custom compiling Charm++ and NAMD using Intel MPI version 5
> (charm-arch mpi-linux-x86_64-smp). We then run with one MPI process per
> node (-np X -ppn 1, where X is the number of nodes) and 12 threads (++ppn
> 12). However, as mentioned, we are unable to scale beyond 16 nodes.
>
>
>
> We've also tried building Charm++ without an underlying MPI library (charm
> architectures net-linux-x86_64-icc-ibverbs and
> net-linux-x86_64-icc-ibverbs-smp). However, with these builds, performance
> is slower than with the mpi-linux-x86_64 builds. When we run with "+p X
> ++ppn 12" it seems like the CPU time is considerably less than wall time,
> indicating that a lot of time is spent waiting for communication. We
> understand that the SMP version funnels everything through a single
> communication thread, but it is weird that this so dramatically limits the
> scalability of the non-MPI built versions of Charm++. We get somewhat
> better results from the non-SMP versions (+p 12*X), but it is still not as
> fast as the mpi-linux-x86_64-smp build when we scale to multiple nodes.
>
>
>
> We should note that for non-CUDA (CPU only) NAMD, running with
> net-linux-x86_64-icc-ibverbs builds is substantially faster than the
> mpi-linux-x86_64 compiled versions. So it is a bit strange to us that for
> the CUDA case the situation is reversed so dramatically. We feel that we
> may not understand the optimal way to run on our new cluster. Does anyone
> have experience running on a distributed cluster where each node has a
> single GPU (as opposed to multiple GPUs per node)? Are there any
> performance tuning and optimization hints that you can share?
>
>
>
> We've tried several different sizes of systems (with 370K atoms being the
> biggest, down to 70K atoms) and we are just not seeing scalability like we
> see from the CPU-only version.
>
>
>
> Thanks!
>
> Wenchang
>
>
>
>
>

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste. Italy.
