Re: Asking help on results of our GPU benchmark

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Dec 19 2014 - 00:30:22 CST

From: 周文昌 [mailto:wenchangyu2006_at_gmail.com]
Sent: Thursday, December 18, 2014 20:25
To: namd-l_at_ks.uiuc.edu; Norman Geist
Subject: Re: namd-l: Asking help on results of our GPU benchmark

 

Hi Norman,

Thanks for your time. We use ibverbs directly (I did mention ibverbs in the 4th paragraph).

Ok, now I’ve seen it ^^

If I do /sbin/ifconfig -a, the output is the following:

eth0 Link encap:Ethernet HWaddr 0C:C4:7A:0F:63:F0
          inet addr:10.1.3.1 Bcast:10.1.255.255 Mask:255.255.0.0
          inet6 addr: fe80::ec4:7aff:fe0f:63f0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:69053282 errors:0 dropped:0 overruns:0 frame:0
          TX packets:96176428 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:14237728569 (13.2 GiB) TX bytes:137024484424 (127.6 GiB)
          Memory:dfa20000-dfa3ffff

eth1 Link encap:Ethernet HWaddr 0C:C4:7A:0F:63:F1
          BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
          Memory:dfa00000-dfa1ffff

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:316277 errors:0 dropped:0 overruns:0 frame:0
          TX packets:316277 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:82376375 (78.5 MiB) TX bytes:82376375 (78.5 MiB)

Our numbers are below. I took the WallClock time at the end of each run (100,000 steps) rather than the "Benchmark time" lines in the NAMD output:

 

          Number of nodes   WallClock time (s)   ns/day
CPU only          1               4372.5            0.4
CPU+GPU           1               1220.4            1.4
CPU+GPU           4                332.6            5.2
CPU+GPU           8                208.2            8.3
CPU+GPU          16                135.2           12.8
CPU+GPU          32                106.3           16.3
CPU+GPU          48                 97.5           17.7

 

That doesn't look too bad, although it feels like it should be better. Have you benchmarked your InfiniBand bandwidth already? FDR should do better here, especially in a fat-tree topology. Some things you can try to generally improve the scaling of NAMD (a combined example follows the list):

 

1. Generally add "+idlepoll" to the namd2 command line.

2. When using GPUs, try adding "twoAwayX yes" to the NAMD configuration file; if that helps, additionally try "twoAwayY yes", and if that helps too, additionally try "twoAwayZ yes". (This creates more patches and so might improve the scalability of your system.)

3. Try turning the new PME reciprocal sum offload off/on with "PMEOffload no/yes" in the configuration file.
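For example, the combined changes might look like this (a minimal sketch from my side; 16 nodes and the file name your_config.namd are just placeholders for whatever you actually use). In the configuration file:

twoAwayX      yes   ;# add twoAwayY / twoAwayZ on top only if each previous step helped
PMEOffload    no    ;# or yes, compare both settings

and on the command line (here with an Intel MPI SMP build, one process per node, 12 threads each):

mpirun -np 16 -ppn 1 namd2 ++ppn 12 +idlepoll your_config.namd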

 

Sometimes, depending on your MPI, it might be necessary to exclude slow networks from the computation. As you can see, eth0 carries a lot of traffic, so make sure that you do not accidentally use mixed networks during some of your tests. I only know how this would be done with Open MPI:

 

mpirun … --mca btl ^tcp … # this excludes all TCP networks

mpirun … --mca btl openib … # this includes only ibverbs
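One caveat from my side: when whitelisting BTLs in Open MPI, you normally also have to list the "self" (process loopback) component, otherwise ranks cannot send messages to themselves:

mpirun … --mca btl openib,self … # ibverbs plus the required loopback component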

 

Also, your CPUs support Hyper-Threading; do you have it enabled? (It is better to disable it, to prevent processes from sharing the same physical core.)
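If you are not sure, a quick way to check from the shell (assuming lscpu is available) is:

lscpu | grep -i 'thread(s) per core' # 2 means Hyper-Threading is on, 1 means it is off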

 

Please report back on what the above changes do for you.

 

Norman Geist.

 

 

Wenchang

 

2014-12-18 3:39 GMT-05:00 Norman Geist <norman.geist_at_uni-greifswald.de>:

Hi,

 

Given that you didn't use the word "ibverbs" in your post, I suppose you run your network traffic across IPoIB (ib0), is that right?

If so, could you please give me the output of:

 

cat /sys/class/net/ib0/m*

 

I suppose it will output something like:

 

datagram

2044

 

But it should be:

 

connected

65520
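If it does report datagram mode, switching is usually just a matter of (as root, assuming the interface is indeed ib0; note that distribution network scripts may reset this at boot):

echo connected > /sys/class/net/ib0/mode
/sbin/ifconfig ib0 mtu 65520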

 

Also please give the output of:

 

/sbin/ifconfig -a

 

Additionally, could we please see your benchmark data (time/step or days/ns) for the 1, 2, 4, 8 and 16 node cases?

 

Norman Geist.

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of 周文昌
Sent: Wednesday, December 17, 2014 22:13
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: Asking help on results of our GPU benchmark

 

Dear all,

 

We are asking for help here concerning our GPU benchmark results; if you have experience with GPUs, we would greatly appreciate your reading this (sorry for such a long letter).

We are running NAMD on a cluster that consists of 48 nodes (dual E5-2630v2 processors - 12 cores per node, 32 GB of RAM, and a single Tesla K20x GPU per node). The nodes are interconnected by a non-blocking FDR InfiniBand fat-tree topology. We are testing the scalability of NAMD, and are running into some issues.

 

It seems that for a system of ~370K atoms, we are unable to scale beyond 16 nodes. We've tried both custom-compiling NAMD and using pre-built binaries (running version 2.10 in both cases). We get the best performance when custom-compiling Charm++ and NAMD with Intel MPI version 5 (charm-arch mpi-linux-x86_64-smp). We then run with one MPI process per node (-np X -ppn 1, where X is the number of nodes) and 12 threads per process (++ppn 12). However, as mentioned, we are unable to scale beyond 16 nodes.

 

We've also tried building Charm++ without an underlying MPI library (charm architectures net-linux-x86_64-icc-ibverbs and net-linux-x86_64-icc-ibverbs-smp). However, with these builds, performance is slower than with the mpi-linux-x86_64 builds. When we run with "+p X ++ppn 12" it seems that the CPU time is considerably less than the wall time, indicating that a lot of time is spent waiting for communication. We understand that the SMP version funnels everything through a single communication thread, but it is strange that this so dramatically limits the scalability of the non-MPI builds of Charm++. We get somewhat better results from the non-SMP versions (+p 12*X), but they are still not as fast as the mpi-linux-x86_64-smp build when we scale to multiple nodes.
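(For reference, the corresponding ibverbs launch lines look roughly like the following, where X is the number of nodes, nodelist is a machine file, and your_config.namd stands for the actual input file; the exact charmrun flags may differ slightly from what is used in practice:)

charmrun namd2 ++nodelist nodelist +p X ++ppn 12 your_config.namd # ibverbs-smp build
charmrun namd2 ++nodelist nodelist +p 12*X your_config.namd # plain ibverbs (non-SMP) build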

 

We should note that for non-CUDA (CPU only) NAMD, running with net-linux-x86_64-icc-ibverbs builds is substantially faster than the mpi-linux-x86_64 compiled versions. So it is a bit strange to us that for the CUDA case the situation is reversed so dramatically. We feel that we may not understand the optimal way to run on our new cluster. Does anyone have experience running on a distributed cluster where each node has a single GPU (as opposed to multiple GPUs per node)? Are there any performance tuning and optimization hints that you can share?

 

We've tried several different sizes of systems (with 370K atoms being the biggest, down to 70K atoms) and we are just not seeing scalability like we see from the CPU-only version.

 

Thanks!

Wenchang
