Re: Asking help on results of our GPU benchmark

From: Norman Geist
Date: Mon Jan 05 2015 - 00:37:40 CST


From: [] On behalf of ???
Sent: Tuesday, 23 December 2014 21:06
To:; Norman Geist
Subject: Re: namd-l: Asking help on results of our GPU benchmark


Hi Norman,

Thanks a lot for your help, though I do not think there is a network issue, since CPU-only NAMD scales up to 48 nodes.

This conclusion is wrong. In practice, a GPU speeds up the CPUs it serves by a factor of 3 to 5 compared to the CPU-only case. This raises the computing power of each network endpoint and therefore significantly increases the demands on the network. You CAN'T compare the CPU-only case against the GPU case in that way.

We could identify the fabric problem if it did not scale, but apparently there isn't one. It is possible that, because using a GPU is so much faster than a CPU, some unforeseen scaling issue in the IB fabric is creeping in. It's also possible that Intel MPI is doing the wrong thing and trying to send messages over the gigabit network in addition to the IB fabric.

InfiniBand is a high-performance network and shouldn’t have such “unforeseen scaling issues”. Whether only InfiniBand is used can usually be enforced and tested with options to mpirun. As already said, I only know how it would be done using Open MPI, but I would bet there are similar options for the MPI you are using.
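
For Intel MPI, the fabric is usually selected through environment variables rather than mpirun flags. A minimal sketch, assuming Intel MPI 5.x and its I_MPI_FABRICS, I_MPI_FALLBACK and I_MPI_DEBUG variables (please check the reference manual of the installed version; the launch line in the last comment is only a placeholder):

```shell
# Assumption: Intel MPI 5.x. Use shared memory inside a node and
# OFA verbs (InfiniBand) between nodes; "shm:dapl" is the DAPL alternative.
export I_MPI_FABRICS=shm:ofa

# Assumption: with fallback disabled, the job should abort instead of
# silently dropping to TCP (the gigabit network) if the fabric is unusable.
export I_MPI_FALLBACK=0

# Ask the library to report the selected fabric at startup.
export I_MPI_DEBUG=5

# Then launch as before, for example (placeholder command line):
#   mpirun -np X -ppn 1 ...
```

If the startup output still mentions tcp, or the byte counters on eth0 keep growing during the run, the traffic is not staying on InfiniBand.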

Our staff here will run some InfiniBand diagnostic tools to check; I will let you know when they have some numbers.

Good luck

Norman Geist




2014-12-23 3:34 GMT-05:00 Norman Geist:


From: [] On behalf of ???
Sent: Monday, 22 December 2014 19:11
To:; Norman Geist
Subject: Re: namd-l: Asking help on results of our GPU benchmark


From my test results, +idlepoll and PME offload do not improve the performance. However, when I create more patches, I get 20% better performance on 8 nodes, though there is no change when I run on 16 nodes.

You're probably right; I need to ask our staff to check the network. What else needs to be checked, other than bandwidth?

The really important thing is latency, but as it is usually inversely proportional to the bandwidth, checking the bandwidth should point out what’s wrong. Could you also please describe the topology of your fat tree, i.e. how many leaf switches there are and how many nodes per leaf? As a quick check of the network, you could try using, for instance, 4 nodes on the same leaf vs. 4 nodes split over different leaves. In practice this should give the same performance; if it does not, something is not properly set up or cabled.
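
The same-leaf vs. cross-leaf check boils down to comparing two timings. A minimal sketch, assuming the s/step value of each run is noted by hand; the function name and the numbers in the usage comment are made up for illustration:

```shell
# Percentage by which the cross-leaf run is slower than the same-leaf run.
# $1 = s/step with 4 nodes on one leaf
# $2 = s/step with 4 nodes split over different leaves
leaf_gap_percent() {
    awk -v same="$1" -v cross="$2" \
        'BEGIN { printf "%.1f\n", (cross - same) / same * 100 }'
}

# Usage (hypothetical timings):
#   leaf_gap_percent 0.020 0.021
```

A gap well beyond the normal run-to-run variation would point at the fabric setup or cabling rather than at NAMD.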

You might also want to enable IPoIB and use a standard network build of NAMD to rule out problems with ibverbs and RDMA. Also, you should really make sure that only the HPC network is used. That means monitoring the transferred data on the other networks (eth0, eth1) during your benchmark to check that there’s no computational traffic on them. (The easiest way is to run ifconfig repeatedly and look at the transferred byte counts.)
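
Instead of reading ifconfig by eye, the byte counters can be snapshotted before and after a run. A minimal sketch, assuming a Linux node that exposes the counters under /sys/class/net/<interface>/statistics; the function names are made up here, and the second argument of iface_bytes exists only so the directory can be overridden:

```shell
# Print "rx_bytes tx_bytes" for one interface.
# $1 = interface name (eth0, eth1, ib0, ...)
# $2 = sysfs net directory (optional, defaults to /sys/class/net)
iface_bytes() {
    dir="${2:-/sys/class/net}/$1/statistics"
    printf '%s %s\n' "$(cat "$dir/rx_bytes")" "$(cat "$dir/tx_bytes")"
}

# Print the bytes received and sent between two snapshots.
# $1 = "rx tx" before the run, $2 = "rx tx" after the run
traffic_delta() {
    # Split both snapshots into four positional parameters.
    set -- $1 $2
    echo "$(( $3 - $1 )) $(( $4 - $2 ))"
}

# Usage:
#   before=$(iface_bytes eth0)
#   ... run the NAMD benchmark ...
#   after=$(iface_bytes eth0)
#   traffic_delta "$before" "$after"
```

Some background traffic on eth0 (NFS, ssh, job scheduler) is to be expected; a delta that grows with the length of the benchmark is the suspicious case.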

Norman Geist




2014-12-22 2:16 GMT-05:00 Norman Geist:

From: [] On behalf of ???
Sent: Friday, 19 December 2014 23:07
To:; Norman Geist
Subject: Re: namd-l: Asking help on results of our GPU benchmark


Hi Norman,

Thanks for your suggestions. Among the things you suggested, twoaway[xyz] works for my system with 116K atoms: I got 20% better performance (but no difference with CPU only).

Sure, this is only supposed to bring improvement when using GPUs.

I also doubled the number on a system with 315K atoms. Could you explain how NAMD distributes patches to the GPU cores, and why there is no difference with CPU only? I only have tests on 8 nodes so far and will continue testing on 16 and 32 nodes.

I’m not sure, but I think that each patch uses the GPU to compute its non-bonded interactions individually.

I really think that you need to look for the problem in your network. NAMD is known to scale very well, and on your network topology it should be able to do so. Use the ib_* tools that are usually present to measure your bandwidth. What did +idlepoll do?





2014-12-19 1:30 GMT-05:00 Norman Geist:

From: 周文昌 []
Sent: Thursday, 18 December 2014 20:25
To:; Norman Geist
Subject: Re: namd-l: Asking help on results of our GPU benchmark


Hi Norman,

Thanks for your time. We use ibverbs directly (I did mention ibverbs in the 4th paragraph).

Ok, now I’ve seen it ^^

If I run /sbin/ifconfig -a, the output is as follows:

eth0 Link encap:Ethernet HWaddr 0C:C4:7A:0F:63:F0
          inet addr: Bcast: Mask:
          inet6 addr: fe80::ec4:7aff:fe0f:63f0/64 Scope:Link
          RX packets:69053282 errors:0 dropped:0 overruns:0 frame:0
          TX packets:96176428 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:14237728569 (13.2 GiB) TX bytes:137024484424 (127.6 GiB)

eth1 Link encap:Ethernet HWaddr 0C:C4:7A:0F:63:F1
          BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

lo Link encap:Local Loopback
          inet addr: Mask:
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:316277 errors:0 dropped:0 overruns:0 frame:0
          TX packets:316277 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:82376375 (78.5 MiB) TX bytes:82376375 (78.5 MiB)

Our numbers are below. I took the WallClock time at the end of each run (100,000 steps) instead of the "Benchmark time" from the NAMD output.


                                Number of nodes WallClock time ns/day

CPU only


That doesn’t look too bad, although it feels like it should be better. Have you already benchmarked your InfiniBand bandwidth? FDR should do better here, especially in a fat-tree topology. Some things you can try to generally improve the scaling of NAMD:


1. Add “+idlepoll” to the namd2 command line.

2. When using GPUs, try adding “twoawayx yes” to the script. If that helps, also try “twoawayy yes”, and if that helps, also try “twoawayz yes”. (This creates more patches and so might improve the scalability of your system.)

3. Try turning the new PME reciprocal sum offload off and on with “pmeoffload no/yes” in the script.


Sometimes, depending on your MPI, it might be necessary to exclude slow networks from the computation. As you can see, eth0 carries a lot of traffic, so make sure that you do not use mixed networks during your tests. I only know how it would be done using Open MPI:


mpirun ... --mca btl ^tcp ... # this excludes all TCP networks

mpirun ... --mca btl openib,sm,self ... # this uses only ibverbs between nodes (sm and self cover on-node and loopback traffic)


Also, your CPUs support HT (Hyper-Threading); do you have it enabled? (It is better to disable it, to prevent processes from sharing the same physical core.)
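
Whether HT is active can be read from the "Thread(s) per core" line of lscpu. A minimal sketch, assuming lscpu from util-linux is available on the compute nodes; the function names are made up here:

```shell
# Read `lscpu` output on stdin and print the threads-per-core value.
threads_per_core() {
    awk -F: '/^Thread\(s\) per core/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Succeed (exit status 0) if more than one hardware thread per core is reported.
ht_enabled() {
    [ "$(threads_per_core)" -gt 1 ]
}

# Usage:
#   lscpu | ht_enabled && echo "Hyper-Threading is on"
```

On a dual E5-2630v2 node, a value of 2 would mean 24 logical CPUs for 12 physical cores.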


Please report back on what the above changes do for you.


Norman Geist.





2014-12-18 3:39 GMT-05:00 Norman Geist:



Given that you didn’t use the word “ibverbs” in your post, I suppose that you run your network traffic across IPoIB (ib0). Is that right?

If so, could you please give me the output of:


cat /sys/class/net/ib0/m*


I suppose it will output something like:





But it should be:





Also please give the output of:


/sbin/ifconfig -a


Additionally, could we please see your benchmark data (time/step or days/ns) for the 1, 2, 4, 8 and 16 node cases?
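
To compare the different node counts, wall clock times can be converted to ns/day and to a parallel efficiency. A minimal sketch; the function names are made up here, and the timestep has to be supplied by hand because it is not stated in this thread:

```shell
# ns/day of a finished run.
# $1 = number of steps, $2 = timestep in fs, $3 = wall clock time in seconds
ns_per_day() {
    awk -v steps="$1" -v dt="$2" -v wall="$3" \
        'BEGIN { printf "%.2f\n", steps * dt * 86400 / (wall * 1000000) }'
}

# Parallel efficiency relative to a baseline run (1.00 = ideal scaling).
# $1 = baseline node count, $2 = baseline wall clock time
# $3 = node count,          $4 = wall clock time
scaling_efficiency() {
    awk -v n0="$1" -v t0="$2" -v n="$3" -v t="$4" \
        'BEGIN { printf "%.2f\n", (n0 * t0) / (n * t) }'
}

# Usage (hypothetical numbers, assuming a 2 fs timestep):
#   ns_per_day 100000 2 8640
#   scaling_efficiency 8 1000 16 625
```

An efficiency that drops sharply from one node count to the next shows where scaling stops more clearly than the raw times do.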


Norman Geist.


From: [] On behalf of ???
Sent: Wednesday, 17 December 2014 22:13
Subject: namd-l: Asking help on results of our GPU benchmark


Dear all,


We are asking for help with our GPU benchmark results. If you have experience using GPUs, we would greatly appreciate your reading this (sorry for such a long letter).

We are running NAMD on a cluster that consists of 48 nodes (dual E5-2630v2 processors - 12 cores per node, 32 GB of RAM, and a single Tesla K20x GPU per node). The nodes are interconnected by a non-blocking FDR InfiniBand fat-tree topology. We are testing the scalability of NAMD, and are running into some issues.


It seems that for a system of ~370K atoms, we are unable to scale beyond 16 nodes. We've tried both custom-compiling NAMD and using pre-built binaries (running version 2.10 in both cases). We get the best performance when custom-compiling Charm++ and NAMD using Intel MPI version 5 (charm-arch mpi-linux-x86_64-smp). We then run with one MPI process per node (-np X -ppn 1, where X is the number of nodes) and 12 threads (++ppn 12). However, as mentioned, we are unable to scale beyond 16 nodes.


We've also tried building Charm++ without an underlying MPI library (charm architectures net-linux-x86_64-icc-ibverbs and net-linux-x86_64-icc-ibverbs-smp). However, with these builds, performance is slower than with the mpi-linux-x86_64 builds. When we run with "+p X ++ppn 12", the CPU time is considerably less than the wall time, indicating that a lot of time is spent waiting for communication. We understand that the SMP version funnels everything through a single communication thread, but it is weird that this so dramatically limits the scalability of the non-MPI builds of Charm++. We get somewhat better results from the non-SMP versions (+p 12*X), but they are still not as fast as the mpi-linux-x86_64-smp build when we scale to multiple nodes.


We should note that for non-CUDA (CPU only) NAMD, running with net-linux-x86_64-icc-ibverbs builds is substantially faster than the mpi-linux-x86_64 compiled versions. So it is a bit strange to us that for the CUDA case the situation is reversed so dramatically. We feel that we may not understand the optimal way to run on our new cluster. Does anyone have experience running on a distributed cluster where each node has a single GPU (as opposed to multiple GPUs per node)? Are there any performance tuning and optimization hints that you can share?


We've tried several different sizes of systems (with 370K atoms being the biggest, down to 70K atoms) and we are just not seeing scalability like we see from the CPU-only version.






This archive was generated by hypermail 2.1.6 : Thu Dec 31 2015 - 23:21:31 CST