Re: Asking help on results of our GPU benchmark

From: Norman Geist
Date: Thu Dec 18 2014 - 02:39:51 CST



Given that you didn't use the word "ibverbs" in your post, I suppose you are running your network traffic over IPoIB (ib0). Is that right?

If so, could you please give me the output of:


cat /sys/class/net/ib0/m*


I suppose it will output something like:





But it should be:





Also please give the output of:


/sbin/ifconfig -a
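If it is easier to collect this on every node, the sysfs check can be wrapped in a small shell function. This is only a sketch: the function name show_ipoib and its optional second argument (an alternative sysfs root, there purely so the function can be tried without real hardware) are made up for illustration. It assumes the interface directory holds the usual "mode" and "mtu" files.

```shell
# Print the IPoIB transport mode and MTU of one interface.
#   $1  interface name (defaults to ib0)
#   $2  sysfs root (defaults to /sys/class/net; hypothetical hook for testing)
show_ipoib() {
    iface="${1:-ib0}"
    root="${2:-/sys/class/net}"
    dir="$root/$iface"

    if [ ! -d "$dir" ]; then
        echo "$iface: not found under $root" >&2
        return 1
    fi

    # The "mode" file is only present for IPoIB interfaces.
    if [ -r "$dir/mode" ]; then
        mode=$(cat "$dir/mode")
    else
        mode="n/a"
    fi

    mtu=$(cat "$dir/mtu")
    echo "$iface mode=$mode mtu=$mtu"
}
```

Calling "show_ipoib ib0" on each node prints one line per node, which is easy to paste into a reply.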


Additionally, could we please see your benchmark data (time/step or days/ns) for the 1, 2, 4, 8, and 16 node cases?
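To make such numbers easier to compare, they can be turned into speedup and parallel efficiency relative to the smallest run. The following is a rough sketch, not anything NAMD provides: the function name scaling_table and the input layout (one "nodes days_per_ns" pair per line, smallest node count first) are assumptions chosen for illustration. Because days/ns is a "lower is better" figure, speedup is taken as baseline time divided by current time.

```shell
# Read "nodes days_per_ns" pairs from stdin and print, per line:
#   nodes  speedup  efficiency
# Both are relative to the first data row. Lines starting with '#' are skipped.
scaling_table() {
    awk 'NF >= 2 && $1 !~ /^#/ && $2 > 0 {
        if (!seen) { n0 = $1; t0 = $2; seen = 1 }   # first row is the baseline
        speedup = t0 / $2                           # lower days/ns means faster
        eff = speedup / ($1 / n0)                   # 1.00 would be ideal scaling
        printf "%d %.2f %.2f\n", $1, speedup, eff
    }'
}
```

With the measurements saved in a file (say bench.txt, a hypothetical name), "scaling_table < bench.txt" shows at which node count the efficiency starts to drop.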


Norman Geist.


From: [] On behalf of ???
Sent: Wednesday, December 17, 2014 22:13
Subject: namd-l: Asking help on results of our GPU benchmark


Dear all,


We are asking for help with our GPU benchmark results. If you have experience using GPUs, we would greatly appreciate you reading this (sorry for such a long letter).

We are running NAMD on a cluster that consists of 48 nodes (dual E5-2630v2 processors - 12 cores per node, 32 GB of RAM, and a single Tesla K20x GPU per node). The nodes are interconnected by a non-blocking FDR InfiniBand fat-tree topology. We are testing the scalability of NAMD, and are running into some issues.


It seems that for a system of ~370K atoms, we are unable to scale beyond 16 nodes. We've tried both custom-compiling NAMD and using pre-built binaries (running version 2.10 in both cases). We get the best performance when custom-compiling Charm++ and NAMD using Intel MPI version 5 (charm-arch mpi-linux-x86_64-smp). We then run with one MPI process per node (-np X -ppn 1, where X is the number of nodes) and 12 threads (++ppn 12). However, as mentioned, we are unable to scale beyond 16 nodes.


We've also tried building Charm++ without an underlying MPI library (charm architectures net-linux-x86_64-icc-ibverbs and net-linux-x86_64-icc-ibverbs-smp). However, with these builds, performance is slower than with the mpi-linux-x86_64 builds. When we run with "+p X ++ppn 12", the CPU time is considerably less than the wall time, indicating that a lot of time is spent waiting for communication. We understand that the SMP version funnels everything through a single communication thread, but it is odd that this so dramatically limits the scalability of the non-MPI builds of Charm++. We get somewhat better results from the non-SMP versions (+p 12*X), but they are still not as fast as the mpi-linux-x86_64-smp build when we scale to multiple nodes.


We should note that for non-CUDA (CPU only) NAMD, running with net-linux-x86_64-icc-ibverbs builds is substantially faster than the mpi-linux-x86_64 compiled versions. So it is a bit strange to us that for the CUDA case the situation is reversed so dramatically. We feel that we may not understand the optimal way to run on our new cluster. Does anyone have experience running on a distributed cluster where each node has a single GPU (as opposed to multiple GPUs per node)? Are there any performance tuning and optimization hints that you can share?


We've tried several different sizes of systems (with 370K atoms being the biggest, down to 70K atoms) and we are just not seeing scalability like we see from the CPU-only version.



