Asking for help on the results of our GPU benchmark

From: 周文昌 (wenchangyu2006_at_gmail.com)
Date: Wed Dec 17 2014 - 15:13:24 CST

Dear all,

We are asking for help here with our GPU benchmark results. We would
greatly appreciate it if those of you with experience using GPUs could
take the time to read this (sorry for such a long letter).

We are running NAMD on a cluster that consists of 48 nodes (dual
E5-2630v2 processors, i.e. 12 cores per node, 32 GB of RAM, and a single
Tesla K20X GPU per node). The nodes are interconnected in a non-blocking
FDR InfiniBand fat-tree topology. We are testing the scalability of NAMD
and are running into some issues.

 It seems that for a system of ~370K atoms, we are unable to scale beyond
16 nodes. We've tried both custom-compiling NAMD and using the pre-built
binaries (version 2.10 in both cases). We get the best performance when
custom-compiling Charm++ and NAMD with Intel MPI version 5 (charm-arch
mpi-linux-x86_64-smp). We then run with one MPI process per node (-np X
-ppn 1, where X is the number of nodes) and 12 threads per process
(++ppn 12). However, as mentioned, we are unable to scale beyond 16 nodes.
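
For concreteness, the launch line we use with this build looks roughly
like the following (shown for 16 nodes; "our_system.namd" is only a
placeholder for our NAMD config file):

  mpirun -np 16 -ppn 1 namd2 ++ppn 12 our_system.namd > our_system.log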

 We've also tried building Charm++ without an underlying MPI library
(charm architectures net-linux-x86_64-icc-ibverbs and
net-linux-x86_64-icc-ibverbs-smp). However, with these builds performance
is slower than with the mpi-linux-x86_64 builds. When we run with "+p X
++ppn 12", the CPU time is considerably less than the wall time,
indicating that a lot of time is spent waiting on communication. We
understand that the SMP version funnels everything through a single
communication thread, but it is strange that this so dramatically limits
the scalability of the non-MPI builds of Charm++. We get somewhat better
results from the non-SMP versions (+p 12*X), but they are still not as
fast as the mpi-linux-x86_64-smp build when we scale to multiple nodes.
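
For reference, the charmrun launch for the ibverbs-smp build looks
roughly like the following (again for 16 nodes, taking +p as the total
worker-thread count, 16 x 12 = 192; the nodelist path and config file
name are placeholders):

  charmrun ++nodelist ./nodelist +p 192 ++ppn 12 namd2 our_system.namd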

 We should note that for non-CUDA (CPU-only) NAMD, running the
net-linux-x86_64-icc-ibverbs builds is substantially faster than the
mpi-linux-x86_64 builds. So it is a bit strange to us that in the CUDA
case the situation is reversed so dramatically. We feel that we may not
understand the optimal way to run on our new cluster. Does anyone have
experience running on a distributed cluster where each node has a single
GPU (as opposed to multiple GPUs per node)? Are there any
performance-tuning or optimization hints that you can share?

 We've tried several system sizes (from 370K atoms at the largest down to
70K atoms), and we are just not seeing the kind of scalability that we
see from the CPU-only version.

Thanks!

Wenchang
