Javier Cabezas, Isaac Gelado, John E. Stone, Nacho Navarro, David B. Kirk, and
Wen-mei Hwu.
Runtime and architecture support for efficient data exchange in
multi-accelerator applications.
IEEE Transactions on Parallel and Distributed Systems,
26:1405-1418, May 2015.
(PMC: PMC4500157)
CABE2015-JS
Heterogeneous parallel computing applications often process large data
sets that require
multiple GPUs to jointly meet their needs for physical memory capacity
and compute
throughput. However, the lack of high-level abstractions in previous
heterogeneous
parallel programming models forces programmers to resort to multiple
code versions,
complex data copy steps and synchronization schemes when
exchanging data between
multiple GPU devices, which results in high software development cost,
poor
maintainability, and even poor performance. This paper describes the
HPE runtime system,
and the associated architecture support, which enables a simple,
efficient programming
interface for exchanging data between multiple GPUs through either
interconnects or
cross-node network interfaces.
The runtime and architecture support presented in this paper can also be
used to support
other types of accelerators.
We show that the simplified programming interface reduces
programming complexity. The
research presented in this paper started in 2009. It has been
implemented and tested
extensively in several generations of HPE runtime systems as well as
adopted into the
NVIDIA GPU hardware and drivers for CUDA 4.0 and beyond since 2011.
The availability of
real hardware that supports key HPE features gives rise to a rare
opportunity for studying
the effectiveness of the hardware support by running important
benchmarks on real
runtime and hardware. Experimental results show that in an exemplar
heterogeneous
system, peer DMA and double-buffering, pinned buffers, and
software techniques can
improve the inter-accelerator data communication bandwidth by
2×. They can
also improve the execution speed by 1.6× for a 3D finite
difference, 2.5×
for 1D FFT, and 1.6× for merge sort, all measured on real
hardware. The proposed
architecture support enables the HPE runtime to transparently deploy
these optimizations
under simple portable user code, allowing system designers to freely
employ devices of
different capabilities. We further argue that simple interfaces such as HPE
are needed for
most applications to benefit from advanced hardware features in
practice.
The manuscripts available on our site are provided for your personal
use only and may not be retransmitted or redistributed without written
permission from the paper's publisher and author. You may not upload any
of this site's material to any public server, on-line service, network, or
bulletin board without prior written permission from the publisher and
author. You may not make copies for any commercial purpose. Reproduction
or storage of materials retrieved from this web site is subject to the
U.S. Copyright Act of 1976, Title 17 U.S.C.