Portable and Transparent Host-Device Communication Optimization for GPGPU Environments

Authors:
Christos Margiolas;Michael F. P. O'Boyle
Affiliations:
School of Informatics University of Edinburgh;School of Informatics University of Edinburgh
Venue:
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Year:
2014

Citing 21
Cited 0

Efficiently computing static single assignment form and the control dependence graph

ACM Transactions on Programming Languages and Systems (TOPLAS)
Circle fitting by linear and nonlinear least squares

Journal of Optimization Theory and Applications
Composing high-performance memory allocators

Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Adaptive java optimisation using instance-based learning

Proceedings of the 18th annual international conference on Supercomputing
CUBA: an architecture for efficient CPU/co-processor data communication

Proceedings of the 22nd annual international conference on Supercomputing
Software Pipelined Execution of Stream Programs on GPUs

Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
Rodinia: A benchmark suite for heterogeneous computing

IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Profiling General Purpose GPU Applications

SBAC-PAD '09 Proceedings of the 2009 21st International Symposium on Computer Architecture and High Performance Computing
An adaptive performance modeling tool for GPU architectures

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
An asymmetric distributed shared memory model for heterogeneous parallel systems

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
A GPGPU compiler for memory optimization and parallelism management

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
An experimental approach to performance measurement of heterogeneous parallel applications using CUDA

Proceedings of the 24th ACM International Conference on Supercomputing
Achieving a single compute device image in OpenCL for multiple GPUs

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Fermi GF100 GPU Architecture

IEEE Micro
Automatic CPU-GPU communication management and optimization

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
CudaDMA: optimizing GPU memory bandwidth via warp specialization

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Dymaxion: optimizing memory access patterns for heterogeneous systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Compiling a high-level language for GPUs: (via language support for architectures and compilers)

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Dynamically managed data for CPU-GPU architectures

Proceedings of the Tenth International Symposium on Code Generation and Optimization
A compiler-assisted runtime-prefetching scheme for heterogeneous platforms

IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World
Smart, adaptive mapping of parallelism in the presence of external workload

CGO '13 Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

General purpose graphics processors units (GPU) provide the potential for high computational performance with reduced cost and power. Typically they are employed in heterogeneous settings acting as accelerators. Here an application resides on a host multi-core, dispatching work to the GPU. However, workload dispatch is frequently accompanied by large scale data transfers between the host main memory and the dedicated memories of the GPUs. For many applications, memory allocation and communication overhead can severely reduce the benefits of GPU acceleration. This paper develops an approach that reduces host-device communication overhead for OpenCL applications. It does this without modification or recompilation of the application source code and is portable across platforms. It achieves this by tracing and analyzing calls to the runtime made by the application and then selecting the best platform specific memory allocation and communication policy. This approach was applied to 12 existing OpenCL benchmarks from Parboil and Rodinia suites on 3 different platforms where it gives on average a speedup of 1.51, 1.31 and 1.48, respectively. In certain cases, our approach leads up to a factor of three times improvement over current approaches.