Dymaxion: optimizing memory access patterns for heterogeneous systems

Authors:
Shuai Che;Jeremy W. Sheaffer;Kevin Skadron
Affiliations:
University of Virginia;University of Virginia;University of Virginia
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Citing 14
Cited 8

The Impulse Memory Controller

IEEE Transactions on Computers
Scan primitives for GPU computing

Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
Merge: a programming model for heterogeneous multi-core systems

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Efficient gather and scatter operations on graphics processors

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Scalable Parallel Programming with CUDA

Queue - GPU Computing
NVIDIA Tesla: A Unified Graphics and Computing Architecture

IEEE Micro
A performance study of general-purpose applications on graphics processors using CUDA

Journal of Parallel and Distributed Computing
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Rodinia: A benchmark suite for heterogeneous computing

IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Micro-pages: increasing DRAM efficiency with locality-aware data placement

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Data layout transformation exploiting memory-level parallelism in structured grid many-core applications

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

IEEE Transactions on Parallel and Distributed Systems
On-the-fly elimination of dynamic irregularities for GPU computing

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Where is the data? Why you cannot debate CPU vs. GPU performance without the answer

ISPASS '11 Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software

GPU Performance Enhancement via Communication Cost Reduction: Case Studies of Radix Sort and WSN Relay Node Placement Problem

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
A compiler-assisted runtime-prefetching scheme for heterogeneous platforms

IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World
Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU

Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Nonzero pattern analysis and memory access optimization in GPU-based sparse LU factorization for circuit simulation

IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms
Portable and Transparent Host-Device Communication Optimization for GPGPU Environments

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
In-place transposition of rectangular matrices on accelerators

Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
An application-centric evaluation of OpenCL on multi-core CPUs

Parallel Computing
APR: A Novel Parallel Repacking Algorithm for Efficient GPGPU Parallel Code Transformation

Proceedings of Workshop on General Purpose Processing Using GPUs

Quantified Score

Hi-index	0.00

Visualization

Abstract

Graphics processors (GPUs) have emerged as an important platform for general purpose computing. GPUs offer a large number of parallel cores and have access to high memory bandwidth; however, data structure layouts in GPU memory often lead to suboptimal performance for programs designed with a CPU memory interface---or no particular memory interface at all!---in mind. This implies that application performance is highly sensitive irregularity in memory access patterns. This issue is all the more important due to the growing disparity between core and DRAM clocks; memory interfaces have increasingly become bottlenecks in computer systems. In this paper, we propose a simple API, Dymaxion, that allows programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms. Use of Dymaxion requires only minimal modifications to existing CUDA programs. Our current framework extends NVIDIA's CUDA API with the addition of memory layout remapping and index transformation. We consider the overhead of layout remapping and effectively hide it through chunking and overlapping with PCI-E transfer. We present the implementation of Dymaxion and its optimizations and evaluate a variety of important memory access patterns. Using four case studies, we are able to achieve 3.3x speedup on GPU kernels and 20% overall performance improvement, including the PCI-E transfer, over the original CUDA implementations on an NVIDIA GTX 480 GPU. We also explore the importance of maintaining per-device data layouts and cross-device data mappings with a case study of concurrent CPU-GPU execution.