The introduction of general-purpose computation on GPUs (GPGPU) has changed the landscape of parallel computing. At the core of this shift are massively multithreaded, data-parallel architectures that deliver impressive acceleration, offering low-cost supercomputing together with attractive power budgets. Despite these benefits, a number of barriers still delay wider adoption of these architectures. One major issue is the heterogeneous and distributed nature of the memory subsystem commonly found on data-parallel architectures: application acceleration depends heavily on using the memory subsystem effectively so that all execution units remain busy. In this paper, we present techniques for improving the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies. We target vectorization via data transformation for vector-based architectures (e.g., AMD GPUs) and algorithmic memory selection for scalar-based architectures (e.g., NVIDIA GPUs). We demonstrate the effectiveness of the proposed methods on kernels from a wide range of benchmark suites. For the kernels studied, applying our methodology yields consistent and significant performance improvements, up to 11.4× and 13.5× over baseline GPU implementations on the two platforms, respectively.
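The abstract does not include the authors' actual transformations, but a common instance of the data-layout change it alludes to is converting an array-of-structures (AoS) layout into a structure-of-arrays (SoA) layout so that the accesses of consecutive threads in a warp coalesce into wide memory transactions. The CUDA sketch below is illustrative only, not the paper's implementation; the ParticleAoS type and the kernel names are hypothetical.

```cuda
// Minimal sketch (not the paper's code): contrasting an array-of-structures
// (AoS) layout, where adjacent threads access strided addresses, with a
// structure-of-arrays (SoA) layout, where adjacent threads access adjacent
// words that the hardware can coalesce into few memory transactions.
#include <cuda_runtime.h>
#include <cstdio>

struct ParticleAoS { float x, y, z, w; };  // hypothetical interleaved fields

// AoS: thread i reads p[i].x, so neighboring threads are 16 bytes apart.
__global__ void scaleAoS(ParticleAoS* p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;
}

// SoA: thread i reads x[i], so a warp touches one contiguous run of floats.
__global__ void scaleSoA(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    ParticleAoS* aos = nullptr;
    float* soaX = nullptr;
    cudaMalloc(&aos, n * sizeof(ParticleAoS));
    cudaMalloc(&soaX, n * sizeof(float));
    cudaMemset(aos, 0, n * sizeof(ParticleAoS));
    cudaMemset(soaX, 0, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    scaleAoS<<<grid, block>>>(aos, 2.0f, n);   // strided accesses
    scaleSoA<<<grid, block>>>(soaX, 2.0f, n);  // coalesced accesses
    cudaDeviceSynchronize();
    printf("kernels finished: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(aos);
    cudaFree(soaX);
    return 0;
}
```

On scalar-based GPUs, the SoA kernel's per-warp loads fall into a handful of contiguous memory segments, while the AoS kernel's loads are strided by sizeof(ParticleAoS) and require several times as many transactions, which is the kind of inefficiency that access-pattern analysis of loop bodies is meant to expose.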