Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU

Authors:
Bo Wu; Zhijia Zhao; Eddy Zheng Zhang; Yunlian Jiang; Xipeng Shen
Affiliations:
The College of William and Mary, Williamsburg, VA, USA; The College of William and Mary, Williamsburg, VA, USA; Rutgers University, New Brunswick, NJ, USA; Google, Mountain View, CA, USA; The College of William and Mary, Williamsburg, VA, USA
Venue:
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2013

Abstract

The performance of Graphics Processing Units (GPUs) is sensitive to irregular memory references. Recent work shows the promise of data reorganization for eliminating the non-coalesced memory accesses that irregular references cause. However, all previous studies have used simple heuristic methods to determine the new data layouts. As a result, they either provide no performance guarantee or are effective in only a limited set of scenarios. This paper contributes a fundamental study of the problem. It systematically analyzes the inherent complexity of the problem in various settings and, for the first time, proves that the problem is NP-complete. It then points out the limitations of existing techniques and reveals that, in practice, designing an appropriate data reorganization algorithm reduces to a tradeoff among space, time, and complexity. Based on that insight, it develops two new data reorganization algorithms that overcome the limitations of previous methods. Experiments show that a combination of the new algorithms and a previous algorithm can circumvent the inherent complexity of finding optimal data layouts, making it feasible to minimize non-coalesced memory accesses for a variety of irregular applications and settings that are beyond the reach of existing techniques.
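
To make the cost being minimized concrete, the following Python sketch counts memory transactions under a simplified model and applies a naive first-touch relayout. It is an illustration only, not one of the algorithms from the paper. The function names `count_transactions` and `first_touch_layout`, the warp size, and the segment size are all assumptions chosen for the example: each thread `t` is assumed to read `data[idx[t]]`, and each warp is assumed to issue one transaction per distinct memory segment it touches.

```python
from typing import List, Sequence, Tuple


def count_transactions(idx: Sequence[int], warp_size: int = 32,
                       seg_elems: int = 32) -> int:
    """Sum, over warps, of the number of distinct segments each warp touches.

    Simplified model (an assumption): one transaction per distinct segment
    of `seg_elems` consecutive elements accessed by a warp.
    """
    total = 0
    for start in range(0, len(idx), warp_size):
        warp = idx[start:start + warp_size]
        total += len({i // seg_elems for i in warp})
    return total


def first_touch_layout(idx: Sequence[int], n: int) -> Tuple[List[int], List[int]]:
    """Naive heuristic (hypothetical): place elements in order of first access.

    Returns (new_pos, new_idx), where new_pos[old] is the new location of
    element `old`, and new_idx is the access sequence in the new layout.
    """
    new_pos = [-1] * n
    next_free = 0
    for i in idx:
        if new_pos[i] == -1:          # first time this element is accessed
            new_pos[i] = next_free
            next_free += 1
    for old in range(n):              # elements that are never accessed
        if new_pos[old] == -1:
            new_pos[old] = next_free
            next_free += 1
    new_idx = [new_pos[i] for i in idx]
    return new_pos, new_idx


# Strided access over 16 elements with 4-thread warps and 4-element segments.
idx = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
before = count_transactions(idx, warp_size=4, seg_elems=4)
new_pos, new_idx = first_touch_layout(idx, n=16)
after = count_transactions(new_idx, warp_size=4, seg_elems=4)
print(before, after)  # prints "16 4"
```

In this example every element is accessed exactly once, so a first-touch ordering is enough to make each warp touch a single segment. When the same element is read by threads in different warps, a simple ordering like this gives no guarantee on the transaction count.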