Speculative parallelization on GPGPUs

Authors:
Min Feng;Rajiv Gupta;Laxmi N. Bhuyan
Affiliations:
University of California, Riverside, Riverside, CA, USA;University of California, Riverside, Riverside, CA, USA;University of California, Riverside, Riverside, CA, USA
Venue:
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Year:
2012

Citing 5
Cited 0

Semantical interprocedural parallelization: an overview of the PIPS project

ICS '91 Proceedings of the 5th international conference on Supercomputing
LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
OpenMP to GPGPU: a compiler framework for automatic translation and optimization

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
A GPGPU compiler for memory optimization and parallelism management

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
On-the-fly elimination of dynamic irregularities for GPU computing

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper overviews the first speculative parallelization technique for GPUs that can exploit parallelism in loops even in the presence of dynamic irregularities that may give rise to cross-iteration dependences. The execution of a speculatively parallelized loop consists of five phases: scheduling, computation, misspeculation check, result committing, and misspeculation recovery. We perform misspeculation check on the GPU to minimize its cost. We optimize the procedures of result committing and misspeculation recovery to reduce the result copying and recovery overhead. Finally, the scheduling policies are designed according to the types of cross-iteration dependences to reduce the misspeculation rate. Our preliminary evaluation was conducted on an nVidia Tesla C1060 hosted in an Intel(R) Xeon(R) E5540 machine. We use three benchmarks of which two contain irregular memory accesses and one contain irregular control flows that can give rise to cross-iteration dependences. Our implementation achieves 3.6x-13.8x speedups for loops in these benchmarks.