Data transformations enabling loop vectorization on multithreaded data parallel architectures

Authors:
Byunghyun Jang;Perhaad Mistry;Dana Schaa;Rodrigo Dominguez;David Kaeli
Affiliations:
Northeastern University, Boston, MA, USA;Northeastern University, Boston, MA, USA;Northeastern University, Boston, MA, USA;Northeastern University, Boston, MA, USA;Northeastern University, Boston, MA, USA
Venue:
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Year:
2010

Citing 4
Cited 4

Cache miss equations: an analytical representation of cache misses

ICS '97 Proceedings of the 11th international conference on Supercomputing
Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Architecture-aware optimization targeting multithreaded stream computing

Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Multi GPU implementation of iterative tomographic reconstruction algorithms

ISBI'09 Proceedings of the Sixth IEEE international conference on Symposium on Biomedical Imaging: From Nano to Macro

Accelerating the local outlier factor algorithm on a GPU for intrusion detection systems

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Data layout transformation exploiting memory-level parallelism in structured grid many-core applications

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Data layout transformation for stencil computations on short-vector SIMD architectures

CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
GPU Performance Enhancement via Communication Cost Reduction: Case Studies of Radix Sort and WSN Relay Node Placement Problem

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memory access patterns in the data stream. This paper describes data transformations that allow us to vectorize loops targeting massively multithreaded data parallel architectures. We present a mathematical model that captures loop-based memory access patterns and computes the most appropriate data transformations in order to enable vectorization. Our experimental results show that the proposed data transformations can significantly increase the number of loops that can be vectorized and enhance the data-level parallelism of applications. Our results also show that the overhead associated with our data transformations can be easily amortized as the size of the input data set increases. For the set of high performance benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4X) by applying vectorization using our data transformation approach.