Optimizing data permutations for SIMD devices

Authors:
Gang Ren;Peng Wu;David Padua
Affiliations:
University of Illinois at Urbana-Champ., Urbana, IL;IBM T.J. Watson Research Center, Yorktown Heights, NY;University of Illinois at Urbana-Champ., Urbana, IL
Venue:
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Year:
2006

Citing 19
Cited 22

Automatic array alignment in data-parallel programs

POPL '93 Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The Complexity of Multiterminal Cuts

SIAM Journal on Computing
An array operation synthesis scheme to optimize Fortran 90 programs

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Exploiting superword level parallelism with multimedia instruction sets

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Compilation techniques for multimedia processors

International Journal of Parallel Programming - Special issue on instruction-level parallelism and parallelizing compilation, Part 1
A vectorizing compiler for multimedia extensions

International Journal of Parallel Programming - Special issue on instruction-level parallelism and parallelizing compilation, Part 1
SPL: a language and compiler for DSP algorithms

Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Code Optimization Techniques for Embedded Processors: Methods, Algorithms, and Tools

Code Optimization Techniques for Embedded Processors: Methods, Algorithms, and Tools
Increasing and Detecting Memory Address Congruence

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Vectorizing for a SIMdD DSP architecture

Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
Software Vectorization Handbook, The: Applying Intel Multimedia Extensions for Maximum Performance

Software Vectorization Handbook, The: Applying Intel Multimedia Extensions for Maximum Performance
Vectorization for SIMD architectures with alignment constraints

Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Optimizing Sorting with Genetic Algorithms

Proceedings of the international symposium on Code generation and optimization
Efficient SIMD Code Generation for Runtime Alignment and Length Conversion

Proceedings of the international symposium on Code generation and optimization
An Empirical Study On the Vectorization of Multimedia Applications for Multimedia Extensions

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Programming by sketching for bit-streaming programs

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Generation of permutations for SIMD processors

LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Auto-vectorization of interleaved data for SIMD

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
HiLO: high level optimization of FFTs

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing

Auto-vectorization of interleaved data for SIMD

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Combining analytical and empirical approaches in tuning matrix transposition

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Optimal bit-reversal using vector permutations

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
An exact data dependence testing method for quadratic expressions

Information Sciences: an International Journal
CellSort: high performance sorting on the cell processor

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Sams: single-affiliation multiple-stride parallel memory scheme

Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
Efficient vectorization of SIMD programs with non-aligned and irregular data access hardware

CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
On the exploitation of loop-level parallelism in embedded applications

ACM Transactions on Embedded Computing Systems (TECS)
CUDA-Lite: Reducing GPU Programming Complexity

Languages and Compilers for Parallel Computing
A SIMD optimization framework for retargetable compilers

ACM Transactions on Architecture and Code Optimization (TACO)
MacroSS: macro-SIMDization of streaming applications

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
New algorithms for SIMD alignment

CC'07 Proceedings of the 16th international conference on Compiler construction
Generating SIMD vectorized permutations

CC'08/ETAPS'08 Proceedings of the Joint European Conferences on Theory and Practice of Software 17th international conference on Compiler construction
SAMS multi-layout memory: providing multiple views of data to boost SIMD performance

Proceedings of the 24th ACM International Conference on Supercomputing
Automatic SIMD vectorization of fast fourier transforms for the larrabee and AVX instruction sets

Proceedings of the international conference on Supercomputing
Efficient SIMD code generation for irregular kernels

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
A compiler framework for extracting superword level parallelism

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Data layout optimization for multi-valued containers in OpenCL

Journal of Parallel and Distributed Computing
Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Data-parallel finite-state machines

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Easy, fast, and energy-efficient object detection on heterogeneous on-chip architectures

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.01

Visualization

Abstract

The widespread presence of SIMD devices in today's microprocessors has made compiler techniques for these devices tremendously important. One of the most important and difficult issues that must be addressed by these techniques is the generation of the data permutation instructions needed for non-contiguous and misaligned memory references. These instructions are expensive and, therefore, it is of crucial importance to minimize their number to improve performance and, in many cases, enable speedups over scalar code.Although it is often difficult to optimize an isolated data reorganization operation, a collection of related data permutations can often be manipulated to reduce the number of operations. This paper presents a strategy to optimize all forms of data permutations. The strategy is organized into three steps. First, all data permutations in the source program are converted into a generic representation. These permutations can originate from vector accesses to non-contiguous and misaligned memory locations or result from compiler transformations. Second, an optimization algorithm is applied to reduce the number of data permutations in a basic block. By propagating permutations across statements and merging consecutive permutations whenever possible, the algorithm can significantly reduce the number of data permutations. Finally, a code generation algorithm translates generic permutation operations into native permutation instructions for the target platform. Experiments were conducted on various kinds of applications. The results show that up to 77% of the permutation instructions are eliminated and, as a result, the average performance improvement is 48% on VMX and 68% on SSE2. For several applications, near perfect speedups have been achieved on both platforms.