Automatic array alignment in data-parallel programs
POPL '93 Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The Complexity of Multiterminal Cuts
SIAM Journal on Computing
An array operation synthesis scheme to optimize Fortran 90 programs
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Exploiting superword level parallelism with multimedia instruction sets
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Compilation techniques for multimedia processors
International Journal of Parallel Programming - Special issue on instruction-level parallelism and parallelizing compilation, Part 1
A vectorizing compiler for multimedia extensions
International Journal of Parallel Programming - Special issue on instruction-level parallelism and parallelizing compilation, Part 1
SPL: a language and compiler for DSP algorithms
Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Code Optimization Techniques for Embedded Processors: Methods, Algorithms, and Tools
Code Optimization Techniques for Embedded Processors: Methods, Algorithms, and Tools
Increasing and Detecting Memory Address Congruence
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Vectorizing for a SIMdD DSP architecture
Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
Software Vectorization Handbook, The: Applying Intel Multimedia Extensions for Maximum Performance
Software Vectorization Handbook, The: Applying Intel Multimedia Extensions for Maximum Performance
Vectorization for SIMD architectures with alignment constraints
Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Optimizing Sorting with Genetic Algorithms
Proceedings of the international symposium on Code generation and optimization
Efficient SIMD Code Generation for Runtime Alignment and Length Conversion
Proceedings of the international symposium on Code generation and optimization
An Empirical Study On the Vectorization of Multimedia Applications for Multimedia Extensions
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Programming by sketching for bit-streaming programs
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Generation of permutations for SIMD processors
LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Auto-vectorization of interleaved data for SIMD
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
HiLO: high level optimization of FFTs
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Auto-vectorization of interleaved data for SIMD
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Combining analytical and empirical approaches in tuning matrix transposition
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Optimal bit-reversal using vector permutations
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
An exact data dependence testing method for quadratic expressions
Information Sciences: an International Journal
CellSort: high performance sorting on the cell processor
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Sams: single-affiliation multiple-stride parallel memory scheme
Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
Efficient vectorization of SIMD programs with non-aligned and irregular data access hardware
CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
On the exploitation of loop-level parallelism in embedded applications
ACM Transactions on Embedded Computing Systems (TECS)
CUDA-Lite: Reducing GPU Programming Complexity
Languages and Compilers for Parallel Computing
A SIMD optimization framework for retargetable compilers
ACM Transactions on Architecture and Code Optimization (TACO)
MacroSS: macro-SIMDization of streaming applications
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
New algorithms for SIMD alignment
CC'07 Proceedings of the 16th international conference on Compiler construction
Generating SIMD vectorized permutations
CC'08/ETAPS'08 Proceedings of the Joint European Conferences on Theory and Practice of Software 17th international conference on Compiler construction
SAMS multi-layout memory: providing multiple views of data to boost SIMD performance
Proceedings of the 24th ACM International Conference on Supercomputing
Automatic SIMD vectorization of fast fourier transforms for the larrabee and AVX instruction sets
Proceedings of the international conference on Supercomputing
Efficient SIMD code generation for irregular kernels
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
A compiler framework for extracting superword level parallelism
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Data layout optimization for multi-valued containers in OpenCL
Journal of Parallel and Distributed Computing
Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Data-parallel finite-state machines
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Easy, fast, and energy-efficient object detection on heterogeneous on-chip architectures
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.01 |
The widespread presence of SIMD devices in today's microprocessors has made compiler techniques for these devices tremendously important. One of the most important and difficult issues that must be addressed by these techniques is the generation of the data permutation instructions needed for non-contiguous and misaligned memory references. These instructions are expensive and, therefore, it is of crucial importance to minimize their number to improve performance and, in many cases, enable speedups over scalar code.Although it is often difficult to optimize an isolated data reorganization operation, a collection of related data permutations can often be manipulated to reduce the number of operations. This paper presents a strategy to optimize all forms of data permutations. The strategy is organized into three steps. First, all data permutations in the source program are converted into a generic representation. These permutations can originate from vector accesses to non-contiguous and misaligned memory locations or result from compiler transformations. Second, an optimization algorithm is applied to reduce the number of data permutations in a basic block. By propagating permutations across statements and merging consecutive permutations whenever possible, the algorithm can significantly reduce the number of data permutations. Finally, a code generation algorithm translates generic permutation operations into native permutation instructions for the target platform. Experiments were conducted on various kinds of applications. The results show that up to 77% of the permutation instructions are eliminated and, as a result, the average performance improvement is 48% on VMX and 68% on SSE2. For several applications, near perfect speedups have been achieved on both platforms.