Auto-vectorization of interleaved data for SIMD

Authors:
Dorit Nuzman;Ira Rosen;Ayal Zaks
Affiliations:
IBM Haifa Labs, Haifa, Israel;IBM Haifa Labs, Haifa, Israel;IBM Haifa Labs, Haifa, Israel
Venue:
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Year:
2006

Citing 27
Cited 31

Practical dependence testing

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Exploiting SIMD parallelism in DSP and multimedia algorithms using the AltiVec technology

ICS '99 Proceedings of the 13th international conference on Supercomputing
Exploiting a new level of DLP in multimedia applications

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Exploiting superword level parallelism with multimedia instruction sets

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Optimizing compilers for modern architectures: a dependence-based approach

Optimizing compilers for modern architectures: a dependence-based approach
Tarantula: a vector extension to the alpha architecture

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
High Performance Compilers for Parallel Computing

High Performance Compilers for Parallel Computing
MMX Technology Extension to the Intel Architecture

IEEE Micro
AltiVec Extension to PowerPC Accelerates Media Processing

IEEE Micro
Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Torrent Architecture Manual

Torrent Architecture Manual
Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements

IEEE Transactions on Computers
Vectorizing for a SIMdD DSP architecture

Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
Software Vectorization Handbook, The: Applying Intel Multimedia Extensions for Maximum Performance

Software Vectorization Handbook, The: Applying Intel Multimedia Extensions for Maximum Performance
Vectorization for SIMD architectures with alignment constraints

Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
An innovative low-power high-performance programmable signal processor for digital communications

IBM Journal of Research and Development
A High-Performance SIMD Floating Point Unit for BlueGene/L: Architecture, Compilation, and Algorithm Design

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Superword-Level Parallelism in the Presence of Control Flow

Proceedings of the international symposium on Code generation and optimization
Efficient SIMD Code Generation for Runtime Alignment and Length Conversion

Proceedings of the international symposium on Code generation and optimization
An Empirical Study On the Vectorization of Multimedia Applications for Multimedia Extensions

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
SWARP: a retargetable preprocessor for multimedia instructions: Research Articles

Concurrency and Computation: Practice & Experience - Compilers for Parallel Computers
Generation of permutations for SIMD processors

LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Multi-platform Auto-vectorization

Proceedings of the International Symposium on Code Generation and Optimization
Optimizing data permutations for SIMD devices

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Introduction to the cell multiprocessor

IBM Journal of Research and Development - POWER5 and packaging
Vectorization techniques for the Blue Gene/L double FPU

IBM Journal of Research and Development
Induction variable analysis with delayed abstractions

HiPEAC'05 Proceedings of the First international conference on High Performance Embedded Architectures and Compilers

Optimizing data permutations for SIMD devices

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Sams: single-affiliation multiple-stride parallel memory scheme

Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
Exploiting SIMD Parallelism with the CGiS Compiler Framework

Languages and Compilers for Parallel Computing
Outer-loop vectorization: revisited for short SIMD architectures

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
On the exploitation of loop-level parallelism in embedded applications

ACM Transactions on Embedded Computing Systems (TECS)
A SIMD optimization framework for retargetable compilers

ACM Transactions on Architecture and Code Optimization (TACO)
Generation of Pack Instruction Sequence for Media Processors Using Multi-Valued Decision Diagram

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences
Access-pattern-aware on-chip memory allocation for SIMD processors

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
MacroSS: macro-SIMDization of streaming applications

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Dependence-based code generation for a CELL processor

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Generating SIMD vectorized permutations

CC'08/ETAPS'08 Proceedings of the Joint European Conferences on Theory and Practice of Software 17th international conference on Compiler construction
SAMS multi-layout memory: providing multiple views of data to boost SIMD performance

Proceedings of the 24th ACM International Conference on Supercomputing
Vectorization for Java

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
A new compilation technique for SIMD code generation across basic block boundaries

Proceedings of the 2010 Asia and South Pacific Design Automation Conference
A taxonomy of accelerator architectures and their programming models

IBM Journal of Research and Development
Data layout transformation for stencil computations on short-vector SIMD architectures

CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Automatic SIMD vectorization of fast fourier transforms for the larrabee and AVX instruction sets

Proceedings of the international conference on Supercomputing
Using machine learning to improve automatic vectorization

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Efficient SIMD code generation for irregular kernels

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Mapping streaming languages to general purpose processors through vectorization

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
A compiler framework for extracting superword level parallelism

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Dynamic trace-based analysis of vectorization potential of applications

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Data layout optimization for multi-valued containers in OpenCL

Journal of Parallel and Distributed Computing
Extending OpenMP* with vector constructs for modern multicore SIMD architectures

IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World
Polyhedral parallel code generation for CUDA

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
From relational verification to SIMD loop synthesis

Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
When polyhedral transformations meet SIMD code generation

Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Vectorization past dependent branches through speculation

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
A Basic Linear Algebra Compiler

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

Quantified Score

Hi-index	0.00

Visualization

Abstract

Most implementations of the Single Instruction Multiple Data (SIMD) model available today require that data elements be packed in vector registers. Operations on disjoint vector elements are not supported directly and require explicit data reorganization manipulations. Computations on non-contiguous and especially interleaved data appear in important applications, which can greatly benefit from SIMD instructions once the data is reorganized properly. Vectorizing such computations efficiently is therefore an ambitious challenge for both programmers and vectorizing compilers. We demonstrate an automatic compilation scheme that supports effective vectorization in the presence of interleaved data with constant strides that are powers of 2, facilitating data reorganization. We demonstrate how our vectorization scheme applies to dominant SIMD architectures, and present experimental results on a wide range of key kernels, showing speedups in execution time up to 3.7 for interleaving levels (stride) as high as 8.