Vapor SIMD: Auto-vectorize once, run everywhere

Authors:
Dorit Nuzman;Sergei Dyshel;Erven Rohou;Ira Rosen;Kevin Williams;David Yuste;Albert Cohen;Ayal Zaks
Affiliations:
IBM Haifa Research Lab, Haifa, Israel - HiPEAC;IBM Haifa Research Lab, Haifa, Israel - HiPEAC;INRIA Rennes - Bretagne Atlantique, Rennes, France - HiPEAC;IBM Haifa Research Lab, Haifa, Israel - HiPEAC;INRIA Rennes - Bretagne Atlantique, Rennes, France - HiPEAC;INRIA Rennes - Bretagne Atlantique, Rennes, France - HiPEAC;INRIA Saclay - Île-de-France, Orsay, France - HiPEAC;IBM Haifa Research Lab, Haifa, Israel - HiPEAC
Venue:
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Year:
2011

Abstract

Just-in-Time (JIT) compiler technology offers portability while facilitating target- and context-specific specialization. Single-Instruction-Multiple-Data (SIMD) hardware is ubiquitous and markedly diverse, but can be difficult for JIT compilers to efficiently target due to resource and budget constraints. We present our design for a synergistic auto-vectorizing compilation scheme. The scheme is composed of an aggressive, generic offline stage coupled with a lightweight, target-specific online stage. Our method leverages the optimized intermediate results provided by the first stage across disparate SIMD architectures from different vendors, having distinct characteristics ranging from different vector sizes, memory alignment and access constraints, to special computational idioms. We demonstrate the effectiveness of our design using a set of kernels that exercise innermost loop, outer loop, as well as straight-line code vectorization, all automatically extracted by the common offline compilation stage. This results in performance comparable to that provided by specialized monolithic offline compilers. Our framework is implemented using open-source tools and standards, thereby promoting interoperability and extendibility.
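
The split described in the abstract (an expensive, target-independent offline stage followed by a cheap, target-specific online stage) can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Target`, `offline_vectorize`, and `online_lower` are hypothetical, the "vectorized" loop is simulated with Python list chunks, and the only target property modeled is vector size (alignment constraints and special idioms are left out).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Target:
    """Hypothetical target description; vector_bytes == 0 means no SIMD unit."""
    name: str
    vector_bytes: int


Kernel = Callable[[List[int], List[int]], List[int]]


def offline_vectorize(elem_bytes: int) -> Callable[[Target], Tuple[int, Kernel]]:
    """Offline stage (assumed): decide once that c[i] = a[i] + b[i] is
    vectorizable and emit a plan that leaves the vector width symbolic."""

    def online_lower(target: Target) -> Tuple[int, Kernel]:
        """Online stage (assumed): bind the symbolic width to the target."""
        # Vectorization factor: elements per vector, or 1 for scalar fallback.
        vf = max(1, target.vector_bytes // elem_bytes)

        def kernel(a: List[int], b: List[int]) -> List[int]:
            n = len(a)
            main = n - n % vf  # iterations covered by full vectors
            out: List[int] = []
            # Main loop: one simulated vector add per chunk of vf elements.
            for i in range(0, main, vf):
                out.extend(x + y for x, y in zip(a[i:i + vf], b[i:i + vf]))
            # Scalar epilogue for the leftover elements.
            for i in range(main, n):
                out.append(a[i] + b[i])
            return out

        return vf, kernel

    return online_lower


# The offline plan is produced once and reused for every target.
plan = offline_vectorize(elem_bytes=4)
vf_wide, add_wide = plan(Target("hypothetical-128bit", 16))
vf_scalar, add_scalar = plan(Target("hypothetical-scalar", 0))
```

In this sketch the same offline result yields a factor of 4 on the 128-bit target and a scalar loop on the target without SIMD, and both kernels compute the same values; only the chunking differs.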