Exploiting superword level parallelism with multimedia instruction sets

Authors:
Samuel Larsen;Saman Amarasinghe
Affiliations:
MIT Laboratory for Computer Science, Cambridge, MA;MIT Laboratory for Computer Science, Cambridge, MA
Venue:
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Year:
2000

Citing 13
Cited 99

Compiling Fortran 8x array features for the connection machine computer system

PPEALS '88 Proceedings of the ACM/SIGPLAN conference on Parallel programming: experience with applications, languages and systems
Evaluation of Fortran vector compilers and preprocessors

Software—Practice & Experience
SUIF: an infrastructure for research on parallelizing and optimizing compilers

ACM SIGPLAN Notices
Initial results on the performance and cost of vector microprocessors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Simple vector microprocessors for multimedia applications

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Pointer analysis for multithreaded programs

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Bitwidth analysis with application to silicon compilation

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Dependence graphs and compiler optimizations

POPL '81 Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
How Multimedia Workloads Will Change Processor Design

Computer
VIS Speeds New Media Processing

IEEE Micro
MicroUnity's MediaProcessor Architecture

IEEE Micro
MMX Technology Extension to the Intel Architecture

IEEE Micro
Subword Parallelism with MAX-2

IEEE Micro

Bitwidth analysis with application to silicon compilation

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
C Compiler Design for an Industrial Network Processor

OM '01 Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems
Energy aware compilation for DSPs with SIMD instructions

Proceedings of the joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems
The architecture of the DIVA processing-in-memory chip

ICS '02 Proceedings of the 16th international conference on Supercomputing
Bit section instruction set extension of ARM for embedded applications

CASES '02 Proceedings of the 2002 international conference on Compilers, architecture, and synthesis for embedded systems
Automatic intra-register vectorization for the Intel architecture

International Journal of Parallel Programming
Bitwidth aware global register allocation

POPL '03 Proceedings of the 30th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Automatic Intra-Register Vectorization for the Intel® Architecture

International Journal of Parallel Programming
Measuring the Performance of Multimedia Instruction Sets

IEEE Transactions on Computers
Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Increasing and Detecting Memory Address Congruence

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
A Representation for Bit Section Based Analysis and Optimization

CC '02 Proceedings of the 11th International Conference on Compiler Construction
Data Compression Transformations for Dynamically Allocated Data Structures

CC '02 Proceedings of the 11th International Conference on Compiler Construction
Macro Extension for SIMD Processing

Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
Cg: a system for programming graphics hardware in a C-like language

ACM SIGGRAPH 2003 Papers
Vectorization for SIMD architectures with alignment constraints

Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Speculative software management of datapath-width for energy optimization

Proceedings of the 2004 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Compiler based exploration of DSP energy savings by SIMD operations

Proceedings of the 2004 Asia and South Pacific Design Automation Conference
An extended ANSI C for processors with a multimedia extension

International Journal of Parallel Programming
A High-Performance SIMD Floating Point Unit for BlueGene/L: Architecture, Compilation, and Algorithm Design

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Register Packing: Exploiting Narrow-Width Operands for Reducing Register File Pressure

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Predicting Unroll Factors Using Supervised Classification

Proceedings of the international symposium on Code generation and optimization
Superword-Level Parallelism in the Presence of Control Flow

Proceedings of the international symposium on Code generation and optimization
Efficient SIMD Code Generation for Runtime Alignment and Length Conversion

Proceedings of the international symposium on Code generation and optimization
Unlocking the Performance of the BlueGene/L Supercomputer

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
A Prototype Processing-In-Memory (PIM) Chip for the Data-Intensive Architecture (DIVA) System

Journal of VLSI Signal Processing Systems
An Empirical Study On the Vectorization of Multimedia Applications for Multimedia Extensions

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
An efficient way to filter out data dependences with a sufficiently large distance between memory references

ACM SIGPLAN Notices
Instruction combining for coalescing memory accesses using global code motion

MSP '04 Proceedings of the 2004 workshop on Memory system performance
Generation of permutations for SIMD processors

LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Improving superword level parallelism support in modern compilers

CODES+ISSS '05 Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
An integrated simdization framework using virtual vectors

Proceedings of the 19th annual international conference on Supercomputing
Optimizing Compiler for the CELL Processor

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Exploiting Vector Parallelism in Software Pipelined Loops

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Multi-platform Auto-vectorization

Proceedings of the International Symposium on Code Generation and Optimization
Synergistic Processing in Cell's Multicore Architecture

IEEE Micro
Optimizing data permutations for SIMD devices

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Auto-vectorization of interleaved data for SIMD

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Vector LLVA: a virtual vector instruction set for media processing

Proceedings of the 2nd international conference on Virtual execution environments
Using advanced compiler technology to exploit the performance of the Cell Broadband Engine™ architecture

IBM Systems Journal
A case for a complexity-effective, width-partitioned microarchitecture

ACM Transactions on Architecture and Code Optimization (TACO)
A new idiom recognition framework for exploiting hardware-assist instructions

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Retargetable code optimization with SIMD instructions

CODES+ISSS '06 Proceedings of the 4th international conference on Hardware/software codesign and system synthesis
Pack instruction generation for media processors using multi-valued decision diagram

CODES+ISSS '06 Proceedings of the 4th international conference on Hardware/software codesign and system synthesis
Limitations of special-purpose instructions for similarity measurements in media SIMD extensions

CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Interactive presentation: SoftSIMD - exploiting subword parallelism using source code transformations

Proceedings of the conference on Design, automation and test in Europe
On SPARC LEON-2 ISA extensions experiments for MPEG encoding acceleration

VLSI Design
Compiling for vector-thread architectures

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Versatility of extended subwords and the matrix register file

ACM Transactions on Architecture and Code Optimization (TACO)
Outer-loop vectorization: revisited for short SIMD architectures

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
A SIMD optimization framework for retargetable compilers

ACM Transactions on Architecture and Code Optimization (TACO)
Generation of Pack Instruction Sequence for Media Processors Using Multi-Valued Decision Diagram

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences
Evaluating compiler technology for control-flow optimizations for multimedia extension architectures

Microprocessors & Microsystems
A case study on compiler optimizations for the Intel® Core™ 2 duo processor

International Journal of Parallel Programming
Automatic parallelization for graphics processing units

PPPJ '09 Proceedings of the 7th International Conference on Principles and Practice of Programming in Java
Orthogonal parallel processing in vector Pascal

Computer Languages, Systems and Structures
Design and exploitation of a high-performance SIMD floating-point unit for Blue Gene/L

IBM Journal of Research and Development
Vectorization techniques for the Blue Gene/L double FPU

IBM Journal of Research and Development
MacroSS: macro-SIMDization of streaming applications

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Dependence-based code generation for a CELL processor

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
New algorithms for SIMD alignment

CC'07 Proceedings of the 16th international conference on Compiler construction
Runtime Reconfiguration of Multiprocessors Based on Compile-Time Analysis

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Automatic vector instruction selection for dynamic compilation

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Vectorization for Java

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
A new compilation technique for SIMD code generation across basic block boundaries

Proceedings of the 2010 Asia and South Pacific Design Automation Conference
Efficient Selection of Vector Instructions Using Dynamic Programming

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Symbolic crosschecking of floating-point and SIMD code

Proceedings of the sixth conference on Computer systems
Data layout transformation for stencil computations on short-vector SIMD architectures

CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Using machine learning to improve automatic vectorization

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Scheduling latency insensitive computer vision tasks

ISPA'05 Proceedings of the Third international conference on Parallel and Distributed Processing and Applications
Compiler technology for blue gene systems

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Automatically tuned FFTs for bluegene/l's double FPU

VECPAR'04 Proceedings of the 6th international conference on High Performance Computing for Computational Science
Boosting the performance of multimedia applications using SIMD instructions

CC'05 Proceedings of the 14th international conference on Compiler Construction
Overflow controlled SIMD arithmetic

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Automatic detection of saturation and clipping idioms

LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing
Efficient SIMD code generation for irregular kernels

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Extending a C-like language for portable SIMD programming

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
SIMD defragmenter: efficient ILP realization on data-parallel architectures

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Bit-sliced datapath for energy-efficient high performance microprocessors

PACS'04 Proceedings of the 4th international conference on Power-Aware Computer Systems
Enhanced bitwidth-aware register allocation

CC'06 Proceedings of the 15th international conference on Compiler Construction
Whole-function vectorization

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Vapor SIMD: Auto-vectorize once, run everywhere

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Scout: a source-to-source transformator for SIMD-Optimizations

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
A compiler framework for extracting superword level parallelism

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Dynamic trace-based analysis of vectorization potential of applications

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Extending OpenMP* with vector constructs for modern multicore SIMD architectures

IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World
A SWP specification for sequential image processing algorithms

ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Instruction selection for subword level parallelism optimizations for application specific instruction processors

ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
Parallel execution of Java loops on Graphics Processing Units

Science of Computer Programming
When polyhedral transformations meet SIMD code generation

Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
Hybrid type legalization for a sparse SIMD instruction set

ACM Transactions on Architecture and Code Optimization (TACO)
Idiom recognition framework using topological embedding

ACM Transactions on Architecture and Code Optimization (TACO)
Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Vectorization past dependent branches through speculation

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Efficient multimedia coprocessor with enhanced SIMD engines for exploiting ILP and DLP

Parallel Computing
Exploring the vectorization of python constructs using pythran and boost SIMD

Proceedings of the 2014 Workshop on Programming models for SIMD/Vector processing
Sierra: a SIMD extension for C++

Proceedings of the 2014 Workshop on Programming models for SIMD/Vector processing
Loop Transforming for Reducing Data Alignment on Multi-Core SIMD Processors

Journal of Signal Processing Systems

Abstract

Increasing focus on multimedia applications has prompted the addition of multimedia extensions to most existing general purpose microprocessors. This added functionality comes primarily with the addition of short SIMD instructions. Unfortunately, access to these instructions is limited to in-line assembly and library calls. Generally, it has been assumed that vector compilers provide the most promising means of exploiting multimedia instructions. Although vectorization technology is well understood, it is inherently complex and fragile. In addition, it is incapable of locating SIMD-style parallelism within a basic block. In this paper we introduce the concept of Superword Level Parallelism (SLP), a novel way of viewing parallelism in multimedia and scientific applications. We believe SLP is fundamentally different from the loop level parallelism exploited by traditional vector processing, and therefore demands a new method of extracting it. We have developed a simple and robust compiler for detecting SLP that targets basic blocks rather than loop nests. As with techniques designed to extract ILP, ours is able to exploit parallelism both across loop iterations and within basic blocks. The result is an algorithm that provides excellent performance in several application domains. In our experiments, dynamic instruction counts were reduced by 46%. Speedups ranged from 1.24 to 6.70.
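
The kind of parallelism the abstract describes, isomorphic and independent statements inside a single basic block, can be illustrated with a minimal sketch. This is not the paper's compiler algorithm; it only shows by hand what packing four scalar statements into one short SIMD operation looks like. The names `scalar_block`, `packed_block`, and `simd_add`, and the width `LANES = 4`, are assumptions made for illustration.

```python
from typing import List

LANES = 4  # assumed superword width: four elements per SIMD register


def scalar_block(b: List[float], c: List[float]) -> List[float]:
    """Straight-line code with no loop: four isomorphic, independent statements."""
    a = [0.0] * LANES
    a[0] = b[0] + c[0]
    a[1] = b[1] + c[1]
    a[2] = b[2] + c[2]
    a[3] = b[3] + c[3]
    return a


def simd_add(x: List[float], y: List[float]) -> List[float]:
    """Hypothetical stand-in for one short SIMD add over all lanes."""
    return [xi + yi for xi, yi in zip(x, y)]


def packed_block(b: List[float], c: List[float]) -> List[float]:
    """The same basic block after hand-packing the four statements."""
    vb = b[0:LANES]  # one wide load stands in for four scalar loads
    vc = c[0:LANES]
    return simd_add(vb, vc)  # one SIMD add stands in for four scalar adds
```

Both functions compute the same values. The packed form issues one load per operand and one add where the scalar form issues four of each, which is the sort of dynamic instruction count reduction the abstract reports in aggregate.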