Exploiting Vector Parallelism in Software Pipelined Loops

Authors:
Samuel Larsen;Rodric Rabbah;Saman Amarasinghe
Affiliations:
MIT;MIT;MIT
Venue:
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2005

Abstract

An emerging trend in processor design is the addition of short vector instructions to general-purpose and embedded ISAs. Frequently, these extensions are employed using traditional vectorization technology first developed for supercomputers. In contrast, scalar hardware is typically targeted using ILP techniques such as software pipelining. This paper presents a novel approach for exploiting vector parallelism in software pipelined loops. Our approach results in better resource utilization and allows for software pipelining with shorter initiation intervals. The proposed optimization is applied in the compiler backend, where vectorization decisions are more amenable to cost analysis. This is unique in that traditional vectorization optimizations are usually carried out at the statement level. Although our technique most naturally complements statically scheduled machines, we believe it is applicable to any architecture that tightly integrates support for instruction- and data-level parallelism. We evaluate our methodology using nine SPEC FP benchmarks. In comparison to software pipelining, our approach achieves a maximum speedup of 1.38x, with an average of 1.11x.
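
To make the claim about shorter initiation intervals concrete, the following is a minimal sketch and not the algorithm from the paper. It uses the standard resource-constrained lower bound on the initiation interval from modulo scheduling (the busiest resource class determines the bound) and searches for the split between scalar and vector operations that minimizes it. The machine parameters, the function names `res_mii` and `best_split`, and the assumptions that every operation is vectorizable and that scalar/vector operand communication is free are all hypothetical simplifications made for illustration.

```python
from math import ceil


def res_mii(scalar_ops, vector_ops, scalar_units, vector_units):
    """Resource-constrained lower bound on the initiation interval.

    Each resource class needs at least ceil(ops / units) cycles; the
    busiest class sets the bound. Assumes every operation occupies one
    unit for one cycle (a simplification).
    """
    return max(ceil(scalar_ops / scalar_units),
               ceil(vector_ops / vector_units))


def best_split(n_ops, width, scalar_units, vector_units):
    """Try vectorizing k of the n_ops operations in the loop body.

    The loop is assumed to be unrolled by the vector width, so one
    scheduled iteration covers `width` original iterations:
      - each non-vectorized operation becomes `width` scalar operations
      - each vectorized operation becomes a single vector operation
    Returns (lowest bound, smallest k that achieves it).
    Communication between scalar and vector registers is ignored here.
    """
    candidates = []
    for k in range(n_ops + 1):
        ii = res_mii((n_ops - k) * width, k, scalar_units, vector_units)
        candidates.append((ii, k))
    return min(candidates)


# Hypothetical machine: 3 scalar units, 1 vector unit of width 2.
# Hypothetical loop body: 6 operations per original iteration.
print(res_mii(12, 0, 3, 1))    # all scalar: 4 cycles per 2 iterations
print(res_mii(0, 6, 3, 1))     # all vector: 6 cycles per 2 iterations
print(best_split(6, 2, 3, 1))  # mixed: (3, 2)
```

In this toy configuration, vectorizing everything is worse than plain software pipelining because the single vector unit becomes the bottleneck, while vectorizing only part of the loop body balances the two resource classes and yields a lower bound than either extreme. This is the kind of trade-off that a backend cost analysis can weigh; the numbers here are illustrative only and are unrelated to the speedups reported in the abstract.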