Complementing software pipelining with software thread integration

Authors:
Won So;Alexander G. Dean
Affiliations:
North Carolina State University, Raleigh, NC;North Carolina State University, Raleigh, NC
Venue:
LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Year:
2005

Citing 31
Cited 4

The program dependence graph and its use in optimization

ACM Transactions on Programming Languages and Systems (TOPLAS)
SIGNAL: A declarative language for synchronous programming of real-time systems

Proc. of a conference on Functional programming languages and computer architecture
Estimating interlock and improving balance for pipelined architectures

Journal of Parallel and Distributed Computing
Software pipelining: an effective scheduling technique for VLIW machines

PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Enhanced modulo scheduling for loops with conditional branches

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The ESTEREL synchronous programming language: design, semantics, implementation

Science of Computer Programming
Lifetime-sensitive modulo scheduling

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Reverse If-Conversion

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Selective specialization for object-oriented languages

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Modulo scheduling with multiple initiation intervals

Proceedings of the 28th annual international symposium on Microarchitecture
Modulo scheduling of loops in control-intensive non-numeric programs

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Software pipelining loops with conditional branches

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
GURPR—a method for global software pipelining

MICRO 20 Proceedings of the 20th annual workshop on Microprogramming
Unroll-and-jam using uniformly generated sets

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Modulo scheduling for the TMS320C6x VLIW DSP architecture

Proceedings of the ACM SIGPLAN 1999 workshop on Languages, compilers, and tools for embedded systems
Loop fusion for clustered VLIW architectures

Proceedings of the joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems
Conversion of control dependence to data dependence

POPL '83 Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
A stream compiler for communication-exposed architectures

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Imagine: Media Processing with Streams

IEEE Micro
The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs

IEEE Micro
Perfect Pipelining: A New Loop Parallelization Technique

ESOP '88 Proceedings of the 2nd European Symposium on Programming
FIAT: A Framework for Interprocedural Analysis and Transfomation

Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Optimizing Loop Performance for Clustered VLIW Architectures

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
StreamIt: A Language for Streaming Applications

CC '02 Proceedings of the 11th International Conference on Compiler Construction
Interprocedural Analysis for Parallelization

LCPC '95 Proceedings of the 8th International Workshop on Languages and Compilers for Parallel Computing
Improving Software Pipelining With Unroll-and-Jam

HICSS '96 Proceedings of the 29th Hawaii International Conference on System Sciences Volume 1: Software Technology and Architecture
Techniques for Software Thread Integration in Real-Time Embedded Systems

RTSS '98 Proceedings of the IEEE Real-Time Systems Symposium
System-Level Issues for Software Thread Integration: Guest Triggering and Host Selection

RTSS '99 Proceedings of the 20th IEEE Real-Time Systems Symposium
Compiling for Fine-Grain Concurrency: Planning and Performing Software Thread Integration

RTSS '02 Proceedings of the 23rd IEEE Real-Time Systems Symposium
Procedure Cloning and Integration for Converting Parallelism from Coarse to Fine Grain

INTERACT '03 Proceedings of the Seventh Workshop on Interaction between Compilers and Computer Architectures
Compiler-Directed ILP Extraction for Clustered VLIW/EPIC Machines: Predication, Speculation and Modulo Scheduling

DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1

Deep Jam: Conversion of Coarse-Grain Parallelism to Instruction-Level and Vector Parallelism for Irregular Applications

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Reaching fast code faster: using modeling for efficient software thread integration on a VLIW DSP

CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Source level merging of independent programs

Journal of Parallel and Distributed Computing
Software thread integration for instruction-level parallelism

ACM Transactions on Embedded Computing Systems (TECS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Software pipelining is a critical optimization for producing efficient code for VLIW/EPIC and superscalar processors in high-performance embedded applications such as digital signal processing. Software thread integration (STI) can often improve the performance of looping code in cases where software pipelining performs poorly or fails. This paper examines both situations, presenting methods to determine what and when to integrate.We evaluate our methods on C-language image and digital signal processing libraries and synthetic loop kernels. We compile them for a very long instruction word (VLIW) digital signal processor (DSP) -- the Texas Instruments (TI) C64x architecture. Loops which benefit little from software pipelining (SWP-Poor) speed up by 26% (harmonic mean, HM). Loops for which software pipelining fails (SWP-Fail) due to conditionals and calls speed up by 16% (HM). Combining SWP-Good and SWP-Poor loops leads to a speedup of 55% (HM).