Software thread integration for instruction-level parallelism

Authors:
Won So; Alexander G. Dean
Affiliations:
North Carolina State University; North Carolina State University
Venue:
ACM Transactions on Embedded Computing Systems (TECS)
Year:
2013



Abstract

Multimedia applications require a significantly higher level of performance than previous embedded workloads. They have driven digital signal processor (DSP) makers to adopt high-performance architectures such as VLIW (very long instruction word). Despite many efforts to exploit instruction-level parallelism (ILP) in applications, achieved speed remains a fraction of what it could be, limited by the difficulty of finding enough independent instructions to keep all of the processor's functional units busy. This article proposes Software Thread Integration (STI) for instruction-level parallelism. STI is a software technique for interleaving multiple threads of control into a single implicitly multithreaded one. We use STI to improve performance on ILP processors by merging parallel procedures into one, which widens the compiler's scope and allows it to create a more efficient instruction schedule. Assuming the parallel procedures are given, we define a methodology for finding the best-performing integrated procedure with minimum compilation time. We quantitatively estimate the performance impact of integration, allowing various integration scenarios to be compared and ranked via profitability analysis. During integration of threads, different ILP-improving code transformations are selectively applied according to the control structure and the ILP characteristics of the code, driven by interactions with software pipelining. The estimated profitability is verified and corrected by an iterative compilation approach, which compensates for possible estimation inaccuracy. Our modeling methods, combined with limited compilation, quickly find the best integration scenario without requiring exhaustive integration.
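As a rough illustration of what "merging parallel procedures into one" can mean at the source level, the sketch below jams the loops of two independent procedures into a single loop body, with cleanup loops for unequal trip counts. It is a minimal sketch under stated assumptions, not the transformation set used in the article: the procedures `dot` and `checksum`, the merged `dot_checksum_integrated`, and all parameter names are hypothetical, and simple loop jamming stands in for whatever transformations the authors actually select.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical procedure A: multiply-accumulate over two arrays. */
int32_t dot(const int16_t *x, const int16_t *y, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * y[i];
    return acc;
}

/* Hypothetical procedure B: rotate-and-XOR checksum over a byte buffer. */
uint32_t checksum(const uint8_t *buf, size_t m)
{
    uint32_t sum = 0;
    for (size_t j = 0; j < m; j++)
        sum = ((sum << 1) | (sum >> 31)) ^ buf[j];
    return sum;
}

/*
 * Hypothetical integrated procedure: both loop bodies share one loop,
 * so a scheduler sees two independent dependence chains (acc and sum)
 * in the same basic block and may be able to overlap them.
 */
void dot_checksum_integrated(const int16_t *x, const int16_t *y, size_t n,
                             const uint8_t *buf, size_t m,
                             int32_t *dot_out, uint32_t *sum_out)
{
    int32_t acc = 0;
    uint32_t sum = 0;
    size_t common = n < m ? n : m;   /* iterations both threads share */
    size_t i;

    /* Jammed loop: one iteration of A and one of B per trip. */
    for (i = 0; i < common; i++) {
        acc += (int32_t)x[i] * y[i];
        sum = ((sum << 1) | (sum >> 31)) ^ buf[i];
    }

    /* Cleanup loops: at most one of these runs, for the longer thread. */
    for (size_t k = i; k < n; k++)
        acc += (int32_t)x[k] * y[k];
    for (size_t k = i; k < m; k++)
        sum = ((sum << 1) | (sum >> 31)) ^ buf[k];

    *dot_out = acc;
    *sum_out = sum;
}
```

A caller would replace the two separate calls with one call to the integrated procedure and read both results from the output pointers. Whether this is actually faster depends on the target's functional units and register pressure, which is the kind of question the profitability analysis and iterative compilation described in the abstract are meant to answer.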