Reaching fast code faster: using modeling for efficient software thread integration on a VLIW DSP

Authors:
Won So;Alexander G. Dean
Affiliations:
North Carolina State University, Raleigh, NC;North Carolina State University, Raleigh, NC
Venue:
CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Year:
2006

Citing 26
Cited 1

Estimating interlock and improving balance for pipelined architectures

Journal of Parallel and Distributed Computing
Compiling C for vectorization, parallelization, and inline expansion

PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Software pipelining: an effective scheduling technique for VLIW machines

PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Inline function expansion for compiling C programs

PLDI '89 Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation
Region Scheduling: An Approach for Detecting and Redistributing Parallelism

IEEE Transactions on Software Engineering
Improving register allocation for subscripted variables

PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
Subprogram Inlining: A Study of its Effects on Program Execution Time

IEEE Transactions on Software Engineering
Effective compiler support for predicated execution using the hyperblock

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Enhanced region scheduling on a program dependence graph

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Enhanced modulo scheduling for loops with conditional branches

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Reverse If-Conversion

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
The superblock: an effective technique for VLIW and superscalar compilation

The Journal of Supercomputing - Special issue on instruction-level parallelism
Region-based compilation: an introduction and motivation

Proceedings of the 28th annual international symposium on Microarchitecture
Unroll-and-jam using uniformly generated sets

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Loop fusion for clustered VLIW architectures

Proceedings of the joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems
Conversion of control dependence to data dependence

POPL '83 Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
Optimizing Loop Performance for Clustered VLIW Architectures

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Improving Software Pipelining With Unroll-and-Jam

HICSS '96 Proceedings of the 29th Hawaii International Conference on System Sciences Volume 1: Software Technology and Architecture
Efficient code generation for horizontal architectures: Compiler techniques and architectural support

ISCA '82 Proceedings of the 9th annual symposium on Computer Architecture
Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing

MICRO 14 Proceedings of the 14th annual workshop on Microprogramming
Treegion Scheduling for Wide Issue Processors

HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
Procedure Cloning and Integration for Converting Parallelism from Coarse to Fine Grain

INTERACT '03 Proceedings of the Seventh Workshop on Interaction between Compilers and Computer Architectures
Loop Quantization: an Analysis and Algorithm

Loop Quantization: an Analysis and Algorithm
Complementing software pipelining with software thread integration

LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Deep Jam: Conversion of Coarse-Grain Parallelism to Instruction-Level and Vector Parallelism for Irregular Applications

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
MiBench: A free, commercially representative embedded benchmark suite

WWC '01 Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop

Software thread integration for instruction-level parallelism

ACM Transactions on Embedded Computing Systems (TECS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

When integrating software threads together to boost performance on a processor with instruction-level parallel processing support, it is rarely clear which code regions should be aligned and integrated, and which regions should be left alone. This problem grows even worse on a modern VLIW DSP due to complicating factors in both the hardware and compiler: software pipelining, predication, branch delay slots, load delay slots and limited resources. As a result, finding an effective integration strategy requires extensive iteration through the integrate/compile/analyze sequence.In this paper we introduce methods to quantitatively estimate the performance benefit from the integration of multiple software threads. We use resource modeling, consider register pressure and compensate for compiler optimizations. This enables different scenarios to be compared and ranked. We then use these estimates to guide integration by concentrating on the most beneficial scenario. Information from each iteration of compilation is used to update the rankings of scenarios. We find that our modeling methods combined with limited compilation quickly find the best integration scenario without requiring exhaustive integration.