Deep Jam: Conversion of Coarse-Grain Parallelism to Instruction-Level and Vector Parallelism for Irregular Applications

Authors:
Patrick Carribault;Albert Cohen;William Jalby
Affiliations:
Bull SA, Les Clayes sous Bois, France;ALCHEMY Group, INRIA Futurs and LRI, Université Paris-Sud, France;LRC ITACA, CEA/DAM and Université de Versailles Saint-Quentin, France
Venue:
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Year:
2005

Citing 43
Cited 3

Software pipelining: an effective scheduling technique for VLIW machines

PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Array expansion

ICS '88 Proceedings of the 2nd international conference on Supercomputing
Efficiently computing static single assignment form and the control dependence graph

ACM Transactions on Programming Languages and Systems (TOPLAS)
Array-data flow analysis and its use in array privatization

POPL '93 Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Simultaneous multithreading: maximizing on-chip parallelism

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Maximal static expansion

POPL '98 Proceedings of the 25th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Array SSA form and its use in parallelization

POPL '98 Proceedings of the 25th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The potential of data value speculation to boost ILP

ICS '98 Proceedings of the 12th international conference on Supercomputing
Integrated predicated and speculative execution in the IMPACT EPIC architecture

Proceedings of the 25th annual international symposium on Computer architecture
Advanced compiler design and implementation

Advanced compiler design and implementation
The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization

IEEE Transactions on Parallel and Distributed Systems
Selective value prediction

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
A fast bit-vector algorithm for approximate string matching based on dynamic programming

Journal of the ACM (JACM)
CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit

Proceedings of the 27th annual international symposium on Computer architecture
Overcoming the challenges to feedback-directed optimization (Keynote Talk)

DYNAMO '00 Proceedings of the ACM SIGPLAN workshop on Dynamic and adaptive compilation and optimization
Optimizing compilers for modern architectures: a dependence-based approach

Optimizing compilers for modern architectures: a dependence-based approach
A compiler approach to fast hardware design space exploration in FPGA-based systems

PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
A scalable instruction queue design using dependence chains

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences

Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences
Adaptive Optimizing Compilers for the 21st Century

The Journal of Supercomputing
Parallel Programming with Polaris

Computer
The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs

IEEE Micro
The Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications

The Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications
The Advantages of Instance-Wise Reaching Definition Analyses in Array (S)SA

LCPC '98 Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing
Automatic Array Privatization

Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution

Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Generation of Synchronous Code for Automatic Parallelization of while Loops

Euro-Par '95 Proceedings of the First International Euro-Par Conference on Parallel Processing
Pipelining-Dovetailing: A Transformation to Enhance Software Pipelining for Nested Loops

CC '96 Proceedings of the 6th International Conference on Compiler Construction
Faster Bit-Parallel Approximate String Matching

CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching
Improving Software Pipelining With Unroll-and-Jam

HICSS '96 Proceedings of the 29th Hawaii International Conference on System Sciences Volume 1: Software Technology and Architecture
In Search of Speculative Thread-Level Parallelism

PACT '99 Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
Techniques for Software Thread Integration in Real-Time Embedded Systems

RTSS '98 Proceedings of the IEEE Real-Time Systems Symposium
Procedure Cloning and Integration for Converting Parallelism from Coarse to Fine Grain

INTERACT '03 Proceedings of the Seventh Workshop on Interaction between Compilers and Computer Architectures
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture

Proceedings of the 30th annual international symposium on Computer architecture
Single-Dimension Software Pipelining for Multi-Dimensional Loops

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Exploring the Performance Potential of Itanium® Processors with ILP-based Scheduling

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
From Sequences of Dependent Instructions to Functions: An Approach for Improving Performance without ILP or Speculation

Proceedings of the 31st annual international symposium on Computer architecture
The Value Evolution Graph and its Use in Memory Reference Analysis

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Toward kilo-instruction processors

ACM Transactions on Architecture and Code Optimization (TACO)
Complementing software pipelining with software thread integration

LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Trace Scheduling: A Technique for Global Microcode Compaction

IEEE Transactions on Computers
Branch strategies to optimize decision trees for wide-issue architectures

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Collisions of SHA-0 and reduced SHA-1

EUROCRYPT'05 Proceedings of the 24th annual international conference on Theory and Applications of Cryptographic Techniques

Reaching fast code faster: using modeling for efficient software thread integration on a VLIW DSP

CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
The polyhedral model is more widely applicable than you think

CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Software thread integration for instruction-level parallelism

ACM Transactions on Embedded Computing Systems (TECS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

A number of compute-intensive applications suffer from performance loss due to the lack of instruction-level parallelism in sequences of dependent instructions. This is particularly accurate on wide-issue architectures with large register banks, when the memory hierarchy (locality and bandwidth) is not the dominant bottleneck. We consider two real applications from computational biology and from cryptanalysis, characterized by long sequences of dependent instructions, irregular control-flow and intricate scalar and array dependence patterns. Although these applications exhibit excellent memory locality and branch-prediction behavior, state-of-the-art loop transformations and back-end optimizations are unable to exploit much instruction-level parallelism. We show that good speedups can be achieved through deep jam, a new transformation of the program control- and data-flow. Deep jam combines scalar and array renaming with a generalized form of recursive unroll-and-jam; it brings together independent instructions across irregular control structures, removing memorybased dependences. This optimization contributes to the extraction of fine-grain parallelism in irregular applications. We propose a feedback-directed deep jam algorithm, selecting a jamming strategy, function of the architecture and application charactristics.