VLIW compilation techniques in a superscalar environment

Authors:
Kemal Ebcioglu;Randy D. Groves;Ki-Chang Kim;Gabriel M. Silberman;Isaac Ziv
Affiliations:
IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY;IBM Rise Systern/6000 Division, 11400 Bumet Road, Austin, TX;IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY;IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY;IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY
Venue:
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Year:
1994

Citing 22
Cited 13

Bulldog: a compiler for VLSI architectures

Bulldog: a compiler for VLSI architectures
Compilers: principles, techniques, and tools

Compilers: principles, techniques, and tools
The program dependence graph and its use in optimization

ACM Transactions on Programming Languages and Systems (TOPLAS)
Software pipelining: an effective scheduling technique for VLIW machines

PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Limits on multiple instruction issue

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
“Combining” as a compilation technique for VLIW architectures

MICRO 22 Proceedings of the 22nd annual workshop on Microprogramming and microarchitecture
Region Scheduling: An Approach for Detecting and Redistributing Parallelism

IEEE Transactions on Software Engineering
Instruction scheduling beyond basic blocks

IBM Journal of Research and Development
A new compilation technique for parallelizing loops with unpredictable branches on a VLIW architecture

Selected papers of the second workshop on Languages and compilers for parallel computing
Circular scheduling: a new technique to perform software pipelining

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Global instruction scheduling for superscalar machines

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
IMPACT: an architectural framework for multiple-instruction-issue processors

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Avoiding unconditional jumps by code replication

PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Optimally profiling and tracing programs

POPL '92 Proceedings of the 19th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
An architectural framework for migration from CISC to higher performance platforms

ICS '92 Proceedings of the 6th international conference on Supercomputing
An efficient resource-constrained global scheduling technique for superscalar and VLIW processors

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Proving safety of speculative load instructions at compile-time

ESOP'92 Symposium proceedings on 4th European symposium on programming
The revival transformation

POPL '94 Proceedings of the 21st ACM SIGPLAN-SIGACT symposium on Principles of programming languages
A compilation technique for software pipelining of loops with conditional jumps

MICRO 20 Proceedings of the 20th annual workshop on Microprogramming
A global resource-constrained parallelization technique

ICS '89 Proceedings of the 3rd international conference on Supercomputing
The Design and Analysis of Computer Algorithms

The Design and Analysis of Computer Algorithms
Making Compaction-Based Parallelization Affordable

IEEE Transactions on Parallel and Distributed Systems

A reduced multipipeline machine description that preserves scheduling constraints

PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Source-level debugging of scalar optimized code

PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Scalable instruction-level parallelism through tree-instructions

ICS '97 Proceedings of the 11th international conference on Supercomputing
Resource-sensitive profile-directed data flow analysis for code optimization

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Parallelizing nonnumerical code with selective scheduling and software pipelining

ACM Transactions on Programming Languages and Systems (TOPLAS)
Binary translation and architecture convergence issues for IBM system/390

Proceedings of the 14th international conference on Supercomputing
A Prolog Tailoring Technique on an Epilog Tailored Procedure

PSI '02 Revised Papers from the 4th International Andrei Ershov Memorial Conference on Perspectives of System Informatics: Akademgorodok, Novosibirsk, Russia
Precise Exception Semantics in Dynamic Compilation

CC '02 Proceedings of the 11th International Conference on Compiler Construction
A compiler approach for reducing data cache energy

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Instruction Scheduling for Low Power

Journal of VLSI Signal Processing Systems
Efficient instruction scheduling for a pipelined architecture

ACM SIGPLAN Notices - Best of PLDI 1979-1999
Reducing data cache leakage energy using a compiler-based approach

ACM Transactions on Embedded Computing Systems (TECS)
How many threads to spawn during program multithreading?

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

We describe techniques for converting the intermediate code representation of a given program, as generated by a modern compiler, to another representation which produces the same run-time results, but can run faster on a superscalar machine. The algorithms, based on novel parallelization techniques for Very Long Instruction Word (VLIW) architectures, find and place together independently executable operations that may be far apart in the original code. i.e., they may be separated by many conditional branches or belong to different iterations of a loop. As a result, the functional units in the superscalar are presented with more work that can proceed in parallel, thus achieving higher performance than the approach of using hardware instruction dispatch techniques alone.While general scheduling techniques improve performance by removing idle pipeline cycles, to further improve performance on a superscalar with only a few functional units requires a reduction in the pathlength. We have designed a set of new algorithms for reducing pathlength and removing stalls due to branches, namely speculative load-store motion out of loops, unspeculation, limited combining, basic block expansion, and prolog tailoring. These algorithms were implemented in a prototype version of the IBM RS/6000 xlc compiler and have shown significant improvement in SPEC integer benchmarks on the IBM POWER machines.Also, we describe a new technique to obtain profiling information with low overhead, and some applications of profiling directed feedback, including scheduling heuristics, code reordering and branch reversal.