A study of scalar compilation techniques for pipelined supercomputers

Authors:
Shlomo Weiss; James E. Smith
Affiliations:
Univ. of Maryland, Baltimore; Cray Research, Inc., Chippewa Falls, WI
Venue:
ACM Transactions on Mathematical Software (TOMS)
Year:
1990


Abstract

This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Our study indicates that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler. Finally, we show that the combination of loop unrolling and dynamic software pipelining, as implemented by a decoupled computer, substantially outperforms the vector CRAY-1S.
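As a rough illustration of the two techniques named in the abstract, the following C sketch applies them by hand to a simple multiply-add loop. It is not taken from the paper: the kernel, the function names, and the unrolling factor of 4 are assumptions chosen for illustration, and the arrays are assumed not to overlap. The unrolled version exposes several independent operations per iteration for a scheduler to interleave (at the cost of more live registers and a larger loop body), while the pipelined version issues the loads for the next iteration before finishing the current one.

```c
#include <stddef.h>

/* Baseline: one multiply-add per loop iteration. */
void axpy_rolled(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Loop unrolling by a factor of 4 (the factor is an illustrative choice).
 * The four multiply-adds in the body are independent of one another,
 * so a scheduler is free to overlap their loads and arithmetic. */
void axpy_unrolled4(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        double x0 = x[i],     y0 = y[i];
        double x1 = x[i + 1], y1 = y[i + 1];
        double x2 = x[i + 2], y2 = y[i + 2];
        double x3 = x[i + 3], y3 = y[i + 3];
        y[i]     = a * x0 + y0;
        y[i + 1] = a * x1 + y1;
        y[i + 2] = a * x2 + y2;
        y[i + 3] = a * x3 + y3;
    }
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Hand-written software pipelining: the loads for iteration i + 1 are
 * issued before the compute and store of iteration i. Assumes x and y
 * do not overlap. */
void axpy_pipelined(size_t n, double a, const double *x, double *y)
{
    if (n == 0)
        return;
    double xv = x[0], yv = y[0];            /* prologue: first loads */
    for (size_t i = 0; i + 1 < n; i++) {
        double xn = x[i + 1], yn = y[i + 1]; /* loads for next iteration */
        y[i] = a * xv + yv;                  /* compute and store current */
        xv = xn;
        yv = yn;
    }
    y[n - 1] = a * xv + yv;                 /* epilogue: last compute/store */
}
```

All three functions compute the same result; they differ only in how much independent work is visible within one pass through the loop body. Calling each with the same inputs, for example n = 7 and a = 2.0, is a quick way to check that the remainder loop and the prologue/epilogue are handled correctly.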