Compilation for a high-performance systolic array
SIGPLAN '86 Proceedings of the 1986 SIGPLAN symposium on Compiler construction
Highly concurrent scalar processing
Highly concurrent scalar processing
URPR—An extension of URCR for software pipelining
MICRO 19 Proceedings of the 19th annual workshop on Microprogramming
A study of scalar compilation techniques for pipelined supercomputers
ASPLOS II Proceedings of the second international conference on Architectual support for programming languages and operating systems
A VLIW architecture for a trace scheduling compiler
ASPLOS II Proceedings of the second international conference on Architectual support for programming languages and operating systems
The warp computer: Architecture, implementation, and performance
IEEE Transactions on Computers
Compiler optimizations for asynchronous systolic array programs
POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Lifetime-sensitive modulo scheduling
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
A novel framework of register allocation for software pipelining
POPL '93 Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The multiflow trace scheduling compiler
The Journal of Supercomputing - Special issue on instruction-level parallelism
Minimizing register requirements under resource-constrained rate-optimal software pipelining
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Improving the ratio of memory operations to floating-point operations in loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
Software pipelining showdown: optimal vs. heuristic methods in a production compiler
PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Combining loop transformations considering caches and scheduling
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
A compilation technique for software pipelining of loops with conditional jumps
MICRO 20 Proceedings of the 20th annual workshop on Microprogramming
GURPR—a method for global software pipelining
MICRO 20 Proceedings of the 20th annual workshop on Microprogramming
Communications of the ACM
Parallel processing: a smart compiler and a dumb machine
SIGPLAN '84 Proceedings of the 1984 SIGPLAN symposium on Compiler construction
A Fortran compiler for the FPS-164 scientific computer
SIGPLAN '84 Proceedings of the 1984 SIGPLAN symposium on Compiler construction
Dependence graphs and compiler optimizations
POPL '81 Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Computers and Intractability: A Guide to the Theory of NP-Completeness
Computers and Intractability: A Guide to the Theory of NP-Completeness
Perfect Pipelining: A New Loop Parallelization Technique
ESOP '88 Proceedings of the 2nd European Symposium on Programming
Fine-Grain Scheduling under Resource Constraints
LCPC '94 Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing
MICRO 14 Proceedings of the 14th annual workshop on Microprogramming
Global optimization of microprograms through modular control constructs
MICRO 12 Proceedings of the 12th annual workshop on Microprogramming
Improving the throughput of a pipeline by insertion of delays
ISCA '76 Proceedings of the 3rd annual symposium on Computer architecture
An improvement of trace scheduling for global microcode compaction
MICRO 17 Proceedings of the 17th annual workshop on Microprogramming
The optimization of horizontal microcode within and beyond basic blocks: an application of processor scheduling with resources
Bulldog: a compiler for vliw architectures (parallel computing, reduced-instruction-set, trace scheduling, scientific)
A systolic array optimizing compiler
A systolic array optimizing compiler
On minimizing register usage of linearly scheduled algorithms with uniform dependencies
Computer Languages, Systems and Structures
Integrated modulo scheduling and cluster assignment for TI TMS320C64x+ architecture
Proceedings of the 11th Workshop on Optimizations for DSP and Embedded Systems
Hi-index | 0.00 |
The basic idea behind software pipelining was first developed by Patel and Davidson for scheduling hardware pipe-lines. As instruction-level parallelism made its way into general-purpose computing, it became necessary to automate scheduling. How and whether instructions can be scheduled statically have major ramifications on the design of computer architectures. Rau and Glaeser were the first to use software pipelining in a compiler for a machine with specialized hardware designed to support software pipelining. In the meantime, trace scheduling was touted to be the scheduling technique of choice for VLIW (Very Long Instruction Word) machines. The most important contribution from this paper is to show that software pipelining is effective on VLIW machines without complicated hardware support. Our understanding of software pipelining subsequently deepened with the work of many others. And today, software pipelining is used in all advanced compilers for machines with instruction-level parallelism, none of which, except the Intel Itanium, relies on any specialized support for software pipelining.This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code.This paper extends previous results of software pipelining in two ways: First, this paper shows that by using an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. It also diminishes the start-up cost of loops with small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement be obtained.The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used for developing a large number of applications in the areas of image, signal and scientific processing.