Software pipelining: an effective scheduling technique for VLIW machines
PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
ICS '88 Proceedings of the 2nd international conference on Supercomputing
Overlapped loop support in the Cydra 5
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Efficiently computing static single assignment form and the control dependence graph
ACM Transactions on Programming Languages and Systems (TOPLAS)
Array-data flow analysis and its use in array privatization
POPL '93 Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Some efficient solutions to the affine scheduling problem: I. One-dimensional time
International Journal of Parallel Programming
Iterative modulo scheduling: an algorithm for software pipelining loops
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Beyond induction variables: detecting and classifying sequences using a demand-driven SSA form
ACM Transactions on Programming Languages and Systems (TOPLAS)
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Loop Shifting for Loop Compaction
International Journal of Parallel Programming - Special issue on instruction-level parallelism and parallelizing compilation, part 2
Optimizing compilers for modern architectures: a dependence-based approach
Optimizing compilers for modern architectures: a dependence-based approach
Efficient Pipelining of Nested Loops: Unroll-and-Squash
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Optimal Software Pipelining of Nested Loops
Proceedings of the 8th International Symposium on Parallel Processing
Software Pipelining of Nested Loops
CC '01 Proceedings of the 10th International Conference on Compiler Construction
Itanium 2 Processor Microarchitecture
IEEE Micro
Phased scheduling of stream programs
Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems
Improving Software Pipelining With Unroll-and-Jam
HICSS '96 Proceedings of the 29th Hawaii International Conference on System Sciences Volume 1: Software Technology and Architecture
PACT '99 Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
Code Generation for Single-Dimension Software Pipelining of Multi-Dimensional Loops
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Single-Dimension Software Pipelining for Multi-Dimensional Loops
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies
International Journal of Parallel Programming
Automatic Correction of Loop Transformations
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Early control of register pressure for software pipelined loops
CC'03 Proceedings of the 12th international conference on Compiler construction
The Multidimensional Knapsack Problem: Structure and Algorithms
INFORMS Journal on Computing
A New Heuristic for Solving the Multichoice Multidimensional Knapsack Problem
IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans
Improving performance of nested loops on reconfigurable array processors
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Hi-index | 0.00 |
Software pipelining (or modulo scheduling) is a powerful back-end optimization to exploit instruction and vector parallelism. Software pipelining is particularly popular for embedded devices as it improves the computation throughput without increasing the size of the inner loop kernel (unlike loop unrolling), a desirable property to minimize the amount of code in local memories or caches. Unfortunately, common media and signal processing codes exhibit series of low-trip-count inner loops. In this situation, software pipelining is often not an option: it incurs severe fill/drain time overheads and code size expansion due to nested prologs and epilogs. We propose a method to pipeline series of inner loops without increasing the size of the loop nest, apart from an outermost prolog and epilog. Our method achieves significant code size savings and allows pipelining of low-trip-count loops. These benefits come at the cost of additional scheduling constraints, leading to a linear optimization problem to trade memory usage for pipelining opportunities.