Improving Software Pipelining With Unroll-and-Jam

Authors:
Steve Carr;Chen Ding;Philip Sweany
Affiliations:
-;-;-
Venue:
HICSS '96 Proceedings of the 29th Hawaii International Conference on System Sciences Volume 1: Software Technology and Architecture
Year:
1996

Citing 0
Cited 26

Compiler blockability of dense matrix factorizations

ACM Transactions on Mathematical Software (TOMS)
Optimizing Loop Performance for Clustered VLIW Architectures

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Compiler-directed instruction cache leakage optimization

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
A compiler approach for reducing data cache energy

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Exploiting program hotspots and code sequentiality for instruction cache leakage management

Proceedings of the 2003 international symposium on Low power electronics and design
Parallel Processing of First Order Linear Recurrence on SMP Machines

The Journal of Supercomputing
Single-Dimension Software Pipelining for Multi-Dimensional Loops

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Reducing instruction cache energy consumption using a compiler-based strategy

ACM Transactions on Architecture and Code Optimization (TACO)
Register allocation for software pipelined multi-dimensional loops

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Complementing software pipelining with software thread integration

LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Deep Jam: Conversion of Coarse-Grain Parallelism to Instruction-Level and Vector Parallelism for Irregular Applications

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies

International Journal of Parallel Programming
Reaching fast code faster: using modeling for efficient software thread integration on a VLIW DSP

CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Single-dimension software pipelining for multidimensional loops

ACM Transactions on Architecture and Code Optimization (TACO)
Scheduling of Iterative Algorithms with Matrix Operations for Efficient FPGA Design--Implementation of Finite Interval Constant Modulus Algorithm

Journal of VLSI Signal Processing Systems
MPSoC memory optimization using program transformation

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Code-size conscious pipelining of imperfectly nested loops

MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
Register allocation for software pipelined multidimensional loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
Software Pipelining in Nested Loops with Prolog-Epilog Merging

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Patterns of optimized loops

Proceedings of the 2010 Workshop on Parallel Programming Patterns
Efficient parallel stencil convolution in Haskell

Proceedings of the 4th ACM symposium on Haskell
Register pressure in software-pipelined loop nests: fast computation and impact on architecture design

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Analytic models and empirical search: a hybrid approach to code optimization

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Multi-dimensional kernel generation for loop nest software pipelining

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Guiding parallel array fusion with indexed types

Proceedings of the 2012 Haskell Symposium
Software thread integration for instruction-level parallelism

ACM Transactions on Embedded Computing Systems (TECS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

To take advantage of recent architectural improvements in microprocessors, advanced compiler optimizations such as software pipelining have been developed. Unfortunately, not all loops have enough parallelism in the innermost loop body to take advantage of all of the resources a machine provides. Unroll-and-jam is a transformation that can be used to increase the amount of parallelism in the innermost loop body by making better use of resources and limiting the effects of recurrences. In this paper, we demonstrate how unroll-and-jam can significantly improve the initiation interval in a software-pipelined loop. Improvements in the initiation interval of greater than 40% are common, while dramatic improvements of a factor of 5 are possible.