Unroll-and-jam using uniformly generated sets

Authors:
Steve Carr; Yiping Guan
Affiliations:
Department of Computer Science, Michigan Technological University, Houghton MI; Shafi Inc., 3637 Old US 23 Ste. 300, Brighton MI
Venue:
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Year:
1997

Abstract

Modern architectural trends in instruction-level parallelism (ILP) are significantly increasing the computational power of microprocessors. As a result, the demands on memory have increased. Unfortunately, memory systems have not kept pace, and even hierarchical cache structures are ineffective if programs do not exhibit cache locality. Consequently, compilers must not only find ILP to utilize machine resources effectively, but also ensure that the resulting code has a high degree of cache locality. One transformation that is essential for a compiler to meet both objectives is unroll-and-jam, or outer-loop unrolling. Previous work has either used a dependence-based model to compute unroll amounts, which significantly increases the size of the dependence graph, or applied a more brute-force technique. In this paper, we present an algorithm that uses a linear-algebra-based technique to compute unroll amounts. Across our benchmark suite, this technique reduces the total number of dependences needed by 84% relative to dependence-based techniques. Additionally, it loses no optimization performance relative to previous techniques and yields a more elegant solution.
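For readers unfamiliar with the transformation named in the abstract, the following is a minimal sketch of unroll-and-jam applied by hand to a matrix-vector loop nest. It is not the paper's algorithm: the unroll amount of 2 is fixed arbitrarily here, whereas the paper is concerned with computing such amounts, and the function names `baseline` and `unroll_and_jam_by_2` are hypothetical. Python is used only to show the loop restructuring; the benefit (one load of `b[j]` reused from a register by two multiply-adds) appears in compiled code.

```python
def baseline(a, b, n, m):
    """Original nest: b[j] is re-read once for every outer iteration i."""
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            y[i] += a[i][j] * b[j]
    return y


def unroll_and_jam_by_2(a, b, n, m):
    """Outer loop unrolled by 2, with the two inner-loop copies fused (jammed).

    The unroll amount 2 is an illustrative assumption, not a computed value.
    """
    y = [0.0] * n
    main = n - (n % 2)  # largest even bound covered by the unrolled loop
    for i in range(0, main, 2):
        for j in range(m):
            bj = b[j]  # single read now feeds two multiply-adds
            y[i] += a[i][j] * bj
            y[i + 1] += a[i + 1][j] * bj
    # Cleanup loop for the leftover outer iteration when n is odd.
    for i in range(main, n):
        for j in range(m):
            y[i] += a[i][j] * b[j]
    return y
```

Each `y[i]` still accumulates its terms in the same `j` order, so both versions produce identical results; only the number of reads of `b[j]` per unit of arithmetic changes.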