Fast greedy weighted fusion

Authors:
Ken Kennedy
Affiliations:
Center for High Performance Software, Rice University
Venue:
Proceedings of the 14th international conference on Supercomputing
Year:
2000

Citing 14
Cited 24

Automatic decomposition of scientific programs for parallel execution

POPL '87 Proceedings of the 14th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
Improving register allocation for subscripted variables

PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
Optimization of array accesses by collective loop transformations

ICS '91 Proceedings of the 5th international conference on Supercomputing
Vector Register Allocation

IEEE Transactions on Computers
Memory-hierarchy management

Memory-hierarchy management
The Design and Analysis of Computer Algorithms

The Design and Analysis of Computer Algorithms
The Classification, Fusion, and Parallelization of Array Language Primitives

IEEE Transactions on Parallel and Distributed Systems
Collective Loop Fusion for Array Contraction

Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing
Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution

Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Stream processing

LFP '84 Proceedings of the 1984 ACM Symposium on LISP and functional programming
On the Complexity of Loop Fusion

PACT '99 Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
The Memory Bandwidth Bottleneck and its Amelioration by a Compiler

IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Dependence analysis for subscripted variables and its application to program transformations

Dependence analysis for subscripted variables and its application to program transformations
Improving effective bandwidth through compiler enhancement of global and dynamic cache reuse

Improving effective bandwidth through compiler enhancement of global and dynamic cache reuse

Loop optimization for a class of memory-constrained computations

ICS '01 Proceedings of the 15th international conference on Supercomputing
Space-time trade-off optimization for a class of electronic structure calculations

PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Loop Restructuring for Data I/O Minimization on Limited On-Chip Memory Embedded Processors

IEEE Transactions on Computers
Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization

HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
Improving Effective Bandwidth through Compiler Enhancement of Global Cache Reuse

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Transforming Complex Loop Nests for Locality

The Journal of Supercomputing
Improving effective bandwidth through compiler enhancement of global cache reuse

Journal of Parallel and Distributed Computing
Improving Data Locality by Array Contraction

IEEE Transactions on Computers
Automatic tiling of iterative stencil loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
The Potential of Computation Regrouping for Improving Locality

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
A case for a working-set-based memory hierarchy

Proceedings of the 2nd conference on Computing frontiers
Automatic blocking of QR and LU factorizations for locality

MSP '04 Proceedings of the 2004 workshop on Memory system performance
Improving Memory Hierarchy Performance through Combined Loop Interchange and Multi-Level Fusion

International Journal of High Performance Computing Applications
Integrated Loop Optimizations for Data Locality Enhancement of Tensor Contraction Expressions

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Profitable loop fusion and tiling using model-driven empirical search

Proceedings of the 20th annual international conference on Supercomputing
MPSoC memory optimization using program transformation

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Forma: A framework for safe automatic array reshaping

ACM Transactions on Programming Languages and Systems (TOPLAS)
A compiler framework for general memory layout optimizations targeting structures

Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
Model-guided empirical tuning of loop fusion

International Journal of High Performance Systems Architecture
A model for fusion and code motion in an automatic parallelizing compiler

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
DESOLA: An active linear algebra library using delayed evaluation and runtime code generation

Science of Computer Programming
A cache-conscious profitability model for empirical tuning of loop fusion

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Efficient search-space pruning for integrated fusion and tiling transformations

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Iterative collective loop fusion

CC'06 Proceedings of the 15th international conference on Compiler Construction

Quantified Score

Hi-index	0.01

Visualization

Abstract

Loop fusion is important to optimizing compilers because it is an important tool in managing the memory hierarchy. By fusing loops that use the same data elements, we can reduce the distance between accesses to the same datum and avoid costly cache misses. Unfortunately the problem of optimal loop fusion for reuse has been shown to be NP-hard, so compilers must resort to heuristics to avoid unreasonably long compile times. Greedy strategies are often excellent heuristics that produce high-quality solutions quickly. We present an algorithm for greedy weighted fusion, in which the heaviest edge (the one with the most reuse) is selected for possible fusion on each step. The algorithm is shown to be fast in the sense that it takes O(V(E+V)) time, which is arguably optimal for this problem.