Automatic decomposition of scientific programs for parallel execution
POPL '87 Proceedings of the 14th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
Improving register allocation for subscripted variables
PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
Optimization of array accesses by collective loop transformations
ICS '91 Proceedings of the 5th international conference on Supercomputing
IEEE Transactions on Computers
Memory-hierarchy management
The Design and Analysis of Computer Algorithms
The Design and Analysis of Computer Algorithms
The Classification, Fusion, and Parallelization of Array Language Primitives
IEEE Transactions on Parallel and Distributed Systems
Collective Loop Fusion for Array Contraction
Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing
Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution
Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
LFP '84 Proceedings of the 1984 ACM Symposium on LISP and functional programming
On the Complexity of Loop Fusion
PACT '99 Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
The Memory Bandwidth Bottleneck and its Amelioration by a Compiler
IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Dependence analysis for subscripted variables and its application to program transformations
Dependence analysis for subscripted variables and its application to program transformations
Improving effective bandwidth through compiler enhancement of global and dynamic cache reuse
Improving effective bandwidth through compiler enhancement of global and dynamic cache reuse
Loop optimization for a class of memory-constrained computations
ICS '01 Proceedings of the 15th international conference on Supercomputing
Space-time trade-off optimization for a class of electronic structure calculations
PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Loop Restructuring for Data I/O Minimization on Limited On-Chip Memory Embedded Processors
IEEE Transactions on Computers
HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
Improving Effective Bandwidth through Compiler Enhancement of Global Cache Reuse
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Transforming Complex Loop Nests for Locality
The Journal of Supercomputing
Improving effective bandwidth through compiler enhancement of global cache reuse
Journal of Parallel and Distributed Computing
Improving Data Locality by Array Contraction
IEEE Transactions on Computers
Automatic tiling of iterative stencil loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
The Potential of Computation Regrouping for Improving Locality
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
A case for a working-set-based memory hierarchy
Proceedings of the 2nd conference on Computing frontiers
Automatic blocking of QR and LU factorizations for locality
MSP '04 Proceedings of the 2004 workshop on Memory system performance
Improving Memory Hierarchy Performance through Combined Loop Interchange and Multi-Level Fusion
International Journal of High Performance Computing Applications
Integrated Loop Optimizations for Data Locality Enhancement of Tensor Contraction Expressions
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Profitable loop fusion and tiling using model-driven empirical search
Proceedings of the 20th annual international conference on Supercomputing
MPSoC memory optimization using program transformation
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Forma: A framework for safe automatic array reshaping
ACM Transactions on Programming Languages and Systems (TOPLAS)
A compiler framework for general memory layout optimizations targeting structures
Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
Model-guided empirical tuning of loop fusion
International Journal of High Performance Systems Architecture
A model for fusion and code motion in an automatic parallelizing compiler
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
DESOLA: An active linear algebra library using delayed evaluation and runtime code generation
Science of Computer Programming
A cache-conscious profitability model for empirical tuning of loop fusion
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Efficient search-space pruning for integrated fusion and tiling transformations
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Iterative collective loop fusion
CC'06 Proceedings of the 15th international conference on Compiler Construction
Hi-index | 0.01 |
Loop fusion is important to optimizing compilers because it is an important tool in managing the memory hierarchy. By fusing loops that use the same data elements, we can reduce the distance between accesses to the same datum and avoid costly cache misses. Unfortunately the problem of optimal loop fusion for reuse has been shown to be NP-hard, so compilers must resort to heuristics to avoid unreasonably long compile times. Greedy strategies are often excellent heuristics that produce high-quality solutions quickly. We present an algorithm for greedy weighted fusion, in which the heaviest edge (the one with the most reuse) is selected for possible fusion on each step. The algorithm is shown to be fast in the sense that it takes O(V(E+V)) time, which is arguably optimal for this problem.