Loop fusion is an important compiler strategy for managing the memory hierarchy. By fusing loops that use the same data elements, a compiler can reduce the distance between accesses to the same datum and avoid costly cache misses. Unfortunately, optimal loop fusion for reuse has been shown to be NP-hard, so compilers must resort to heuristics to keep compile times reasonable. Greedy strategies are often excellent heuristics that produce high-quality solutions quickly. We present an algorithm for greedy weighted fusion, in which the heaviest edge (the one with the most reuse) is selected for possible fusion at each step. The algorithm is fast in the sense that it takes O(V(E+V)) time, which is arguably optimal for producing the greedy solution. In addition, it requires only O(E) edge-reweighting operations after fusions, which makes it suitable for the problem of enhancing cache reuse, where the ideal reweighting operation is much more complex than addition. If each reweighting operation requires no more than O(V) time, the time bound of the overall fusion process remains O(V(E+V)).
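To make the greedy step concrete, the following is a minimal, deliberately naive sketch of heaviest-edge-first fusion. It is not the algorithm of the paper and does not achieve the O(V(E+V)) bound. Several things here are assumptions made only for illustration: the function name `greedy_weighted_fusion`, the representation of the fusion graph as a dictionary of weighted dependence edges on a DAG, the set `bad` of fusion-preventing edges, the legality rule (a fusion is refused whenever an indirect path connects the two loops, since contracting the edge would then create a cycle), and the use of plain addition as the reweighting operation.

```python
from collections import defaultdict

def greedy_weighted_fusion(nodes, edges, bad=frozenset()):
    """Naive greedy fusion sketch (assumed representation, not the paper's).

    nodes: iterable of loop ids
    edges: dict {(u, v): weight} on a DAG, u executes before v
    bad:   set of (u, v) edges along which fusion is not allowed
    Returns the fused groups as a sorted list of sorted lists.
    """
    group = {n: {n} for n in nodes}   # representative -> member loops
    w = dict(edges)                   # current edge weights
    bad = set(bad)
    rejected = set()                  # edges found illegal since the last fusion

    def succs(x):
        return [b for (a, b) in w if a == x]

    def has_indirect_path(u, v):
        # True if v is reachable from u through at least one other node.
        stack = [s for s in succs(u) if s != v]
        seen = set(stack)
        while stack:
            x = stack.pop()
            if x == v:
                return True
            for s in succs(x):
                if s not in seen:
                    seen.add(s)
                    stack.append(s)
        return False

    while True:
        candidates = [e for e in w if e not in bad and e not in rejected]
        if not candidates:
            break
        u, v = max(candidates, key=lambda e: w[e])  # heaviest remaining edge
        if has_indirect_path(u, v):
            rejected.add((u, v))      # contracting would create a cycle
            continue
        # Fuse v into u and reweight: parallel edges are summed (assumption).
        merged = defaultdict(int)
        new_bad = set()
        for (a, b), wt in w.items():
            a2 = u if a == v else a
            b2 = u if b == v else b
            if a2 == b2:
                continue              # the fused edge disappears
            merged[(a2, b2)] += wt
            if (a, b) in bad:
                new_bad.add((a2, b2))
        w, bad = dict(merged), new_bad
        group[u] |= group.pop(v)
        rejected.clear()              # legality may change after a fusion

    return sorted(sorted(g) for g in group.values())
```

As a usage example under the same assumptions, `greedy_weighted_fusion("ABCD", {("A","B"): 5, ("A","C"): 3, ("B","C"): 4, ("C","D"): 1}, bad={("C","D")})` first fuses A and B along the weight-5 edge, which raises the combined edge into C to weight 7, then fuses C, and stops at the fusion-preventing edge into D. In a cache-reuse setting the summation in the reweighting step would be replaced by a more expensive computation, which is why the number of reweighting operations matters.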