Loop fusion is recognised as an effective transformation for improving memory hierarchy performance. However, unconstrained loop fusion can lead to poor performance because of increased register pressure and cache conflict misses. In this paper, we present a cache-conscious analytical model for profitable loop fusion. We use this model to tune fusion parameters for different architectures through empirical search. Experiments on four different platforms for a set of applications show significant speedup over fully optimised code generated by state-of-the-art commercial compilers.
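To make the transformation concrete, the following is a minimal sketch of loop fusion on two element-wise loops. It is illustrative only and is not the paper's model or benchmark code: the function names `unfused` and `fused` and the arithmetic in the loop bodies are assumptions chosen for the example.

```python
def unfused(a, b):
    """Two separate loops: each one makes a full pass over the arrays."""
    n = len(a)
    c = [0.0] * n
    d = [0.0] * n
    for i in range(n):      # first pass: reads a and b, writes c
        c[i] = a[i] + b[i]
    for i in range(n):      # second pass: re-reads c and a, writes d
        d[i] = c[i] * a[i]
    return c, d


def fused(a, b):
    """One fused loop: the value of c[i] is reused immediately for d[i]."""
    n = len(a)
    c = [0.0] * n
    d = [0.0] * n
    for i in range(n):
        t = a[i] + b[i]     # kept in a temporary instead of re-read from c
        c[i] = t
        d[i] = t * a[i]
    return c, d


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    b = [4.0, 5.0, 6.0]
    # Fusion is legal here because iteration i of the second loop depends
    # only on iteration i of the first, so both versions agree.
    print(unfused(a, b) == fused(a, b))
```

In a compiled language, the fused version touches each element of `a` and `c` once per iteration rather than in two separate passes, which is the source of the locality benefit. The same example also hints at the cost: the fused body keeps more values live at once and accesses more arrays per iteration, which is how unconstrained fusion can raise register pressure and cache conflict misses.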