Strategies for cache and local memory management by global program transformation
Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Improving the ratio of memory operations to floating-point operations in loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
The Omega Library interface guide
The Omega Library interface guide
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
An analysis of dag-consistent distributed shared-memory algorithms
Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures
Data-centric multi-level blocking
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Iteration space slicing and its application to communication optimization
ICS '97 Proceedings of the 11th international conference on Supercomputing
Compiler blockability of dense matrix factorizations
ACM Transactions on Mathematical Software (TOMS)
Using integer sets for data-parallel program analysis and optimization
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Recursion leads to automatic variable blocking for dense linear-algebra algorithms
IBM Journal of Research and Development
New tiling techniques to improve cache temporal locality
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Improving memory hierarchy performance for irregular applications
ICS '99 Proceedings of the 13th international conference on Supercomputing
Nonlinear array layouts for hierarchical memory systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
Architecture-cognizant divide and conquer algorithms
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Hierarchical tiling for improved superscalar performance
IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
Blocking Linear Algebra Codes for Memory Hierarchies
Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing
Code generation for multiple mappings
FRONTIERS '95 Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation (Frontiers'95)
A study of instruction cache organizations and replacement policies
ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
Space-limited procedures: a methodology for portable high-performance
PMMP '95 Proceedings of the conference on Programming Models for Massively Parallel Computers
Fine-grained analysis of array computations
Fine-grained analysis of array computations
Tiling optimizations for 3D scientific computations
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Language support for Morton-order matrices
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
A locality-preserving cache-oblivious dynamic dictionary
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Synthesizing Transformations for Locality Enhancement of Imperfectly-Nested Loop Nests
International Journal of Parallel Programming
Increasing temporal locality with skewing and recursive blocking
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Achieving Scalable Locality with Time Skewing
International Journal of Parallel Programming
Improving Effective Bandwidth through Compiler Enhancement of Global Cache Reuse
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Tiling, Block Data Layout, and Memory Hierarchy Performance
IEEE Transactions on Parallel and Distributed Systems
Transforming Complex Loop Nests for Locality
The Journal of Supercomputing
Single Assignment C: efficient support for high-level array operations in a functional setting
Journal of Functional Programming
Improving effective bandwidth through compiler enhancement of global cache reuse
Journal of Parallel and Distributed Computing
The Potential of Computation Regrouping for Improving Locality
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Statistical Models for Empirical Search-Based Performance Tuning
International Journal of High Performance Computing Applications
Improving Memory Hierarchy Performance through Combined Loop Interchange and Multi-Level Fusion
International Journal of High Performance Computing Applications
A hierarchical model of data locality
Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Analyzing data reuse for cache reconfiguration
ACM Transactions on Embedded Computing Systems (TECS)
Program locality analysis using reuse distance
ACM Transactions on Programming Languages and Systems (TOPLAS)
Algorithms for memory hierarchies: advanced lectures
Algorithms for memory hierarchies: advanced lectures
Partool: a feedback-directed parallelizer
APPT'11 Proceedings of the 9th international conference on Advanced parallel processing technologies
Optimizing matrix multiplication with a classifier learning system
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
JuliusC: a practical approach for the analysis of divide-and-conquer algorithms
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Hi-index | 0.00 |
Recently, there have been several experimental and theoretical results showing significant performance benefits of recursive algorithms on bothmulti-level memory hierarchies and on shared-memory systems. In particular, such algorithms have the data reuse characteristics of a blocked algorithm that is simultaneously blocked at many different levels. Most existing applications, however, are written using ordinary loops. We present a new compiler transformation that can be used to convert loop nests into recursive form automatically. We show that the algorithm is fast and effective, handling loop nests with arbitrary nesting and control flow. The transformation achieves substantial performance improvements for several linear algebra codes even on a current system with a two level cache hierarchy. As a side-effect of this work, we also develop an improved algorithm for transitive dependence analysis (a powerful technique used in the recursion transformation and other loop transformations)that is much faster than the best previously known algorithm in practice.