Transforming loops to recursion for multi-level memory hierarchies

Authors:
Qing Yi;Vikram Adve;Ken Kennedy
Affiliations:
Rice University, Houston, TX;University of Illinois at Urbana-Champaign, Urbana, IL;Rice University, Houston, TX
Venue:
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Year:
2000

Abstract

Recently, there have been several experimental and theoretical results showing significant performance benefits of recursive algorithms on both multi-level memory hierarchies and shared-memory systems. In particular, such algorithms have the data reuse characteristics of a blocked algorithm that is simultaneously blocked at many different levels. Most existing applications, however, are written using ordinary loops. We present a new compiler transformation that can be used to convert loop nests into recursive form automatically. We show that the algorithm is fast and effective, handling loop nests with arbitrary nesting and control flow. The transformation achieves substantial performance improvements for several linear algebra codes even on a current system with a two-level cache hierarchy. As a side effect of this work, we also develop an improved algorithm for transitive dependence analysis (a powerful technique used in the recursion transformation and other loop transformations) that is much faster than the best previously known algorithm in practice.
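
To illustrate what "recursive form" of a loop nest can look like, here is a minimal hand-written sketch for matrix multiplication. It is not the compiler algorithm described in the paper; the function names, the halving strategy, and the `base` cutoff are all assumptions chosen for illustration. The recursive version repeatedly halves the largest dimension of the iteration space and falls back to the original loop nest on small sub-ranges, which is the kind of structure that yields blocking at many levels at once.

```python
def matmul_loops(A, B, n):
    """Ordinary triple loop nest: C = A * B for n x n lists of lists."""
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_recursive(A, B, n, base=2):
    """Same computation, expressed as recursion over the iteration space.

    `base` is a hypothetical cutoff (must be >= 1): once every dimension
    of the current sub-range is at most `base`, the original loop nest runs.
    """
    assert base >= 1
    C = [[0] * n for _ in range(n)]

    def rec(i0, i1, j0, j1, k0, k1):
        di, dj, dk = i1 - i0, j1 - j0, k1 - k0
        if max(di, dj, dk) <= base:
            # Base case: the original loop body on a small sub-range.
            for i in range(i0, i1):
                for j in range(j0, j1):
                    for k in range(k0, k1):
                        C[i][j] += A[i][k] * B[k][j]
            return
        # Recursive case: halve whichever dimension is currently largest.
        if di >= dj and di >= dk:
            mid = i0 + di // 2
            rec(i0, mid, j0, j1, k0, k1)
            rec(mid, i1, j0, j1, k0, k1)
        elif dj >= dk:
            mid = j0 + dj // 2
            rec(i0, i1, j0, mid, k0, k1)
            rec(i0, i1, mid, j1, k0, k1)
        else:
            mid = k0 + dk // 2
            rec(i0, i1, j0, j1, k0, mid)
            rec(i0, i1, j0, j1, mid, k1)

    rec(0, n, 0, n, 0, n)
    return C
```

Both functions perform the same set of multiply-add updates; only the order in which the iteration space is visited differs. Reordering is safe here because the updates to each `C[i][j]` are commutative sums. In general, a compiler must establish this through dependence analysis before it may apply such a transformation.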