Improving cache performance in dynamic applications through data and computation reorganization at run time

Authors:
Chen Ding; Ken Kennedy
Affiliations:
Computer Science Department, Rice University, Houston, TX; Computer Science Department, Rice University, Houston, TX
Venue:
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Year:
1999



Abstract

With the rapid improvement of processor speed, performance of the memory hierarchy has become the principal bottleneck for most applications. A number of compiler transformations have been developed to improve data reuse in cache and registers, thus reducing the total number of direct memory accesses in a program. Until now, however, most data reuse transformations have been static, that is, applied only at compile time. As a result, these transformations cannot be used to optimize irregular and dynamic applications, in which the data layout and data access patterns remain unknown until run time and may even change during the computation.

In this paper, we explore ways to achieve better data reuse in irregular and dynamic applications by building on the inspector-executor method used by Saltz for run-time parallelization. In particular, we present and evaluate a dynamic approach for improving both computation and data locality in irregular programs. Our results demonstrate that run-time program transformations can substantially improve computation and data locality and, despite the complexity and cost involved, a compiler can automate such transformations, eliminating much of the associated run-time overhead.
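To make the inspector-executor idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a first-touch packing policy for data and a simple sort for grouping iterations; the abstract does not state which policies the paper uses, and every function name here (`inspect_first_touch`, `pack`, `remap_pairs`, `group_iterations`, `executor`) is hypothetical. The inspector scans the run-time index pairs once, the data and indices are rewritten, and the executor then runs unchanged on the reorganized layout.

```python
def inspect_first_touch(pairs, n):
    """Inspector (assumed policy): give each element a new position in the
    order in which the index pairs first touch it."""
    new_pos = [-1] * n
    next_free = 0
    for a, b in pairs:
        for old in (a, b):
            if new_pos[old] == -1:
                new_pos[old] = next_free
                next_free += 1
    # Elements that are never referenced are placed after the touched ones.
    for old in range(n):
        if new_pos[old] == -1:
            new_pos[old] = next_free
            next_free += 1
    return new_pos


def pack(data, new_pos):
    """Data reorganization: copy each element to its new position."""
    packed = [None] * len(data)
    for old, value in enumerate(data):
        packed[new_pos[old]] = value
    return packed


def remap_pairs(pairs, new_pos):
    """Rewrite the indirection so it refers to the packed layout."""
    return [(new_pos[a], new_pos[b]) for a, b in pairs]


def group_iterations(pairs):
    """Computation reorganization (assumed policy): order iterations so
    that those touching the same low-numbered element run close together."""
    return sorted(pairs, key=lambda p: (min(p), max(p)))


def executor(data, pairs):
    """Executor: an irregular reduction over the index pairs."""
    return sum(data[a] * data[b] for a, b in pairs)


# Access pattern known only at run time (made-up example values).
data = [10, 20, 30, 40, 50, 60]
pairs = [(5, 0), (0, 3), (3, 5), (1, 4)]

new_pos = inspect_first_touch(pairs, len(data))
packed = pack(data, new_pos)
packed_pairs = group_iterations(remap_pairs(pairs, new_pos))

# The reorganized program computes the same result as the original.
assert executor(data, pairs) == executor(packed, packed_pairs)
```

In this sketch the elements used together end up adjacent in `packed`, which is the property a cache benefits from. The inspector costs one extra pass over the indices, so it only pays off when the executor runs many times between changes to the access pattern.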