Improving Cache Locality by a Combination of Loop and Data Transformations

Authors:
Mahmut Kandemir; J. Ramanujam; Alok Choudhary
Affiliations:
Syracuse Univ., Syracuse, NY; Louisiana State Univ., Baton Rouge, LA; Northwestern Univ., Evanston, IL
Venue:
IEEE Transactions on Computers - Special issue on cache memory and related problems
Year:
1999

Abstract

Exploiting locality of reference is key to realizing high levels of performance on modern processors. This paper describes a compiler algorithm for optimizing cache locality in scientific codes on uniprocessor and multiprocessor machines. A distinctive characteristic of our algorithm is that it considers loop and data layout transformations in a unified framework. Our approach is very effective at reducing cache misses and can optimize some nests for which optimization techniques based on loop transformations alone are not successful. An important special case is one in which data layouts of some arrays are fixed and cannot be changed. We show how our algorithm can accommodate this case and demonstrate how it can be used to optimize multiple loop nests. Experiments on several benchmarks show that the techniques presented in this paper result in substantial improvement in cache performance.
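
To make the abstract's central claim concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a hypothetical two-array nest, A[i][j] += B[j][i], and a deliberately crude cache model that holds one line per array; the names `address` and `count_misses`, the array size, and the line size are all illustrative assumptions. With both arrays stored row-major, either loop order leaves one array with a large stride, so loop interchange alone does not help. Changing only the layout of B to column-major gives both references unit stride.

```python
def address(i, j, n, layout):
    """Linearized offset of element (i, j) in an n x n array."""
    return i * n + j if layout == "row" else j * n + i


def count_misses(n, line, loop_order, layout_a, layout_b):
    """Count misses for A[i][j] += B[j][i] under a toy model that
    keeps only the most recently touched cache line of each array.

    loop_order "ij" puts j innermost; "ji" puts i innermost.
    `line` is the number of array elements per cache line (assumed).
    """
    last_a = last_b = None
    misses = 0
    for outer in range(n):
        for inner in range(n):
            i, j = (outer, inner) if loop_order == "ij" else (inner, outer)
            line_a = address(i, j, n, layout_a) // line  # reference A[i][j]
            line_b = address(j, i, n, layout_b) // line  # reference B[j][i]
            if line_a != last_a:
                misses += 1
                last_a = line_a
            if line_b != last_b:
                misses += 1
                last_b = line_b
    return misses


if __name__ == "__main__":
    n, line = 64, 8
    # Original nest: A has unit stride, B has stride n.
    print(count_misses(n, line, "ij", "row", "row"))
    # Loop interchange alone: B improves, A degrades by the same amount.
    print(count_misses(n, line, "ji", "row", "row"))
    # Keep the loop order, store B column-major: both have unit stride.
    print(count_misses(n, line, "ij", "row", "col"))
```

In this toy model the first two configurations incur the same number of misses, while the third incurs far fewer, which is the kind of case where a layout change succeeds and a loop transformation alone does not. A real cache holds many lines and has associativity effects, so the absolute numbers carry no meaning beyond the comparison.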