Strategies for cache and local memory management by global program transformation
Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Evaluating Associativity in CPU Caches
IEEE Transactions on Computers
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
SUIF: an infrastructure for research on parallelizing and optimizing compilers
ACM SIGPLAN Notices
Compiling for numa parallel machines
Compiling for numa parallel machines
Unifying data and control transformations for distributed shared-memory machines
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Data and computation transformations for multiprocessors
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Reducing false sharing on shared memory multiprocessors through compile time data transformations
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
A data cache with multiple caching strategies tuned to different types of locality
ICS '95 Proceedings of the 9th international conference on Supercomputing
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Data-centric multi-level blocking
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
A compiler algorithm for optimizing locality in loop nests
ICS '97 Proceedings of the 11th international conference on Supercomputing
Compiler blockability of dense matrix factorizations
ACM Transactions on Mathematical Software (TOMS)
The Cache-Coherence Problem in Shared-Memory Multiprocessors: Hardware Solutions
The Cache-Coherence Problem in Shared-Memory Multiprocessors: Hardware Solutions
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
False Sharing and Spatial Locality in Multiprocessor Caches
IEEE Transactions on Computers
Briki: an Optimizing Java Compiler
COMPCON '97 Proceedings of the 42nd IEEE International Computer Conference
Static Locality Analysis for Cache Management
PACT '97 Proceedings of the 1997 International Conference on Parallel Architectures and Compilation Techniques
Co-design of interleaved memory systems
CODES '00 Proceedings of the eighth international workshop on Hardware/software codesign
A preprocessing step for global loop transformations for data transfer optimization
CASES '00 Proceedings of the 2000 international conference on Compilers, architecture, and synthesis for embedded systems
Cache conscious data layout organization for embedded multimedia applications
Proceedings of the conference on Design, automation and test in Europe
Exploiting non-uniform reuse for cache optimization
Proceedings of the 2001 ACM symposium on Applied computing
An empirical evaluation of high level transformations for embedded processors
CASES '01 Proceedings of the 2001 international conference on Compilers, architecture, and synthesis for embedded systems
Data Relation Vectors: A New Abstraction for Data Optimizations
IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Efficient Representation Scheme for Multidimensional Array Operations
IEEE Transactions on Computers
Automatic Partitioning of Parallel Loops with Parallelepiped-Shaped Tiles
IEEE Transactions on Parallel and Distributed Systems
Reducing Cache Conflicts by Multi-Level Cache Partitioning and Array Elements Mapping
The Journal of Supercomputing
ACM Transactions on Design Automation of Electronic Systems (TODAES)
A Layout-Conscious Iteration Space Transformation Technique
IEEE Transactions on Computers
Array recovery and high-level transformations for DSP applications
ACM Transactions on Embedded Computing Systems (TECS)
Advanced Data Layout Optimization for Multimedia Applications
IPDPS '00 Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing
Improving cache hit ratio by extended referencing cache lines
Journal of Computing Sciences in Colleges
Predicting the impact of optimizations for embedded systems
Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems
Highly accurate and efficient evaluation of randomising set index functions
Journal of Systems Architecture: the EUROMICRO Journal
IEEE Transactions on Parallel and Distributed Systems
Instruction Scheduling for Low Power
Journal of VLSI Signal Processing Systems
Proceedings of the 1st conference on Computing frontiers
A proposal for input-sensitivity analysis of profile-driven optimizations on embedded applications
MEDEA '03 Proceedings of the 2003 workshop on MEmory performance: DEaling with Applications , systems and architecture
IEEE Transactions on Computers
A Complete Compiler Approach to Auto-Parallelizing C Programs for Multi-DSP Systems
IEEE Transactions on Parallel and Distributed Systems
A Model-Based Framework: An Approach for Profit-Driven Optimization
Proceedings of the international symposium on Code generation and optimization
Compiler-Based Approach for Exploiting Scratch-Pad in Presence of Irregular Array Access
Proceedings of the conference on Design, Automation and Test in Europe - Volume 2
Hyperplane Grouping and Pipelined Schedules: How to Execute Tiled Loops Fast on Clusters of SMPs
The Journal of Supercomputing
Exploiting Inter-Processor Data Sharing for Improving Behavior of Multi-Processor SoCs
ISVLSI '05 Proceedings of the IEEE Computer Society Annual Symposium on VLSI: New Frontiers in VLSI Design
The Journal of Supercomputing
Optimizing instruction cache performance of embedded systems
ACM Transactions on Embedded Computing Systems (TECS)
Reuse analysis of indirectly indexed arrays
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Journal of Systems and Software
An approach toward profit-driven optimization
ACM Transactions on Architecture and Code Optimization (TACO)
Journal of Systems Architecture: the EUROMICRO Journal
External memory page remapping for embedded multimedia systems
Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Systematic intermediate sequence removal for reduced memory accesses
SCOPES '07 Proceedingsof the 10th international workshop on Software & compilers for embedded systems
Locality optimization in wireless applications
CODES+ISSS '07 Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis
Fast indexing for blocked array layouts to reduce cache misses
International Journal of High Performance Computing and Networking
ILP-Based energy minimization techniques for banked memories
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Improving data locality by chunking
CC'03 Proceedings of the 12th international conference on Compiler construction
An Efficient Memory Organization for High-ILP Inner Modem Baseband SDR Processors
Journal of Signal Processing Systems
Improving MPI communication via data type fission
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Polyhedral Model Based Data Locality Optimization for Embedded Applications
GREENCOM-CPSCOM '10 Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing
Tuning blocked array layouts to exploit memory hierarchy in SMT architectures
PCI'05 Proceedings of the 10th Panhellenic conference on Advances in Informatics
Hi-index | 0.01 |
Exploiting locality of reference is key to realizing high levels of performance on modern processors. This paper describes a compiler algorithm for optimizing cache locality in scientific codes on uniprocessor and multiprocessor machines. A distinctive characteristic of our algorithm is that it considers loop and data layout transformations in a unified framework. Our approach is very effective at reducing cache misses and can optimize some nests for which optimization techniques based on loop transformations alone are not successful. An important special case is one in which data layouts of some arrays are fixed and cannot be changed. We show how our algorithm can accommodate this case and demonstrate how it can be used to optimize multiple loop nests. Experiments on several benchmarks show that the techniques presented in this paper result in substantial improvement in cache performance.