Abstract: One of the key challenges facing computer architects and compiler writers is the increasing discrepancy between processor cycle times and main memory access times. To alleviate this problem in array-intensive embedded signal and video processing applications, compilers may employ either control-centric transformations, which change the data access patterns of nested loops, or data-centric transformations, which modify the memory layouts of multidimensional arrays. Most of the memory layout optimizations proposed so far either modify the layout of each array independently or rely on explicit data reorganization at runtime. This paper focuses on a compiler technique, called array regrouping, that automatically maps multiple arrays into a single data (array) space to improve data access patterns. We present a mathematical framework that enables us to systematically derive suitable mappings for a given array-intensive embedded application. The framework divides the arrays accessed in a given program into several groups, and each group is independently layout-transformed to improve spatial locality and reduce the number of conflict misses. Compared to previous approaches, the proposed technique makes two new contributions: 1) it presents a graph-based formulation of the array regrouping problem, and 2) it demonstrates the potential benefits of this aggressive array regrouping strategy in optimizing the behavior of embedded systems. Extensive experimental results demonstrate significant improvements in cache miss rates and execution times. An important advantage of this approach over previous techniques that target conflict misses is that it reduces conflict misses without increasing the data space requirements of the application being optimized. This is a very desirable property in many embedded/portable environments, where data space requirements determine the minimum physical memory capacity.
In addition to performance-related issues, with the increased use of embedded/portable devices, improving the energy efficiency of applications is becoming critical. To develop a truly energy-efficient system, energy constraints should be taken into account early in the design process, i.e., at the source level in software compilation and at the behavioral level in hardware compilation. Source-level optimizations are particularly important in data-dominated media applications. In this paper, we also show how our array regrouping strategy increases the energy savings obtained from the multiple low-power operating modes provided by current memory modules. Using a set of array-intensive benchmarks, we observe significant savings in memory system energy.
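To make the idea of array regrouping concrete, the following is a minimal sketch, not the paper's framework. It assumes a toy direct-mapped cache (64 lines of 32 bytes, both hypothetical parameters) and two hypothetical arrays `a` and `b` of 8-byte elements that are read together in one loop. In the separate layout the base addresses are chosen so that `a[i]` and `b[i]` fall into the same cache set; in the regrouped layout the two arrays are interleaved into one array of the same total size. The function and variable names are illustrative only.

```python
def count_misses(addresses, num_lines=64, line_bytes=32):
    """Count misses in a toy direct-mapped cache for a stream of byte addresses."""
    tags = [None] * num_lines          # one resident memory block per cache line
    misses = 0
    for addr in addresses:
        block = addr // line_bytes     # memory block holding this address
        index = block % num_lines      # cache line that block maps to
        if tags[index] != block:       # line holds a different block (or nothing)
            tags[index] = block
            misses += 1
    return misses


ELEM = 8      # assumed element size in bytes
N = 1024      # assumed number of elements per array


def separate_layout(n):
    """Address stream for 'for i: use a[i], b[i]' with two separate arrays."""
    base_a = 0
    base_b = n * ELEM                  # 8192 bytes: a multiple of the 2048-byte cache,
                                       # so a[i] and b[i] map to the same cache line
    for i in range(n):
        yield base_a + i * ELEM
        yield base_b + i * ELEM


def regrouped_layout(n):
    """Same loop after regrouping: a[i] -> ab[2*i], b[i] -> ab[2*i + 1]."""
    for i in range(n):
        yield (2 * i) * ELEM
        yield (2 * i + 1) * ELEM


separate_misses = count_misses(separate_layout(N))
regrouped_misses = count_misses(regrouped_layout(N))
print(separate_misses, regrouped_misses)
```

Under these assumptions every access in the separate layout evicts the block the other array just loaded, while the interleaved layout misses only once per cache line. Both layouts occupy the same 2 * N * ELEM bytes, which mirrors the abstract's point that conflict misses are reduced without increasing the data space.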