The data cache in embedded systems plays the dual role of speeding up program execution and reducing power consumption. However, a hardware-only cache management scheme usually results in unsatisfactory cache utilization. In several newer architectures, cache management details are accessible at the instruction level, enabling compiler involvement for better cache performance. In particular, the Intel XScale implements a cache-locking mechanism that lets the compiler lock certain critical data in the cache, with the guarantee that the locked data will not be evicted. In such an architecture, what to lock and when to lock are the key decisions for achieving good cache performance. To this end, this paper formulates the locking decision as a 0/1 knapsack problem, which can be solved efficiently with a dynamic programming algorithm. We implemented this formulation in the MIPSpro compiler; our approach reduces both execution time and power consumption. Power and performance measurements on an XScale processor show that our method achieves better execution time than data prefetching at similar or lower power consumption.
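As a rough illustration of the knapsack formulation (a sketch, not the paper's actual implementation), assume each locking candidate has a size (the number of cache lines it would occupy) and a profit (an estimate of the misses avoided by locking it, e.g. from profiling); choosing what to lock within the lockable cache capacity is then the classic 0/1 knapsack problem, solvable with a textbook dynamic program:

```python
def lock_selection(sizes, profits, capacity):
    """0/1 knapsack DP: pick data blocks to lock within cache capacity.

    sizes[i]   -- cache lines occupied by candidate i (hypothetical units)
    profits[i] -- estimated benefit (e.g. misses avoided) of locking i
    capacity   -- number of lockable cache lines
    Returns (max total profit, sorted indices of candidates to lock).
    """
    n = len(sizes)
    # best[i][c] = max profit using the first i candidates and capacity c
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s, p = sizes[i - 1], profits[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]          # skip candidate i-1
            if s <= c and best[i - 1][c - s] + p > best[i][c]:
                best[i][c] = best[i - 1][c - s] + p  # lock candidate i-1
    # Backtrack to recover which candidates were chosen.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= sizes[i - 1]
    return best[n][capacity], sorted(chosen)

# Example: three candidates, 5 lockable lines available.
# Locking candidates 0 and 1 (sizes 2+3) yields profit 3+4 = 7,
# beating candidate 2 alone (profit 5).
profit, picks = lock_selection([2, 3, 4], [3, 4, 5], 5)
```

The DP runs in O(n x capacity) time, which is practical at compile time since both the number of candidates and the cache size are small.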