Power-efficient prefetching for embedded processors

Authors:
Xiaotong Zhuang; Santosh Pande
Affiliations:
Georgia Institute of Technology, Atlanta, GA; Georgia Institute of Technology, Atlanta, GA
Venue:
ACM Transactions on Embedded Computing Systems (TECS)
Year:
2007

Citing 22
Cited 2

Compilers: principles, techniques, and tools

Compilers: principles, techniques, and tools
Embedding trees in a hypercube is NP-complete

SIAM Journal on Computing
SPAID: software prefetching in pointer- and call-intensive environments

Proceedings of the 28th annual international symposium on Microarchitecture
Cache miss heuristics and preloading techniques for general-purpose programs

Proceedings of the 28th annual international symposium on Microarchitecture
Storage assignment to decrease code size

ACM Transactions on Programming Languages and Systems (TOPLAS)
Increasing cache port efficiency for dynamic superscalar microprocessors

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Compiler-based prefetching for recursive data structures

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Algorithms for address assignment in DSP code generation

Proceedings of the 1996 IEEE/ACM international conference on Computer-aided design
Cache-conscious data placement

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Decoupling local variable accesses in a wide-issue superscalar processor

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Storage assignment optimizations to generate compact and efficient code on embedded DSPs

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Low-power memory mapping through reducing address bus activity

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Wattch: a framework for architectural-level power analysis and optimizations

Proceedings of the 27th annual international symposium on Computer architecture
Cache Memories

ACM Computing Surveys (CSUR)
Direct addressed caches for reduced power consumption

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Power protocol: reducing power dissipation on off-chip data buses

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Optimization opportunities created by global data reordering

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Storage assignment optimizations through variable coalescence for embedded processors

Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems
Stack Value File: Custom Microarchitecture for the Stack

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Power-efficient prefetching via bit-differential offset assignment on embedded processors

Proceedings of the 2004 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
MicroLib: A Case for the Quantitative Comparison of Micro-Architecture Mechanisms

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
MiBench: A free, commercially representative embedded benchmark suite

WWC '01 Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop

DRAM energy reduction by prefetching-based memory traffic clustering

Proceedings of the 21st edition of the Great Lakes Symposium on VLSI
Link-time optimization for power efficiency in a tagless instruction cache

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization

Abstract

Because of stringent power constraints, aggressive latency-hiding approaches such as prefetching are absent from state-of-the-art embedded processors. Two main reasons make prefetching power inefficient. First, compiler-inserted prefetch instructions increase code size and can therefore increase I-cache power. Second, inaccurate prefetching (especially hardware prefetching) leads to high D-cache power consumption because of useless accesses. In this work, we show that it is possible to support power-efficient prefetching through bit-differential offset assignment. We target the prefetching of relocatable stack variables with a high degree of precision. By assigning the offsets of stack variables so that most consecutive addresses differ by 1 bit, we can prefetch them with compact prefetch instructions and save I-cache power. The compiler first generates an access graph of consecutive memory references and then attempts to lay out the memory locations in the smallest hypercube. Each dimension of the hypercube represents a 1-bit differential addressing. The embedding is carried out in as compact a hypercube as possible in order to save memory space. Each load/store instruction carries a prefetch hint for the next memory reference, encoded as its differential address with respect to the current one. To reduce D-cache power cost, we further attempt to assign offsets so that most consecutive accesses map to the same cache line. Our prefetching is done using a one-entry line buffer [Wilson et al. 1996]. Consequently, many look-ups in the D-cache reduce to incremental ones. This results in D-cache activity reduction and power savings. Our prefetcher requires both compiler and hardware support. In this paper, we provide an implementation on a processor model close to ARM, with small modifications to the ISA.
We tackle issues such as out-of-order commit, predication, and speculation through simple modifications to the processor pipeline on noncritical paths. Our goal in this work is to boost performance while maintaining or lowering power consumption. Our results show a 12% speedup and a slight power reduction. The runtime virtual space loss for stack and static data is about 11.8%.
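
The offset-assignment idea in the abstract can be illustrated with a minimal sketch. It builds an access graph from a trace of stack-variable references, places the variables on the vertices of the smallest hypercube that holds them, and derives a per-access hint naming the single bit to flip to reach the next offset. The greedy placement is an illustrative stand-in chosen for brevity, not the heuristic used in the paper (the abstract does not describe it), and all function names, the word-granular offsets, and the sample trace are assumptions.

```python
from collections import Counter


def access_graph(trace):
    """Weight each unordered pair of distinct variables by how often
    the two are referenced back to back in the trace."""
    graph = Counter()
    for a, b in zip(trace, trace[1:]):
        if a != b:
            graph[frozenset((a, b))] += 1
    return graph


def assign_offsets(trace):
    """Greedy (illustrative, not the paper's) placement of variables on
    hypercube vertices; a vertex number is used as a word offset."""
    graph = access_graph(trace)
    variables = sorted(set(trace))
    # Smallest hypercube with at least one vertex per variable.
    dims = max(1, (len(variables) - 1).bit_length())
    free = set(range(1 << dims))
    offsets = {}

    # Place the most heavily connected variables first.
    weight = Counter()
    for pair, w in graph.items():
        for v in pair:
            weight[v] += w

    for var in sorted(variables, key=lambda v: (-weight[v], v)):
        def score(slot):
            # Edge weight to already placed neighbours exactly 1 bit away.
            total = 0
            for other, off in offsets.items():
                w = graph.get(frozenset((var, other)), 0)
                if w and bin(slot ^ off).count("1") == 1:
                    total += w
            return total

        best = max(sorted(free), key=score)  # ties go to the lowest slot
        offsets[var] = best
        free.remove(best)
    return offsets, dims


def prefetch_hint(cur, nxt, offsets):
    """Return the index of the single differing bit between two offsets,
    or None when they do not differ by exactly one bit."""
    diff = offsets[cur] ^ offsets[nxt]
    if diff and diff & (diff - 1) == 0:
        return diff.bit_length() - 1
    return None


# Hypothetical trace: four stack variables referenced in a cycle.
trace = ["a", "b", "c", "d", "a", "b", "c", "d"]
offsets, dims = assign_offsets(trace)
hints = [prefetch_hint(x, y, offsets) for x, y in zip(trace, trace[1:])]
```

For this trace the access graph is a 4-cycle, which fits a 2-dimensional hypercube exactly, so every consecutive pair gets a valid hint. Finding such a layout is not always possible, and a greedy pass can miss a good layout when one exists, which is why the sketch returns `None` for pairs that end up more than one bit apart.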