Compilers: principles, techniques, and tools
Compilers: principles, techniques, and tools
Optimizing stack frame accesses for processors with restricted addressing modes
Software—Practice & Experience
Storage assignment to decrease code size
ACM Transactions on Programming Languages and Systems (TOPLAS)
Increasing cache port efficiency for dynamic superscalar microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Algorithms for address assignment in DSP code generation
Proceedings of the 1996 IEEE/ACM international conference on Computer-aided design
Decoupling local variable accesses in a wide-issue superscalar processor
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Storage assignment optimizations to generate compact and efficient code on embedded DSPs
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Spanning tree based state encoding for low power dissipation
DATE '99 Proceedings of the conference on Design, automation and test in Europe
Low-power memory mapping through reducing address bus activity
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Wattch: a framework for architectural-level power analysis and optimizations
Proceedings of the 27th annual international symposium on Computer architecture
ACM Computing Surveys (CSUR)
Direct addressed caches for reduced power consumption
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Power protocol: reducing power dissipation on off-chip data buses
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Stack Value File: Custom Microarchitecture for the Stack
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
MiBench: A free, commercially representative embedded benchmark suite
WWC '01 Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop
Power-efficient prefetching for embedded processors
ACM Transactions on Embedded Computing Systems (TECS)
PRE-BUD: Prefetching for energy-efficient parallel I/O systems with buffer disks
ACM Transactions on Storage (TOS)
Practical models for energy-efficient prefetching in mobile embedded systems
Microprocessors & Microsystems
Due to stringent power constraints, aggressive latency-hiding approaches such as prefetching are absent from state-of-the-art embedded processors. Two main factors make prefetching power inefficient. First, compiler-inserted prefetch instructions increase code size and can therefore increase I-cache power. Second, inaccurate prefetching (especially hardware prefetching) leads to high D-cache power consumption because of useless accesses. In this work, we show that it is possible to support power-efficient prefetching through bit-differential offset assignment to stack variables.

We target the prefetching of relocatable stack variables with a high degree of precision. By assigning the offsets of stack variables so that most consecutive addresses differ by 1 bit, we can prefetch them with compact prefetch instructions and save I-cache power. The compiler first generates an access graph of consecutive memory references and then attempts to lay out the memory locations in the smallest hypercube. Each dimension of the hypercube corresponds to a 1-bit address difference. The embedding uses as compact a hypercube as possible in order to save memory space. Each load/store instruction carries a prefetch hint for the next memory reference, encoded as that reference's differential address with respect to the current one. To reduce D-cache power cost, we further attempt to assign offsets so that most consecutive accesses map to the same cache line. Our prefetching is done using a one-entry line buffer [1]. As a consequence, many look-ups in the D-cache reduce to incremental ones. This results in reduced D-cache activity and power savings.

Our prefetching requires both compiler and hardware support. In this paper, we provide an implementation on the ARM processor with a small modification to the ARM ISA.
We handle out-of-order commit, predication, and speculation through simple modifications to the processor pipeline on non-critical paths. Our goal in this work is to boost performance while maintaining or lowering power consumption. Our results show a 12% speed-up and slightly lower power consumption.
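
The offset assignment described in the abstract can be illustrated with a small sketch. The abstract does not give the embedding algorithm, so the greedy heuristic below is only an assumed stand-in, and the names `access_graph`, `assign_offsets`, and `prefetch_hint` are hypothetical. Offsets are treated as slot indices rather than byte addresses, and the same-cache-line objective is not modelled.

```python
from collections import Counter


def access_graph(trace):
    """Count how often each unordered pair of variables is accessed back to back."""
    edges = Counter()
    for a, b in zip(trace, trace[1:]):
        if a != b:
            edges[frozenset((a, b))] += 1
    return edges


def assign_offsets(trace):
    """Greedy placement (an assumed heuristic, not the paper's algorithm).

    Variables are placed on the corners of the smallest hypercube that can
    hold them, preferring slots that are 1 bit away from already placed
    neighbours in the access graph.
    """
    variables = sorted(set(trace))
    dims = max(1, (len(variables) - 1).bit_length())
    edges = access_graph(trace)

    # Total edge weight per variable decides the placement order.
    weight = Counter()
    for pair, w in edges.items():
        for v in pair:
            weight[v] += w
    order = sorted(variables, key=lambda v: (-weight[v], v))

    offset = {}
    free = set(range(1 << dims))
    for v in order:
        def score(slot):
            # Sum the weights of edges to placed neighbours at Hamming distance 1.
            total = 0
            for pair, w in edges.items():
                if v not in pair:
                    continue
                (u,) = pair - {v}
                if u in offset and bin(offset[u] ^ slot).count("1") == 1:
                    total += w
            return total

        best = max(sorted(free), key=score)  # ties go to the lowest slot
        offset[v] = best
        free.remove(best)
    return offset, dims


def prefetch_hint(cur, nxt):
    """Return the index of the single differing bit, or None if no 1-bit hint exists."""
    diff = cur ^ nxt
    if diff and diff & (diff - 1) == 0:
        return diff.bit_length() - 1
    return None


if __name__ == "__main__":
    trace = ["a", "b", "c", "d", "a", "b", "c", "d"]
    offset, dims = assign_offsets(trace)
    for x, y in zip(trace, trace[1:]):
        print(x, "->", y, "hint bit:", prefetch_hint(offset[x], offset[y]))
```

In this sketch a hint needs only enough bits to name one hypercube dimension, which is the sense in which a bit-differential hint is more compact than a full offset. A greedy placement gives no guarantee that every consecutive pair ends up 1 bit apart; pairs for which `prefetch_hint` returns `None` would simply carry no hint.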