A unified approach to eliminate memory accesses early

Authors:
Mafijul Md. Islam;Per Stenstrom
Affiliations:
Volvo Technology Corporation, Gothenburg, Sweden;Chalmers University of Technology, Gothenburg, Sweden
Venue:
CASES '11 Proceedings of the 14th international conference on Compilers, architectures and synthesis for embedded systems
Year:
2011

Abstract

This paper introduces the notion of silent loads, load accesses that can be satisfied by values already available in the physical register file, and proposes a new architectural concept to exploit them. It then unifies different approaches to eliminating memory accesses early under a single new architectural scheme. We show that this unified approach covers the previously proposed techniques that exploit forwarded and small-value loads, in addition to silent loads. Forwarded loads obtain their values through load-to-load and store-to-load forwarding, whereas small-value loads return values that can be coded in 8 bits or fewer. We find that 22%, 31%, and 24% of all dynamic loads are forwarded, small-value, and silent, respectively, and we demonstrate that the prevalence of such loads is mostly inherent in applications. We establish that a hypothetical scheme encompassing all three categories can eliminate as many as 42% of all dynamic loads and about 18% of all committed stores. Finally, we contribute a new architectural technique to implement the unified scheme and show that it reduces execution time, yielding a noticeable speedup, and reduces overall energy dissipation with very low area overhead.
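
The three load categories named in the abstract can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the architecture proposed in the paper: the `LoadClassifier` class, its fields, and the `summarize` helper are hypothetical names, "8 bits" is assumed to mean an unsigned value in the range 0 to 255, and a load is treated as forwarded whenever its address matches an in-flight access. The trace is invented for illustration and does not reproduce the paper's measurements.

```python
from dataclasses import dataclass, field

SMALL_VALUE_BITS = 8  # abstract: small values can be coded in 8 bits or fewer


@dataclass
class LoadClassifier:
    """Toy model of the three categories; all names here are hypothetical."""
    # Values currently held in the physical register file (assumed model state).
    register_values: set = field(default_factory=set)
    # Address -> value for in-flight loads and stores (assumed model state).
    in_flight: dict = field(default_factory=dict)

    def classify(self, address, value):
        """Return every category that applies to one dynamic load."""
        categories = set()
        if address in self.in_flight:
            # Load-to-load or store-to-load forwarding is possible.
            categories.add("forwarded")
        if 0 <= value < (1 << SMALL_VALUE_BITS):
            # Assumption: unsigned encoding of the 8-bit limit.
            categories.add("small-value")
        if value in self.register_values:
            # The value is already available in a physical register.
            categories.add("silent")
        return categories


def summarize(classifier, loads):
    """Fraction of loads per category, plus the fraction in any category."""
    counts = {"forwarded": 0, "small-value": 0, "silent": 0, "any": 0}
    for address, value in loads:
        categories = classifier.classify(address, value)
        for name in categories:
            counts[name] += 1
        if categories:
            counts["any"] += 1
    total = len(loads)
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count / total for name, count in counts.items()}


# Invented four-load trace: (address, loaded value).
classifier = LoadClassifier(register_values={7, 100000}, in_flight={0x1000: 7})
loads = [(0x1000, 7), (0x2000, 100000), (0x3000, 3), (0x4000, 999999)]
fractions = summarize(classifier, loads)
print(fractions)
```

In this toy trace the first load falls into all three categories, so the per-category fractions sum to more than the "any" fraction. The abstract's figures are consistent with the same kind of overlap: the three categories add up to 77% of dynamic loads, while the hypothetical scheme covering all of them eliminates at most 42%.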