Silent Stores and Store Value Locality

Authors:
Kevin M. Lepak;Gordon B. Bell;Mikko H. Lipasti
Affiliations:
Univ. of Wisconsin, Madison;Univ. of Wisconsin, Madison;Univ. of Wisconsin, Madison
Venue:
IEEE Transactions on Computers
Year:
2001

Abstract

Value locality, a recently discovered program attribute that describes the likelihood of the recurrence of previously seen program values, has been studied enthusiastically in the recently published literature. Much of the energy has focused on refining the initial efforts at predicting load instruction outcomes, with the balance of the effort examining the value locality of either all register-writing instructions or a focused subset of them. Surprisingly, there has been very little published characterization of or effort to exploit the value locality of data words stored to memory by computer programs. This paper presents such a characterization, including detailed source-level analysis of the causes of silent stores, proposes both memory-centric (based on message passing) and producer-centric (based on program structure) prediction mechanisms for stored data values, introduces the concept of silent stores and new definitions of multiprocessor false sharing based on these observations, and suggests new techniques for aligning cache coherence protocols and microarchitectural store handling techniques to exploit the value locality of stores. We find that realistic implementations of these techniques can significantly reduce multiprocessor data bus traffic and are more effective at reducing address bus traffic than the addition of Exclusive state to an MSI coherence protocol. We also show that squashing of silent stores can provide uniprocessor speedups greater than the addition of store-to-load forwarding.
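As a rough illustration of the silent-store idea named in the abstract: a store is silent when it writes the value already held at its target address, so it leaves memory unchanged. The sketch below is a minimal software model, not the paper's hardware mechanism. The trace format of (address, value) pairs, the function names, and the assumption that untouched memory reads as zero are all hypothetical choices made for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical trace format: one (address, value) pair per dynamic store.
Store = Tuple[int, int]


def classify_stores(trace: Iterable[Store], initial_value: int = 0) -> List[bool]:
    """Return one flag per store; True means the store is silent.

    Assumption: memory that has never been written reads as `initial_value`.
    """
    memory: Dict[int, int] = {}
    flags: List[bool] = []
    for address, value in trace:
        current = memory.get(address, initial_value)
        silent = current == value      # same value already present
        flags.append(silent)
        if not silent:
            memory[address] = value    # only non-silent stores change state
    return flags


def silent_fraction(trace: Iterable[Store]) -> float:
    """Fraction of stores in the trace that are silent (0.0 for an empty trace)."""
    flags = classify_stores(trace)
    return sum(flags) / len(flags) if flags else 0.0


if __name__ == "__main__":
    trace = [(0x10, 5), (0x10, 5), (0x20, 0), (0x10, 7), (0x10, 7)]
    print(classify_stores(trace))  # [False, True, True, False, True]
    print(silent_fraction(trace))  # 0.6
```

In this toy trace, the second store to 0x10 and the store of zero to the never-written 0x20 are both silent. Skipping the memory update for such stores is the software analogue of squashing them: the visible state is identical either way.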