Computer architecture: a quantitative approach
Computer architecture: a quantitative approach
The detection and elimination of useless misses in multiprocessors
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Internal organization of the Alpha 21164, a 300-MHz 64-bit quad-issue CMOS RISC microprocessor
Digital Technical Journal - Special 10th anniversary issue
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Value locality and load value prediction
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Exceeding the dataflow limit via value prediction
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Proceedings of the 24th annual international symposium on Computer architecture
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Modern compiler implementation in Java
Modern compiler implementation in Java
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
On the value locality of store instructions
Proceedings of the 27th annual international symposium on Computer architecture
Eager writeback - a technique for improving bandwidth utilization
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Reducing Memory Traffic Via Redundant Store Instructions
HPCN Europe '99 Proceedings of the 7th International Conference on High-Performance Computing and Networking
An architectural alternative to optimizing compilers
ASPLOS I Proceedings of the first international symposium on Architectural support for programming languages and operating systems
Improving CC-NUMA Performance Using Instruction-Based Prediction
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Control-Flow Speculation through Value Prediction for Superscalar Processors
PACT '99 Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
Characterization of Silent Stores
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
An Architectural Evaluation of Java TPC-W
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Memory dependence prediction
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Using thread-level speculation to simplify manual parallelization
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Fast Secure Processor for Inhibiting Software Piracy and Tampering
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Improving Memory Encryption Performance in Secure Processors
IEEE Transactions on Computers
Exposing speculative thread parallelism in SPEC2000
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
SlicK: slice-based locality exploitation for efficient redundant multithreading
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Proceedings of the 20th annual international conference on Supercomputing
Zero loads: canceling load requests by tracking zero values
Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
LIRAC: using live range information to optimize memory access
ARCS'07 Proceedings of the 20th international conference on Architecture of computing systems
Characterization and exploitation of narrow-width loads: the narrow-width cache approach
CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Simulating a LAGS processor to consider variable latency on L1 D-Cache
Proceedings of the 2010 Summer Computer Simulation Conference
A unified approach to eliminate memory accesses early
CASES '11 Proceedings of the 14th international conference on Compilers, architectures and synthesis for embedded systems
Power-Aware processors for wireless sensor networks
ISCIS'06 Proceedings of the 21st international conference on Computer and Information Sciences
Reducing energy dissipation of wireless sensor processors using silent-store-filtering motecache
PATMOS'06 Proceedings of the 16th international conference on Integrated Circuit and System Design: power and Timing Modeling, Optimization and Simulation
SI-TM: reducing transactional memory abort rates through snapshot isolation
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Removal of Conflicts in Hardware Transactional Memory Systems
International Journal of Parallel Programming
Hi-index | 14.98 |
Value locality, a recently discovered program attribute that describes the likelihood of the recurrence of previously seen program values, has been studied enthusiastically in the recent published literature. Much of the energy has focused on refining the initial efforts at predicting load instruction outcomes, with the balance of the effort examining the value locality of either all register-writing instructions or a focused subset of them. Surprisingly, there has been very little published characterization of or effort to exploit the value locality of data words stored to memory by computer programs. This paper presents such a characterization, including detailed source-level analysis of the causes of silent stores, proposes both memory-centric (based on message passing) and producer-centric (based on program structure) prediction mechanisms for stored data values, introduces the concept of silent stores and new definitions of multiprocessor false sharing based on these observations, and suggests new techniques for aligning cache coherence protocols and microarchitectural store handling techniques to exploit the value locality of stores. We find that realistic implementations of these techniques can significantly reduce multiprocessor data bus traffic and are more effective at reducing address bus traffic than the addition of Exclusive state to a MSI coherence protocol. We also show that squashing of silent stores can provide uniprocessor speedups greater than the addition of store-to-load forwarding.