Exploring, defining, and exploiting recent store value locality

Authors:
Kevin M. Lepak;Mikko H. Lipasti
Venue:
Exploring, defining, and exploiting recent store value locality
Year:
2003

Citing 0
Cited 2

Verification of chip multiprocessor memory systems using a relaxed scoreboard
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture

Edge chasing delayed consistency: pushing the limits of weak memory models
Proceedings of the 2012 ACM workshop on Relaxing synchronization for multicore and manycore scalability

Abstract

This thesis is motivated by the growing differential between main memory and microprocessor core performance. Increased integration, enabled by Moore's law, has provided a substantial compound improvement in core performance. Integration has benefited main memory latency less significantly, leading to an expanding memory gap. Furthermore, in multiprocessors, increasing integration has allowed on-chip cache structures to grow, which continues to reduce capacity and conflict misses; however, communication misses remain, limiting the performance of multithreaded workloads. Computer architects have historically exploited locality in both the temporal and spatial dimensions to improve memory system performance. Recently, a new locality dimension has emerged, unveiling additional potential for performance improvement. Value locality describes the program behavior in which values recur. Many researchers have examined value locality as a means to improve memory system performance. However, most research has focused on predicting load values, as loads are believed to be latency critical. In contrast, conventional wisdom says stores are not latency critical and need only be buffered and forwarded for acceptable performance. In this thesis, we show that stores exhibit significant value locality and should be examined as a means of improving memory performance for both uniprocessors and multiprocessors. For example, approximately 40% of stores are update silent: they write the value that already exists at the memory location, and thus contribute no change in system state. We show numerous methods of exploiting store value locality to increase performance. In uniprocessors, we detail improvements in core efficiency; in multiprocessors, significant reductions in communication between processors.
We focus predominantly on multiprocessors, making a fundamental contribution in redefining multiprocessor sharing to consider two dimensions of store value locality. Furthermore, we describe both speculative and non-speculative methods that achieve substantial performance benefit by exploiting store value locality in both scientific and commercial workloads. Many of our proposals can be integrated into existing microprocessor designs with coherence protocol changes, while others rely on existing coherence mechanisms to reap tangible benefit. We perform a detailed performance evaluation, using full-system, execution-driven simulation, to show the merits of different designs.
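
To make the notion of an update silent store concrete, the following is a minimal sketch, not the thesis's methodology. It assumes a store trace given as (address, value) pairs and a simple dictionary standing in for memory; the function name `count_update_silent`, the trace format, and the choice to treat the first write to an untouched address as non-silent are all illustrative assumptions.

```python
def count_update_silent(stores, memory=None):
    """Count update silent stores in a trace of (address, value) pairs.

    A store is counted as update silent when the value it writes equals
    the value already held at that address, so memory state is unchanged.
    Assumption: a store to an address with no known contents is not silent.
    """
    memory = dict(memory or {})  # copy so the caller's state is untouched
    silent = 0
    total = 0
    for address, value in stores:
        total += 1
        if address in memory and memory[address] == value:
            silent += 1  # same value already present: no state change
        else:
            memory[address] = value  # the store changes memory state
    return silent, total


# Hypothetical five-store trace used only for illustration.
trace = [(0x10, 1), (0x10, 1), (0x14, 7), (0x10, 2), (0x14, 7)]
silent, total = count_update_silent(trace)
print(f"{silent} of {total} stores are update silent")
```

In this toy trace the second and fifth stores rewrite values already in place, so two of the five stores are update silent. In a multiprocessor, such stores are the ones that in principle need not trigger communication, since no other processor can observe a change.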