Reducing Ownership Overhead for Load-Store Sequences in Cache-Coherent Multiprocessors

Venue:
IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Year:
2000

Abstract

Parallel programs that modify shared data in a cache-coherent multiprocessor with a write-invalidate coherence protocol incur ownership overhead, because writes to shared data require ownership acquisitions. This overhead can have a significant impact on performance in a cache-coherent non-uniform memory architecture (NUMA) multiprocessor. Combining a read request with an ownership acquisition can potentially reduce both write latency and network traffic.

In this paper, we propose a new hardware-based approach for performing this optimization by targeting load-store sequences, which we show to be a superset of migratory sharing. A load-store sequence consists of a global read request followed by a global write action to the same memory location from the same processor, without any intervening access to the same block from any other processor.

We use detailed simulation with four benchmark programs, including one on-line transaction processing workload and operating system execution, to examine the effectiveness of the proposed technique. The results show that the technique reduces write-related latency and network traffic more than previous hardware-based techniques, by up to twice as much.
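The definition of a load-store sequence given in the abstract can be illustrated with a small trace scanner. This is a minimal sketch and not the paper's hardware mechanism: the trace format, the function name `find_load_store_sequences`, and the simplification of tracking accesses at block granularity are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical trace record: (processor id, "read" or "write", block address).
# Each record stands for a *global* action, i.e. one that leaves the local cache.
Access = Tuple[int, str, int]


def find_load_store_sequences(trace: Iterable[Access]) -> List[Tuple[int, int]]:
    """Return (read_index, write_index) pairs that form load-store sequences.

    A pair qualifies when a global read by processor P on a block is followed
    by a global write by P on the same block, with no global access to that
    block by any other processor in between.
    """
    # block -> (processor, trace index) of the most recent global read
    # that has not yet been followed by any other global access.
    pending_read: Dict[int, Tuple[int, int]] = {}
    sequences: List[Tuple[int, int]] = []

    for index, (proc, op, block) in enumerate(trace):
        if op == "read":
            # A read by another processor is an intervening access, so it
            # replaces whatever read was pending on this block.
            pending_read[block] = (proc, index)
        elif op == "write":
            # Any write consumes the pending read on this block.
            pending = pending_read.pop(block, None)
            if pending is not None and pending[0] == proc:
                sequences.append((pending[1], index))
        else:
            raise ValueError(f"unknown operation: {op!r}")

    return sequences


example_trace = [
    (0, "read", 0x40),   # index 0
    (0, "write", 0x40),  # index 1: completes a sequence with index 0
    (1, "read", 0x40),   # index 2: the block migrates to processor 1
    (1, "write", 0x40),  # index 3: completes a sequence with index 2
    (0, "read", 0x80),   # index 4
    (1, "read", 0x80),   # index 5: intervening access by another processor
    (0, "write", 0x80),  # index 6: not a load-store sequence
]

print(find_load_store_sequences(example_trace))  # [(0, 1), (2, 3)]
```

In the example trace, the first two pairs on block 0x40 resemble a migratory pattern, in which each processor in turn reads and then writes the block. The accesses to block 0x80 do not qualify, because processor 1 reads the block between processor 0's read and write.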