Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors

Authors:
Fredrik Dahlgren; Michel Dubois; Per Stenström
Affiliations:
Chalmers Univ. of Technology, Gothenburg, Sweden; Univ. of Southern California, Los Angeles; Chalmers Univ. of Technology, Gothenburg, Sweden
Venue:
IEEE Transactions on Computers
Year:
1998

Citing 16

A characterization of sharing in parallel programs and its application to coherency protocol evaluation
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture

Memory Access Dependencies in Shared-Memory Multiprocessors
IEEE Transactions on Software Engineering

Performance evaluation of memory consistency models for shared-memory multiprocessors
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems

Tolerating latency through software-controlled prefetching in shared-memory multiprocessors
Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors

Comparative evaluation of latency reducing and tolerating techniques
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture

SPLASH: Stanford parallel applications for shared-memory
ACM SIGARCH Computer Architecture News

Hiding memory latency using dynamic scheduling in shared-memory multiprocessors
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture

Comparative performance evaluation of cache-coherent NUMA and COMA architectures
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture

Adaptive cache coherency for detecting migratory shared data
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture

An adaptive cache coherence protocol optimized for migratory sharing
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture

Combined performance gains of simple cache protocol extensions
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture

Using write caches to improve performance of cache coherence protocols in shared-memory multiprocessors
Journal of Parallel and Distributed Computing

Implementation and evaluation of update-based cache protocols under relaxed memory consistency models
Future Generation Computer Systems

Evaluation of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems

Sequential Hardware Prefetching in Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems

The Impact of Instruction-Level Parallelism on Multiprocessor Performance and Simulation Methodology
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture

Cited 2

Merging, sorting and matrix operations on the SOME-bus multiprocessor architecture
Future Generation Computer Systems - Special issue: Advanced services for clusters and internet computing

81.6 GOPS object recognition processor based on a memory-centric NoC
IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Quantified Score

Hi-index: 14.98

Abstract

We evaluate three extensions to directory-based cache coherence protocols in shared-memory multiprocessors. These extensions are aimed at reducing the penalties associated with memory accesses and include a hardware prefetching scheme, a migratory sharing optimization, and a competitive-update mechanism. Since each extension targets distinct components of the read and write penalties, they can be combined effectively. This paper identifies the combinations yielding the best performance gains and cost trade-offs in the context of a class of cache-coherent NUMA (Non-Uniform Memory Access) architectures. Detailed architectural simulations of a multiprocessor with single-issue, statically scheduled CPUs, using five benchmarks, show that the protocol extensions often provide additive gains when they are properly combined. For example, the combination of prefetching with the competitive-update mechanism speeds up the execution by nearly a factor of two under release consistency. The same speedup is obtained under sequential consistency by combining prefetching with the migratory sharing optimization. This paper shows that a basic write-invalidate protocol augmented by appropriate extensions can eliminate most memory access penalties without any support from the programmer or the compiler.