Performance evaluation of hybrid hardware and software distributed shared memory protocols

Authors:
Rohit Chandra;Kourosh Gharachorloo;Vijayaraghavan Soundararajan;Anoop Gupta
Affiliations:
Computer Systems Laboratory, Stanford University, CA;Computer Systems Laboratory, Stanford University, CA;Computer Systems Laboratory, Stanford University, CA;Computer Systems Laboratory, Stanford University, CA
Venue:
ICS '94 Proceedings of the 8th international conference on Supercomputing
Year:
1994

Abstract

Hardware distributed shared memory (DSM) systems efficiently support fine grain sharing of data by maintaining coherence at the level of individual cache lines and providing automatic replication in processor caches. Software DSM systems, on the other hand, amortize high communication costs by maintaining coherence at coarser granularities and replicating data at the level of local main memories. Even though software DSM systems have traditionally been targeted towards loosely coupled environments, some of the techniques are potentially useful in the context of tightly coupled multiprocessors. In particular, communicating data at a coarse grain can sometimes be more efficient than transferring the data as individual cache lines. Furthermore, replication in local memories can accommodate applications with larger working sets as compared to replication in processor caches only. Therefore, combining the two techniques in a hybrid protocol can potentially exploit the benefits of each approach.

This paper proposes one such hybrid protocol and evaluates its performance in the context of the FLASH multiprocessor architecture. The hybrid system allows the programmer to optionally identify regions of data shared at a coarse granularity. Coherence for such data is maintained at the grain of the entire region using a software-DSM-style protocol. We evaluate the performance gains of this approach through a detailed simulation study of several parallel applications. Our preliminary results show that the hybrid protocol can eliminate a substantial fraction of remote cache misses through bulk transfer of coarse grain data regions and replication of such data in local memories. The performance gains over hardware cache coherence are modest at low network latencies, but increase substantially at higher network latencies and processor speeds. Finally, we show that similar to cache-only memory architectures, the hybrid protocol is insensitive to data placement issues.
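The trade-off between fetching a region as individual cache lines and fetching it as one bulk transfer can be illustrated with a toy cost model. The sketch below is not taken from the paper: the function names, the cost formulas, and every parameter value (line size, latencies, software overhead, bandwidth) are hypothetical, and it assumes that the processor touches every line of the region and that remote misses do not overlap.

```python
import math


def line_by_line_cycles(region_bytes, line_bytes, remote_latency):
    """Cycles to read a whole region through one remote miss per cache line.

    Assumption: misses are serialized, so each line pays the full latency.
    """
    lines = math.ceil(region_bytes / line_bytes)
    return lines * remote_latency


def bulk_transfer_cycles(region_bytes, remote_latency, software_overhead,
                         bytes_per_cycle):
    """Cycles to fetch the region once as a single coarse-grain message.

    Assumption: a fixed software protocol overhead, one network latency,
    and a bandwidth-limited transfer time.
    """
    transfer = math.ceil(region_bytes / bytes_per_cycle)
    return software_overhead + remote_latency + transfer


def bulk_advantage(region_bytes, line_bytes, remote_latency,
                   software_overhead, bytes_per_cycle):
    """Ratio > 1 means the bulk transfer is cheaper in this toy model."""
    fine = line_by_line_cycles(region_bytes, line_bytes, remote_latency)
    coarse = bulk_transfer_cycles(region_bytes, remote_latency,
                                  software_overhead, bytes_per_cycle)
    return fine / coarse


if __name__ == "__main__":
    # Hypothetical parameters chosen only to show the trend.
    for latency in (100, 200, 400):
        ratio = bulk_advantage(region_bytes=4096, line_bytes=128,
                               remote_latency=latency,
                               software_overhead=2000, bytes_per_cycle=8)
        print(f"remote latency {latency} cycles: advantage {ratio:.2f}x")
```

In this model the fixed software overhead dominates for small regions, so line-by-line fetching wins there, while the advantage of bulk transfer grows as remote latency rises. That matches the qualitative trend the abstract describes, but the numbers themselves carry no evidential weight.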