Multigrain shared memory

Authors:
Donald Yeung;John Kubiatowicz;Anant Agarwal
Affiliations:
Univ. of Maryland at College Park, College Park;Univ. of California at Berkeley, Berkeley;Massachusetts Institute of Technology, Cambridge
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
2000

Citing 28
Cited 4

TLB consistency on highly-parallel shared-memory multiprocessors

Proceedings of the Twenty-First Annual Hawaii International Conference on Architecture Track
Translation lookaside buffer consistency: a software approach

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Low-synchronization translation lookaside buffer consistency in large-scale shared-memory multiprocessors

SOSP '89 Proceedings of the twelfth ACM symposium on Operating systems principles
Memory coherence in shared virtual memory systems

ACM Transactions on Computer Systems (TOCS)
Run-Time Parallelization and Scheduling of Loops

IEEE Transactions on Computers
LimitLESS directories: A scalable cache coherence scheme

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Implementation and performance of Munin

SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
Lazy release consistency for software distributed shared memory

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Integrating message-passing and shared-memory: early experience

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Anatomy of a message in the Alewife multiprocessor

ICS '93 Proceedings of the 7th international conference on Supercomputing
Software versus hardware shared-memory implementation: a case study

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The Stanford FLASH multiprocessor

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Efficient support for irregular applications on distributed-memory machines

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
The MIT Alewife machine: architecture and performance

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
U-Net: a user-level network interface for parallel and distributed computing

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
CRL: high-performance all-software distributed shared memory

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
MGS: a multigrain shared memory system

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Shasta: a low overhead, software-only approach for supporting fine-grain shared memory

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
SoftFLASH: analyzing the performance of clustered distributed virtual shared memory

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems

OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
The SGI Origin: a ccNUMA highly scalable server

Proceedings of the 24th annual international symposium on Computer architecture
Cashmere-2L: software coherent shared memory on a clustered remote-write network

Proceedings of the sixteenth ACM symposium on Operating systems principles
Performance Evaluation of a Cluster-Based Multiprocessor Built from ATM Switches and Bus-Based Multiprocessor Servers

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Scheduling Communication on an SMP Node Parallel Machine

HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Home-Based SVM Protocols for SMP Clusters: Design and Performance

HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
Fine-Grain Software Distributed Shared Memory on SMP Clusters

HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
SPLASH: Stanford parallel applications for shared-memory*

SPLASH: Stanford parallel applications for shared-memory*
Multigrain shared memory

Multigrain shared memory

Shared Virtual Memory Clusters with Next-Generation Interconnection Networks and Wide Compute Nodes

HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
Using System Emulation to Model Next-Generation Shared Virtual Memory Clusters

Cluster Computing
Circulating shared-registers for multiprocessor systems

Journal of Systems Architecture: the EUROMICRO Journal
Exploiting locality: a flexible DSM approach

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Parallel workstations, each comprising tens of processors based on shared memory, promise cost-effective scalable multiprocessing. This article explores the coupling of such small- to medium-scale shared-memory multiprocessors through software over a local area network to synthesize larger shared-memory systems. We call these systems Distributed Shared-memory MultiProcessors (DSMPs). This article introduces the design of a shared-memory system that uses multiple granularities of sharing, called MGS, and presents a prototype implementation of MGS on the MIT Alewife multiprocessor. Multigrain shared memory enables the collaboration of hardware and software shared memory, thus synthesizing a single transparent shared-memory address space across a cluster of multiprocessors. The system leverages the efficient support for fine-grain cache-line sharing within multiprocessor nodes as often as possible, and resorts to coarse-grain page-level sharing across nodes only when absolutely necessary. Using our prototype implementation of MGS, an in-depth study of several shared-memory application is conducted to understand the behavior of DSMPs. Our study is the first to comprehensively explore the DSMP design space, and teh compare the performance of DSMPs against all-software and all-hardware DSMs on a signle experimental platform. Keeping the total number of processors fixed, we show that applications execute up to 85% faster on a DSMP as compared to an all-software DSM. We also show that all-hardware DSMs hold a significant performance advantage over DSMPs on challenging applications, between 159% and 1014%. However, program transformations to improve data locality for these applications allow DSMPs to almost match the performance of an all-hardware multiprocessor of the same size.