Software versus hardware shared-memory implementation: a case study

Authors:
A. L. Cox;S. Dwarkadas;P. Keleher;H. Lu;R. Rajamony;W. Zwaenepoel
Affiliations:
Department of Computer Science, Rice University;Department of Computer Science, Rice University;Department of Computer Science, Rice University;Department of Computer Science, Rice University;Department of Computer Science, Rice University;Department of Computer Science, Rice University
Venue:
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Year:
1994

Abstract

We compare the performance of software-supported shared memory on a general-purpose network to hardware-supported shared memory on a dedicated interconnect.

Up to eight processors, our results are based on the execution of a set of application programs on an SGI 4D/480 multiprocessor and on TreadMarks, a distributed shared memory system that runs on a Fore ATM LAN of DECstation-5000/240s. Since the DECstation and the 4D/480 use the same processor, primary cache, and compiler, the shared-memory implementation is the principal difference between the systems. Our results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the difference in performance grows as the synchronization frequency increases. For applications that require a large amount of memory bandwidth, TreadMarks can perform better than the SGI 4D/480.

Beyond eight processors, our results are based on execution-driven simulation. Specifically, we compare a software implementation on a general-purpose network of uniprocessor nodes, a hardware implementation using a directory-based protocol on a dedicated interconnect, and a combined implementation using software to provide shared memory between multiprocessor nodes with hardware implementing shared memory within a node. For the modest size of the problems that we can simulate, the hardware implementation scales well and the software implementation scales poorly. The combined approach delivers performance close to that of the hardware implementation for applications with small to moderate synchronization rates and good locality. Reductions in communication overhead improve the performance of the software and the combined approach, but synchronization remains a bottleneck.