Shasta: a low overhead, software-only approach for supporting fine-grain shared memory

Authors:
Daniel J. Scales; Kourosh Gharachorloo; Chandramohan A. Thekkath
Affiliations:
Western Research Laboratory, Digital Equipment Corporation; Western Research Laboratory, Digital Equipment Corporation; Systems Research Center, Digital Equipment Corporation
Venue:
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Year:
1996

Abstract

This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that shared data can be kept coherent at a fine granularity. In addition, the system allows the coherence granularity to vary across different shared data structures in a single application. Shasta implements the shared address space by transparently rewriting the application executable to intercept loads and stores. For each shared load or store, the inserted code checks to see if the data is available locally and communicates with other processors if necessary. The system uses numerous techniques to reduce the run-time overhead of these checks. Since Shasta is implemented entirely in software, it also provides tremendous flexibility in supporting different types of cache coherence protocols. We have implemented an efficient cache coherence protocol that incorporates a number of optimizations, including support for multiple communication granularities and use of relaxed memory models. This system is fully functional and runs on a cluster of Alpha workstations.

The primary focus of this paper is to describe the techniques used in Shasta to reduce the checking overhead for supporting fine granularity sharing in software. These techniques include careful layout of the shared address space, scheduling the checking code for efficient execution on modern processors, using a simple method that checks loads using only the value loaded, reducing the extra cache misses caused by the checking code, and combining the checks for multiple loads and stores. To characterize the effect of these techniques, we present detailed performance results for the SPLASH-2 applications running on an Alpha processor. Without our optimizations, the checking overheads are excessively high, exceeding 100% for several applications. However, our techniques are effective in reducing these overheads to a range of 5% to 35% for almost all of the applications. We also describe our coherence protocol and present some preliminary results on the parallel performance of several applications running on our workstation cluster. Our experience so far indicates that once the cost of checking memory accesses is reduced using our techniques, the Shasta approach is an attractive software solution for supporting a shared address space with fine-grain access to data.
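As a rough illustration of the inline access checks described above, the following Python sketch simulates one node that checks each shared load and store before touching local memory. It is not the Shasta implementation: the abstract gives no code, and the class `Node`, the constants `LINE_SIZE`, `WORD_SIZE`, and `FLAG`, the three line states, and the dictionary standing in for remote memory are all assumptions made for this example. The `load` method shows one plausible reading of "checks loads using only the value loaded": invalid lines are filled with a sentinel, so the state table is consulted only when the loaded value equals that sentinel. The `load_batch` method sketches how checks for several accesses might be combined.

```python
# Illustrative sketch only. Names, sizes, and the sentinel value are
# assumptions for this example, not details taken from the paper's abstract.
LINE_SIZE = 64    # bytes per coherence line (assumed)
WORD_SIZE = 8     # bytes per word (assumed)
FLAG = -253       # sentinel stored in every word of an invalid line (assumed)

INVALID, SHARED, EXCLUSIVE = range(3)


class Node:
    """One processor's local view of the shared address space."""

    def __init__(self, num_lines, home):
        words = num_lines * LINE_SIZE // WORD_SIZE
        self.memory = [FLAG] * words        # every line starts invalid
        self.state = [INVALID] * num_lines  # per-line state table
        self.home = home                    # dict (word index -> value) standing in for remote memory
        self.misses = 0                     # number of simulated protocol requests

    def _line_of(self, addr):
        return addr // LINE_SIZE

    def _fetch_line(self, line, new_state):
        """Simulate a protocol request that brings a whole line into local memory."""
        first = line * LINE_SIZE // WORD_SIZE
        for w in range(first, first + LINE_SIZE // WORD_SIZE):
            self.memory[w] = self.home.get(w, 0)
        self.state[line] = new_state
        self.misses += 1

    def load(self, addr):
        """Value-based check: look at the state table only if the value equals FLAG."""
        value = self.memory[addr // WORD_SIZE]
        if value == FLAG:
            line = self._line_of(addr)
            if self.state[line] == INVALID:      # real miss
                self._fetch_line(line, SHARED)
                value = self.memory[addr // WORD_SIZE]
            # Otherwise the data really is FLAG (a "false miss"); no fetch is needed.
        return value

    def store(self, addr, value):
        """Stores always consult the state table and need exclusive access."""
        line = self._line_of(addr)
        if self.state[line] != EXCLUSIVE:
            self._fetch_line(line, EXCLUSIVE)    # fetch or upgrade, counted as one request
        self.memory[addr // WORD_SIZE] = value

    def load_batch(self, addrs):
        """Combined check: validate each distinct line once, then load unchecked."""
        for line in {self._line_of(a) for a in addrs}:
            if self.state[line] == INVALID:
                self._fetch_line(line, SHARED)
        return [self.memory[a // WORD_SIZE] for a in addrs]
```

With `node = Node(num_lines=2, home={0: 11, 1: 22})`, the first `node.load(0)` triggers one simulated fetch, and later loads from the same line cost only the comparison against `FLAG`. The simulation never writes stores back to `home` and models no invalidations, so it illustrates the checking path only, not a coherence protocol.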