SoftFLASH: analyzing the performance of clustered distributed virtual shared memory

Authors:
Andrew Erlichson;Neal Nuckolls;Greg Chesson;John Hennessy
Affiliations:
Computer Systems Lab, Stanford University, Stanford, CA;Silicon Graphics Inc., 2011 North Shoreline Blvd., Mountain View, CA;Silicon Graphics Inc., 2011 North Shoreline Blvd., Mountain View, CA;Computer Systems Lab, Stanford University, Stanford, CA
Venue:
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Year:
1996

Citing 21
Cited 30

Low-synchronization translation lookaside buffer consistency in large-scale shared-memory multiprocessors

SOSP '89 Proceedings of the twelfth ACM symposium on Operating systems principles
The Amber system: parallel programming on a network of multiprocessors

SOSP '89 Proceedings of the twelfth ACM symposium on Operating systems principles
Multi-level shared caching techniques for scalability in VMP-M/C

ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Memory coherence in shared virtual memory systems

ACM Transactions on Computer Systems (TOCS)
Delayed consistency and its effects on the miss rate of parallel programs

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Lazy release consistency for software distributed shared memory

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Performance evaluation of hybrid hardware and software distributed shared memory protocols

ICS '94 Proceedings of the 8th international conference on Supercomputing
Software versus hardware shared-memory implementation: a case study

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The Stanford FLASH multiprocessor

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The performance advantages of integrating block data transfer in cache-coherent multiprocessors

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Design of the Munin distributed shared memory system

Journal of Parallel and Distributed Computing - Special issue on distributed shared memory systems
The MIT Alewife machine: architecture and performance

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
CRL: high-performance all-software distributed shared memory

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
The benefits of clustering in shared address space multiprocessors: an applications-driven investigation

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
MGS: a multigrain shared memory system

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Lazy release consistency for distributed shared memory

Lazy release consistency for distributed shared memory
Memory consistency and event ordering in scalable shared-memory multiprocessors

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Portable Programs for Parallel Processors

Portable Programs for Parallel Processors
Performance Evaluation of a Cluster-Based Multiprocessor Built from ATM Switches and Bus-Based Multiprocessor Servers

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors

The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors

Shasta: a low overhead, software-only approach for supporting fine-grain shared memory

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Design and performance of the Shasta distributed shared memory protocol

ICS '97 Proceedings of the 11th international conference on Supercomputing
Relaxed consistency and coherence granularity in DSM systems: a performance evaluation

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
VM-based shared memory on low-latency, remote-memory-access networks

Proceedings of the 24th annual international symposium on Computer architecture
Towards transparent and efficient software distributed shared memory

Proceedings of the sixteenth ACM symposium on Operating systems principles
Cashmere-2L: software coherent shared memory on a clustered remote-write network

Proceedings of the sixteenth ACM symposium on Operating systems principles
Evaluation of hardware write propagation support for next-generation shared virtual memory clusters

ICS '98 Proceedings of the 12th international conference on Supercomputing
Predicting the performance of distributed virtual shared-memory applications

IBM Systems Journal
Using network interface support to avoid asynchronous protocol processing in shared virtual memory systems

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
MagPIe: MPI's collective communication operations for clustered wide area systems

Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
The scalability of multigrain systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Comparative study of page-based and segment-based software DSM through compiler optimization

Proceedings of the 14th international conference on Supercomputing
A Programming Methodology for Dual-Tier Multicomputers

IEEE Transactions on Software Engineering - Special issue on architecture-independent languages and software tools for parallel processing
Multigrain shared memory

ACM Transactions on Computer Systems (TOCS)
Accelerating shared virtual memory via general-purpose network interface support

ACM Transactions on Computer Systems (TOCS)
MPI versus MPI+OpenMP on IBM SP for the NAS benchmarks

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Removing the overhead from software-based shared memory

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Design issues for a high-performance distributed shared memory on symmetrical multiprocessor clusters

Cluster Computing
Shared Virtual Memory Clusters with Next-Generation Interconnection Networks and Wide Compute Nodes

HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
Performance of the NAS Benchmarks on a Cluster of SMP PCs Using a Parallelization of the MPI Programs with OpenMP

PaCT '999 Proceedings of the 5th International Conference on Parallel Computing Technologies
A New Home-Based Software DSM Protocol for SMP Clusters

Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Evaluation of Compiler-Assisted Software DSM Schemes for a Workstation Cluster

IWIA '99 Proceedings of the 1999 International Workshop on Innovative Architecture
Using System Emulation to Model Next-Generation Shared Virtual Memory Clusters

Cluster Computing
Shared virtual memory clusters: bridging the cost-performance gap between SMPs and hardware DSM systems

Journal of Parallel and Distributed Computing
Evaluating scheduling policies for fine-grain communication protocols on a cluster of SMPs

Journal of Parallel and Distributed Computing
In-Kernel Integration of Operating System and Infiniband Functions for High Performance Computing Clusters: A DSM Example

IEEE Transactions on Parallel and Distributed Systems
Shared memory computing on clusters with symmetric multiprocessors and system area networks

ACM Transactions on Computer Systems (TOCS)
A Transparent Distributed Shared Memory for Clustered Symmetric Multiprocessors

The Journal of Supercomputing
Brazos: a third generation DSM system

NT'97 Proceedings of the USENIX Windows NT Workshop on The USENIX Windows NT Workshop 1997
Exploiting locality: a flexible DSM approach

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment across the clusters, it is possible to use a virtual shared-memory software layer. Because of the low latency and high bandwidth of the interconnect available within each cluster, there are clear advantages in making the clusters as large as possible. The critical question then becomes whether the latency and bandwidth of the top-level network and the software system are sufficient to support the communication demands generated by the clusters.To explore these questions, we have built an aggressive kernel implementation of a virtual shared-memory system using SGI multiprocessors and 100Mbyte/sec HIPPI interconnects. The system obtains speedups on 32 processors (four nodes, eight processors per node plus additional reserved protocol processors) that range from 6.9 on the communication-intensive FFT program to 21.6 on Ocean (both from the SPLASH 2 suite). In general, clustering is effective in reducing internode miss rates, but as the cluster size increases, increases in the remote latency, mostly due to increased TLB synchronization cost, offset the advantages. For communication-intensive applications, such as FFT, the overhead of sending out network requests, the limited network bandwidth, and the long network latency prevent the achievement of good performance. Overall, this approach still appears promising, but our results indicate that large low latency networks may be needed to make cluster-based virtual shared-memory machines broadly useful as large-scale shared-memory multiprocessors.