The DASH prototype: implementation and performance

Authors:
Daniel Lenoski; James Laudon; Truman Joe; David Nakahira; Luis Stevens; Anoop Gupta; John Hennessy
Venue:
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Year:
1992

Abstract

The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence. While paper studies and software simulators are useful for understanding many high-level design trade-offs, prototypes are essential to ensure that no critical details are overlooked. A prototype provides convincing evidence of the feasibility of the design, allows one to accurately estimate both the hardware and the complexity cost of various features, and provides a platform for studying real workloads. A 16-processor prototype of the DASH multiprocessor has been operational for the last six months. In this paper, the hardware overhead of directory-based cache coherence in the prototype is examined. We also discuss the performance of the system and the speedups obtained by parallel applications running on the prototype. Using a sophisticated hardware performance monitor, we characterize the effectiveness of coherent caches and the relationship between an application's reference behavior and its speedup.