The Stanford Dash Multiprocessor

Authors:
Daniel Lenoski;James Laudon;Kourosh Gharachorloo;Wolf-Dietrich Weber;Anoop Gupta;John Hennessy;Mark Horowitz;Monica S. Lam
Affiliations:
-;-;-;-;-;-;-;-
Venue:
Computer
Year:
1992

Citing 12
Cited 279

Cache coherence protocols: evaluation using a multiprocessor simulation model

ACM Transactions on Computer Systems (TOCS)
Correct memory operation of cache-based multiprocessors

ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture
Hierarchical cache/bus architecture for shared memory multiprocessors

ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture
Efficient synchronization primitives for large-scale cache-coherent multiprocessors

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Scalable coherent interface

Computer
Paradigm: A Highly Scalable Shared-Memory Multicomputer Architecture

Computer - Special issue on cryptography
LimitLESS directories: A scalable cache coherence scheme

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Performance evaluation of memory consistency models for shared-memory multiprocessors

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors

Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
Comparative evaluation of latency reducing and tolerating techniques

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
An empirical evaluation of two memory-efficient directory methods

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
The directory-based cache coherence protocol for the DASH multiprocessor

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture

The DASH prototype: implementation and performance

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Cooperative shared memory: software and hardware for scalable multiprocessors

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Volume rendering on scalable shared-memory MIMD architectures

VVS '92 Proceedings of the 1992 workshop on Volume visualization
Heterogeneous parallel programming in Jade

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Dynamic object management for distributed data structures

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Global optimizations for parallelism and locality on scalable parallel machines

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
LogP: towards a realistic model of parallel computation

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Integrating message-passing and shared-memory: early experience

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Data locality and load balancing in COOL

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Restructuring a parallel simulation to improve cache behavior in a shared-memory multiprocessor: the value of distributed synchronization

PADS '93 Proceedings of the seventh workshop on Parallel and distributed simulation
Cache coherence in large-scale shared-memory multiprocessors: issues and comparisons

ACM Computing Surveys (CSUR)
Memory consistency models

ACM SIGOPS Operating Systems Review
Cooperative shared memory: software and hardware for scalable multiprocessors

ACM Transactions on Computer Systems (TOCS)
Issues and directions in scalable parallel computing

PODC '93 Proceedings of the twelfth annual ACM symposium on Principles of distributed computing
An adaptive cache coherence protocol optimized for migratory sharing

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Mechanisms for cooperative shared memory

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Object distribution in Orca using Compile-Time and Run-Time techniques

OOPSLA '93 Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications
Anatomy of a message in the Alewife multiprocessor

ICS '93 Proceedings of the 7th international conference on Supercomputing
Super-threading: architectural and software mechanisms for optimizing parallel computation

ICS '93 Proceedings of the 7th international conference on Supercomputing
Integrating volume data analysis and rendering on distributed memory architectures

PRS '93 Proceedings of the 1993 symposium on Parallel rendering
Parallel volume-rendering algorithm performance on mesh-connected multicomputers

PRS '93 Proceedings of the 1993 symposium on Parallel rendering
The Wisconsin Wind Tunnel: virtual prototyping of parallel computers

SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Developing parallel applications using high-performance simulation

PADD '93 Proceedings of the 1993 ACM/ONR workshop on Parallel and distributed debugging
Cache coherence using local knowledge

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Compiling for shared-memory and message-passing computers

ACM Letters on Programming Languages and Systems (LOPLAS)
On testing cache-coherent shared memories

SPAA '94 Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures
Performance evaluation of hybrid hardware and software distributed shared memory protocols

ICS '94 Proceedings of the 8th international conference on Supercomputing
Design and implementation of a prototype optical deflection network

SIGCOMM '94 Proceedings of the conference on Communications architectures, protocols and applications
A comparison of message passing and shared memory architectures for data parallel programs

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Software versus hardware shared-memory implementation: a case study

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The Stanford FLASH multiprocessor

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Software-extended coherent shared memory: performance and cost

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Parallel sorting by over partitioning

SPAA '94 Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures
Reactive synchronization algorithms for multiprocessors

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Simple compiler algorithms to reduce ownership overhead in cache coherence protocols

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Fine-grain access control for distributed shared memory

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Mixed consistency: a model for parallel programming (extended abstract)

PODC '94 Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing
The design of RPM: an FPGA-based multiprocessor emulator

FPGA '95 Proceedings of the 1995 ACM third international symposium on Field-programmable gate arrays
A Hierarchical Task Queue Organization for Shared-Memory Multiprocessor Systems

IEEE Transactions on Parallel and Distributed Systems
The Potential of Compile-Time Analysis to Adapt the Cache Coherence Enforcement Strategy to the Data Sharing Characteristics

IEEE Transactions on Parallel and Distributed Systems
SP2 system architecture

IBM Systems Journal
Reducing false sharing on shared memory multiprocessors through compile time data transformations

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
A comprehensive bibliography of distributed shared memory

ACM SIGOPS Operating Systems Review
Efficient shared memory with minimal hardware support

ACM SIGARCH Computer Architecture News
Memory system performance of UNIX on CC-NUMA multiprocessors

Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
On characterizing bandwidth requirements of parallel applications

Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Hive: fault containment for shared-memory multiprocessors

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
CRL: high-performance all-software distributed shared memory

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Lazy release consistency for hardware-coherent multiprocessors

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Architectural mechanisms for explicit communication in shared memory multiprocessors

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Data forwarding in scalable shared-memory multiprocessors

ICS '95 Proceedings of the 9th international conference on Supercomputing
Multithreading with the EM-4 distributed-memory multiprocessor

PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
A compiler algorithm that reduces read latency in ownership-based cache coherence protocols

PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Dynamic Partitioning of Non-Uniform Structured Workloads with Spacefilling Curves

IEEE Transactions on Parallel and Distributed Systems
Teapot: language support for writing memory coherence protocols

PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
pHluid: the design of a parallel functional language implementation on workstations

Proceedings of the first ACM SIGPLAN international conference on Functional programming
Decoupled hardware support for distributed shared memory

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
MGS: a multigrain shared memory system

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
COMA: an opportunity for building fault-tolerant scalable shared memory multiprocessors

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Application and architectural bottlenecks in large scale distributed shared memory machines

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Coherent network interfaces for fine-grain communication

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Limits on the performance benefits of multithreading and prefetching

Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Embra: fast and flexible machine simulation

Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Using dataflow analysis techniques to reduce ownership overhead in cache coherence protocols

ACM Transactions on Programming Languages and Systems (TOPLAS)
Designing Clustered Multiprocessor Systems under Packaging and Technological Advancements

IEEE Transactions on Parallel and Distributed Systems
A flexible operation execution model for shared distributed objects

Proceedings of the 11th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
The GLOW cache coherence protocol extensions for widely shared data

ICS '96 Proceedings of the 10th international conference on Supercomputing
Evaluating virtual channels for cache-coherent shared-memory multiprocessors

ICS '96 Proceedings of the 10th international conference on Supercomputing
Conservative circuit simulation on shared-memory multiprocessors

PADS '96 Proceedings of the tenth workshop on Parallel and distributed simulation
State reduction using reversible rules

DAC '96 Proceedings of the 33rd annual Design Automation Conference
Integrating formal verification methods with a conventional project design flow

DAC '96 Proceedings of the 33rd annual Design Automation Conference
Fast Parallel Sorting Under LogP: Experience with the CM-5

IEEE Transactions on Parallel and Distributed Systems
The Block Distributed Memory Model

IEEE Transactions on Parallel and Distributed Systems
An Architecture for Tolerating Processor Failures in Shared-Memory Multiprocessors

IEEE Transactions on Computers
Scheduler-conscious synchronization

ACM Transactions on Computer Systems (TOCS)
Data Forwarding in Scalable Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Compressionless Routing: A Framework for Adaptive and Fault-Tolerant Routing

IEEE Transactions on Parallel and Distributed Systems
Fusion of Loops for Parallelism and Locality

IEEE Transactions on Parallel and Distributed Systems
A performance evaluation of cluster architectures

SIGMETRICS '97 Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
HFS: a performance-oriented flexible file system based on building-block compositions

ACM Transactions on Computer Systems (TOCS)
Parallel breadth-first BDD construction

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Relaxed consistency and coherence granularity in DSM systems: a performance evaluation

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
The Mercury Interconnect Architecture: a cost-effective infrastructure for high-performance servers

Proceedings of the 24th annual international symposium on Computer architecture
Efficient synchronization: let them eat QOLB

Proceedings of the 24th annual international symposium on Computer architecture
Coherence controller architectures for SMP-based CC-NUMA multiprocessors

Proceedings of the 24th annual international symposium on Computer architecture
Reactive NUMA: a design for unifying S-COMA and CC-NUMA

Proceedings of the 24th annual international symposium on Computer architecture
A Survey of Recoverable Distributed Shared Virtual Memory Systems

IEEE Transactions on Parallel and Distributed Systems
Parallel hierarchical computation of specular radiosity

PRS '97 Proceedings of the IEEE symposium on Parallel rendering
An interaction of coherence protocols and memory consistency models in DSM systems

ACM SIGOPS Operating Systems Review
Performance evaluation of the Orca shared-object system

ACM Transactions on Computer Systems (TOCS)
Tolerating latency in multiprocessors through compiler-inserted prefetching

ACM Transactions on Computer Systems (TOCS)
A Cost and Speed Model for k-ary n-Cube Wormhole Routers

IEEE Transactions on Parallel and Distributed Systems
Digital system simulation: methodologies and examples

DAC '98 Proceedings of the 35th annual Design Automation Conference
A study of three dynamic approaches to handle widely shared data in shared-memory multiprocessors

ICS '98 Proceedings of the 12th international conference on Supercomputing
Informing memory operations: memory performance feedback mechanisms and their applications

ACM Transactions on Computer Systems (TOCS)
The DASH prototype: implementation and performance

25 years of the international symposia on Computer architecture (selected papers)
The Stanford FLASH multiprocessor

25 years of the international symposia on Computer architecture (selected papers)
Tempest and typhoon: user-level shared memory

25 years of the international symposia on Computer architecture (selected papers)
Hardware Support for Flexible Distributed Shared Memory

IEEE Transactions on Computers
Wormhole routing techniques for directly connected multicomputer systems

ACM Computing Surveys (CSUR)
Automatic Compiler-Inserted Prefetching for Pointer-Based Applications

IEEE Transactions on Computers - Special issue on cache memory and related problems
A Quantitative Analysis of the Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols

IEEE Transactions on Computers - Special issue on cache memory and related problems
Coherence Controller Architectures for Scalable Shared-Memory Multiprocessors

IEEE Transactions on Computers - Special issue on cache memory and related problems
Excel-NUMA: Toward Programmability, Simplicity, and High Performance

IEEE Transactions on Computers - Special issue on cache memory and related problems
An Application-Driven Study of Parallel System Overheads and Network Bandwidth Requirements

IEEE Transactions on Parallel and Distributed Systems
Is SC + ILP = RC?

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Multicast snooping: a new coherence method using a multicast address network

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Comparing the memory system performance of the HP V-class and SGI Origin 2000 multiprocessors using microbenchmarks and scientific applications

ICS '99 Proceedings of the 13th international conference on Supercomputing
Formal verification in hardware design: a survey

ACM Transactions on Design Automation of Electronic Systems (TODAES)
The QRQW PRAM: accounting for contention in parallel algorithms

SODA '94 Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms
SP2 system architecture

IBM Systems Journal
Teapot: A Domain-Specific Language for Writing Cache Coherence Protocols

IEEE Transactions on Software Engineering
PiSMA: a parallel VSM architecture

Crossroads
Improving parallel system performance by changing the arrangement of the network links

Proceedings of the 14th international conference on Supercomputing
An asynchronous protocol for release consistent distributed shared memory systems

SAC '00 Proceedings of the 2000 ACM symposium on Applied computing - Volume 2
Comparing the effectiveness of fine-grain memory caching against page migration/replication in reducing traffic in DSM clusters

Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures
A Class of Highly Scalable Optical Crossbar-Connected Interconnection Networks (SOCNs) for Parallel Computing Systems

IEEE Transactions on Parallel and Distributed Systems
The Odd-Even Turn Model for Adaptive Routing

IEEE Transactions on Parallel and Distributed Systems
Design and Evaluation of a Switch Cache Architecture for CC-NUMA Multiprocessors

IEEE Transactions on Computers
Dynamic Task Scheduling Using Online Optimization

IEEE Transactions on Parallel and Distributed Systems
Exploiting Network Locality for CC-NUMA Multiprocessors

The Journal of Supercomputing
Efficient schemes to scale the interconnection network bandwidth in a ring-based multiprocessor system

Proceedings of the 2001 ACM symposium on Applied computing
Parallelizing the Murϕ Verifier

Formal Methods in System Design - Special issue on CAV '97
ADir_pNB: A Cost-Effective Way to Implement Full Map Directory-Based Cache Coherence Protocols

IEEE Transactions on Computers
A Cost-Effective Approach to Deadlock Handling in Wormhole Networks

IEEE Transactions on Parallel and Distributed Systems
A Fast and Efficient Processor Allocation Scheme for Mesh-Connected Multicomputers

IEEE Transactions on Computers
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Deriving Efficient Cache Coherence Protocols Through Refinement

Formal Methods in System Design
Modeling of interconnection subsystems for massively parallel computers

Performance Evaluation
Tolerating node failures in cache only memory architectures

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Application-specific protocols for user-level shared memory

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Compiler Support for Array Distribution on NUMA Shared Memory Multiprocessors

The Journal of Supercomputing
Achieving Robustness and Minimizing Overhead in Parallel Algorithms Through Overlapped Communication/Computation

The Journal of Supercomputing - Special issue on embedded fault-tolerance systems
An Application-Driven Study of Multicast Communication for Write Invalidation

The Journal of Supercomputing
Load Balancing for Parallel Query Execution on NUMA Multiprocessors

Distributed and Parallel Databases
A Simulation Study of Hardware-Oriented DSM Approaches

IEEE Parallel & Distributed Technology: Systems & Technology
Distributed Shared Memory: Concepts and Systems

IEEE Parallel & Distributed Technology: Systems & Technology
Jade: A High-Level, Machine-Independent Language for Parallel Programming

Computer
COOL: An Object-Based Language for Parallel Programming

Computer
RPM: A Rapid Prototyping Engine for Multiprocessor Systems

Computer
Application Performance on the MIT Alewife Machine

Computer
Boosting the Performance of Shared Memory Multiprocessors

Computer
Performance Analysis of Cluster-Based Multiprocessors

IEEE Transactions on Computers
The DASH Prototype: Logic Overhead and Performance

IEEE Transactions on Parallel and Distributed Systems
Performance Analysis of Mesh Interconnection Networks with Deterministic Routing

IEEE Transactions on Parallel and Distributed Systems
Sequential Hardware Prefetching in Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Performance Analysis of Four Memory Consistency Models for Multithreaded Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Packet Synchronization for Synchronous Optical Deflection-Routed Interconnection Networks

IEEE Transactions on Parallel and Distributed Systems
Encapsulation of Parallelism and Architecture-Independence in Extensible Database Query Execution

IEEE Transactions on Software Engineering
A General Data Layout for Distributed Consistency in Data Parallel Applications

HiPC '02 Proceedings of the 9th International Conference on High Performance Computing
Network Performance under Physical Constraints

ICPP '97 Proceedings of the international Conference on Parallel Processing
Kiloprocessor Extensions to SCI

IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Benefits of Processor Clustering in Designing Large Parallel Systems: When and How?

IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Deadlock- and Livelock-Free Routing Protocols for Wave Switching

IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
SmartApps: An Application Centric Approach to High Performance Computing: Compiler-Assisted Software and Hardware Support for Reduction Operations

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
An Architecture and Task Scheduling Algorithm for Systems Based on Dynamically Reconfigurable Shared Memory Clusters

IWCC '01 Proceedings of the NATO Advanced Research Workshop on Advanced Environments, Tools, and Applications for Cluster Computing-Revised Papers
Embedded Cluster Computing through Dynamic Reconfigurability of Inter-Processor Connections

IWCC '01 Proceedings of the NATO Advanced Research Workshop on Advanced Environments, Tools, and Applications for Cluster Computing-Revised Papers
A High-Level Programming Environment for Distributed Memory Architectures

PaCT '99 Proceedings of the 5th International Conference on Parallel Computing Technologies
A Parallel System Architecture Based on Dynamically Configurable Shared Memory Clusters

PPAM '01 Proceedings of the International Conference on Parallel Processing and Applied Mathematics-Revised Papers
Task Scheduling for Dynamically Configurable Multiple SMP Clusters Based on Extended DSC Approach

PPAM '01 Proceedings of the International Conference on Parallel Processing and Applied Mathematics-Revised Papers
How Can We Design Better Networks for DSM Systems?

PCRCW '97 Proceedings of the Second International Workshop on Parallel Computer Routing and Communication
A Compiler-Assisted Scheme for Adaptive Cache Coherence Enforcement

PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques
Measuring Consistency Costs for Distributed Shared Data

LCR '00 Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
Active Memory Clusters: Efficient Multiprocessing on Commodity Clusters

ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
SMP system interconnect instrumentation for performance analysis

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Efficient synchronization for nonuniform communication architectures

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Processor Allocation in the Mesh Multiprocessors Using the Leapfrog Method

IEEE Transactions on Parallel and Distributed Systems
Cluster Queue Structure for Shared-Memory Multiprocessor Systems

The Journal of Supercomputing
Inferential queueing and speculative push for reducing critical communication latencies

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
MORPH: a system architecture for robust high performance using customization (an NSF 100 TeraOps point design study)

FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
A Low-Complexity Parallel System of Gracious, Scalable Performance Case Study for Near PetaFLOPS Computing

FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Complexity and Performance in Parallel Programming Languages

HIPS '97 Proceedings of the 1997 Workshop on High-Level Programming Models and Supportive Environments (HIPS '97)
Efficient and balanced adaptive routing in two-dimensional meshes

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Abstracting network characteristics and locality properties of parallel systems

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Software cache coherence for large scale multiprocessors

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
A Shared-bus Control Mechanism and a Cache Coherence Protocol for a High-performance On-chip Multiprocessor

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
A Cache Coherency Protocol for Optically Connected Parallel Computer Systems

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Using memory-mapped network interfaces to improve the performance of distributed shared memory

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Performance Evaluation of a Cluster-Based Multiprocessor Built from ATM Switches and Bus-Based Multiprocessor Servers

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Hierarchical Backoff Locks for Nonuniform Communication Architectures

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Using cache optimizing compiler for managing software cache on distributed shared memory system

HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Trojan: A High-Performance Simulator for Shared Memory Architectures

SS '96 Proceedings of the 29th Annual Simulation Symposium (SS '96)
The Thread-Based Protocol Engines for CC-NUMA Multiprocessors

ICPP '00 Proceedings of the 2000 International Conference on Parallel Processing
Modeling and evaluating the time overhead induced by BER in COMA multiprocessors

Journal of Systems Architecture: the EUROMICRO Journal
Token coherence: decoupling performance and correctness

Proceedings of the 30th annual international symposium on Computer architecture
(De-) Clustering Objects for Multiprocessor System Software

IWOOOS '95 Proceedings of the 4th International Workshop on Object-Orientation in Operating Systems
Latency, Occupancy, and Bandwidth in DSM Multiprocessors: A Performance Evaluation

IEEE Transactions on Computers
FC3D: Flow Control-Based Distributed Deadlock Detection Mechanism for True Fully Adaptive Routing in Wormhole Networks

IEEE Transactions on Parallel and Distributed Systems
References

Sourcebook of parallel computing
The Impact of Negative Acknowledgments in Shared Memory Scientific Applications

IEEE Transactions on Parallel and Distributed Systems
Stateful distributed interposition

ACM Transactions on Computer Systems (TOCS)
Characterization and Evaluation of Cache Hierarchies for Web Servers

World Wide Web
SMTp: An Architecture for Next-generation Scalable Multi-threading

Proceedings of the 31st annual international symposium on Computer architecture
Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams

Proceedings of the 31st annual international symposium on Computer architecture
Exploring Virtual Network Selection Algorithms in DSM Cache Coherence Protocols

IEEE Transactions on Parallel and Distributed Systems
An Architecture for High-Performance Scalable Shared-Memory Multiprocessors Exploiting On-Chip Integration

IEEE Transactions on Parallel and Distributed Systems
CAS-DSM: a compiler assisted software distributed shared memory

International Journal of Parallel Programming
Towards scalable collective communication for multicomputer interconnection networks

Information Sciences: an International Journal - Special issue: Information technology
Coherence decoupling: making use of incoherence

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Speculative Incoherent Cache Protocols

IEEE Micro
Prediction model for evaluation of reconfigurable interconnects in distributed shared-memory systems

Proceedings of the 2005 international workshop on System level interconnect prediction
Traffic Temporal Analysis for Reconfigurable Interconnects in Shared-Memory Systems

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 3 - Volume 04
Sequential consistency and the lazy caching algorithm

Distributed Computing - Special issue: Verification of lazy caching
Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling

Proceedings of the 32nd annual international symposium on Computer Architecture
Formal Verification and its Impact on the Snooping versus Directory Protocol Debate

ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Congestion modeling for reconfigurable inter-processor networks

Proceedings of the 2006 international workshop on System-level interconnect prediction
Inferential queueing and speculative push

International Journal of Parallel Programming - Special issue I: The 17th annual international conference on supercomputing (ICS'03)
Optimal and efficient parallel tridiagonal solvers using direct methods

The Journal of Supercomputing - Special issue: Parallel and distributed processing and applications
An experimental evaluation of the HP V-class and SGI origin 2000 multiprocessors using microbenchmarks and scientific applications

International Journal of Parallel Programming
Efficiently generating test vectors with state pruning

Proceedings of the 2005 Asia and South Pacific Design Automation Conference
An efficient cache design for scalable glueless shared-memory multiprocessors

Proceedings of the 3rd conference on Computing frontiers
Symmetry in temporal logic model checking

ACM Computing Surveys (CSUR)
A plane-based broadcast algorithm for multicomputer networks

Journal of Systems Architecture: the EUROMICRO Journal
On balancing network traffic in path-based multicast communication

Future Generation Computer Systems - Systems performance analysis and evaluation
Tight Bounds for Critical Sections in Processor Consistent Platforms

IEEE Transactions on Parallel and Distributed Systems
Support for High-Frequency Streaming in CMPs

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
K42: building a complete operating system

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
The psi-cube: a bus-based cube-type clustering network for high-performance on-chip systems

Parallel Computing
Probabilistic analysis on mesh network fault tolerance

Journal of Parallel and Distributed Computing
Predicting reconfigurable interconnect performance in distributed shared-memory systems

Integration, the VLSI Journal
Proximity-aware directory-based coherence for multi-core processor architectures

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
BulkSC: bulk enforcement of sequential consistency

Proceedings of the 34th annual international symposium on Computer architecture
Comparison of Mesh and Hierarchical Networks for Multiprocessors

ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 01
Reducing the Write Traffic for a Hybrid Cache Protocol

ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 01
The Power of Priority: NoC Based Distributed Cache Coherency

NOCS '07 Proceedings of the First International Symposium on Networks-on-Chip
The design and evaluation of a shared object system for distributed memory machines

OSDI '94 Proceedings of the 1st USENIX conference on Operating Systems Design and Implementation
Brazos: a third generation DSM system

NT'97 Proceedings of the USENIX Windows NT Workshop 1997
Experience with a language for writing coherence protocols

DSL'97 Proceedings of the Conference on Domain-Specific Languages (DSL), 1997
Experience distributing objects in an SMMP OS

ACM Transactions on Computer Systems (TOCS)
The mechanics of in-kernel synchronization for a scalable microkernel

ACM SIGOPS Operating Systems Review
An accurate performance model of fully adaptive routing in wormhole-switched two-dimensional mesh multicomputers

Microprocessors & Microsystems
Cache coherency communication cost in a NoC-based MPSoC platform

Proceedings of the 20th annual conference on Integrated circuits and systems design
Performance of deterministic and adaptive broadcast algorithms in multicomputer networks

International Journal of High Performance Computing and Networking
A case for low-complexity MP architectures

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
IBM POWER6 microarchitecture

IBM Journal of Research and Development
Investigating solution convergence in a global ocean model using a 2048-processor cluster of distributed shared memory machines

Scientific Programming - High Performance Computing for Mission-Enabling Space Applications
A Retrospective on Murφ

25 Years of Model Checking
Two proposals for the inclusion of directory information in the last-level private caches of glueless shared-memory multiprocessors

Journal of Parallel and Distributed Computing
Dynamic security domain scaling on embedded symmetric multiprocessors

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Disaggregated memory for expansion and sharing in blade servers

Proceedings of the 36th annual international symposium on Computer architecture
A Novel Cache Organization for Tiled Chip Multiprocessor

APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
Impact of level-2 cache sharing on the performance and power requirements of homogeneous multicore embedded systems

Microprocessors & Microsystems
A tuneable software cache coherence protocol for heterogeneous MPSoCs

CODES+ISSS '09 Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis
High-throughput coherence control and hardware messaging in everest

IBM Journal of Research and Development
Fault-tolerant mapping of a mesh network in a flexible hypercube

WSEAS Transactions on Computers
Lower bounds on the connectivity probability for 2-D mesh networks

WiCOM'09 Proceedings of the 5th International Conference on Wireless communications, networking and mobile computing
On-chip communication and synchronization mechanisms with cache-integrated network interfaces

Proceedings of the 7th ACM international conference on Computing frontiers
Fault-tolerant meshes and tori embedded in a faulty supercube

WSEAS Transactions on Computers
Scalable hardware support for conditional parallelization

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
SICOSYS: an integrated framework for studying interconnection network performance in multiprocessor systems

EUROMICRO-PDP'02 Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing
Atomic operations for task scheduling for systems based on communication on-the-fly between SMP clusters

ISPDC'03 Proceedings of the Second international conference on Parallel and distributed computing
Dynamic SMP clusters with communication on the fly

ISPDC'03 Proceedings of the Second international conference on Parallel and distributed computing
Architectural support for thread communications in multi-core processors

Parallel Computing
Architectural Support for Fair Reader-Writer Locking

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Performance comparison of some shared memory organizations for 2D mesh-like NOCs

Microprocessors & Microsystems
Efficient synchronization for embedded on-chip multiprocessors

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Embedding meshes into twisted-cubes

Information Sciences: an International Journal
Simulation of meshes and tori in an asymmetric faulty incrementally extensible hypercube with unbounded expansion

WSEAS Transactions on Information Science and Applications
Probabilistic analysis of time reduction by eliminating barriers in parallel programmes

International Journal of Communication Networks and Distributed Systems
On fault-tolerant embedding of meshes and tori in a flexible hypercube with unbounded expansion

WSEAS Transactions on Systems
A hardware supported multicast scheme based on XY routing for 2-D mesh InfiniBand networks

The Journal of Supercomputing
View-Oriented parallel programming and view-based consistency

PDCAT'04 Proceedings of the 5th international conference on Parallel and Distributed Computing: applications and Technologies
Upper bounds on the connection probability for 2-D meshes and tori

Journal of Parallel and Distributed Computing
Speeding-up synchronizations in DSM multiprocessors

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
On consistency of encrypted files

DISC'06 Proceedings of the 20th international conference on Distributed Computing
Barrier elimination based on access dependency analysis for OpenMP

ISPA'06 Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications
Improving coherence protocol reactiveness by trading bandwidth for latency

Proceedings of the 9th conference on Computing Frontiers
Distributed memory virtualization with the use of SDDSfL

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part II
LIGERO: A light but efficient router conceived for cache-coherent chip multiprocessors

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Reducing Virtual-to-Physical address translation overhead in Distributed Shared Memory based multi-core Network-on-Chips according to data property

Computers and Electrical Engineering
Exploring topologies for source-synchronous ring-based network-on-chip

Proceedings of the Conference on Design, Automation and Test in Europe
A heterogeneous multiple network-on-chip design: an application-aware approach

Proceedings of the 50th Annual Design Automation Conference
Location-aware cache management for many-core processors with deep cache hierarchy

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Deflection routing in 3D network-on-chip with limited vertical bandwidth

ACM Transactions on Design Automation of Electronic Systems (TODAES) - Special Section on Networks on Chip: Architecture, Tools, and Methodologies
OCTET: capturing and controlling cross-thread dependences efficiently

Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications
Using in-flight chains to build a scalable cache coherence protocol

ACM Transactions on Architecture and Code Optimization (TACO)
Scale-out NUMA

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems

Quantified Score

Hi-index	4.13

Abstract

The overall goals and major features of the directory architecture for shared memory (Dash) are presented. The fundamental premise behind the architecture is that it is possible to build a scalable high-performance machine with a single address space and coherent caches. The Dash architecture is scalable in that it achieves linear or near-linear performance growth as the number of processors increases from a few to a few thousand. This performance results from distributing the memory among processing nodes and using a network with scalable bandwidth to connect the nodes. The architecture allows shared data to be cached, significantly reducing the latency of memory accesses and yielding higher processor utilization and higher overall performance. A distributed directory-based protocol that provides cache coherence without compromising scalability is discussed in detail. The Dash prototype machine and the corresponding software support are described.