Memory coherence in shared virtual memory systems
ACM Transactions on Computer Systems (TOCS)
Implementation and performance of Munin
SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
The Stanford Dash Multiprocessor
Computer
A tightly-coupled processor-network interface
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Cooperative shared memory: software and hardware for scalable multiprocessor
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Mechanisms for cooperative shared memory
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
The Wisconsin Wind Tunnel: virtual prototyping of parallel computers
SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Parallel programming in Split-C
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Virtual memory mapped network interface for the SHRIMP multicomputer
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Integration of message passing and shared memory in the Stanford FLASH multiprocessor
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The performance impact of flexibility in the Stanford FLASH multiprocessor
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Fine-grain access control for distributed shared memory
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
EEL: machine-independent executable editing
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Efficient support for irregular applications on distributed-memory machines
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Remote queues: exposing message queues for optimization and atomicity
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
The MIT Alewife machine: architecture and performance
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Efficient strategies for software-only protocols in shared-memory multiprocessors
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Application-specific protocols for user-level shared memory
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Cost-Effective Parallel Computing
Computer
A Case for NOW (Networks of Workstations)
IEEE Micro
IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
Software cache coherence for large scale multiprocessors
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Protected, user-level DMA for the SHRIMP network interface
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
CRL: High Performance All-Software Distributed Shared Memory
CRL: High Performance All-Software Distributed Shared Memory
Coherent network interfaces for fine-grain communication
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Shasta: a low overhead, software-only approach for supporting fine-grain shared memory
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Hiding communication latency and coherence overhead in software DSMs
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Trap-driven memory simulation with Tapeworm II
ACM Transactions on Modeling and Computer Simulation (TOMACS)
Trace-driven memory simulation: a survey
ACM Computing Surveys (CSUR)
Optimizing communication in HPF programs on fine-grain distributed shared memory
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Relaxed consistency and coherence granularity in DSM systems: a performance evaluation
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Efficient synchronization: let them eat QOLB
Proceedings of the 24th annual international symposium on Computer architecture
Coherence controller architectures for SMP-based CC-NUMA multiprocessors
Proceedings of the 24th annual international symposium on Computer architecture
Analytic evaluation of shared-memory systems with ILP processors
Proceedings of the 25th annual international symposium on Computer architecture
Retrospective: tempest and typhoon: user-level shared memory
25 years of the international symposia on Computer architecture (selected papers)
Hardware Support for Flexible Distributed Shared Memory
IEEE Transactions on Computers
Coherence Controller Architectures for Scalable Shared-Memory Multiprocessors
IEEE Transactions on Computers - Special issue on cache memory and related problems
Accelerating shared virtual memory via general-purpose network interface support
ACM Transactions on Computer Systems (TOCS)
Optimizing software cache-coherent cluster architectures
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Communication overlap in multi-tier parallel algorithms
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
An Application-Driven Study of Multicast Communication for Write Invalidation
The Journal of Supercomputing
Hardware Versus Software Implementation of COMA
ICPP '97 Proceedings of the international Conference on Parallel Processing
Trace-Driven Memory Simulation: A Survey
Performance Evaluation: Origins and Directions
Compilation and Runtime-Optimizations for Software Distributed Shared Memory
LCR '00 Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
Processor Mechanisms for Software Shared Memory
ISHPC '00 Proceedings of the Third International Symposium on High Performance Computing
The Thread-Based Protocol Engines for CC-NUMA Multiprocessors
ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Latency, Occupancy, and Bandwidth in DSM Multiprocessors: A Performance Evaluation
IEEE Transactions on Computers
SMTp: An Architecture for Next-generation Scalable Multi-threading
Proceedings of the 31st annual international symposium on Computer architecture
Journal of Systems Architecture: the EUROMICRO Journal
Evaluating scheduling policies for fine-grain communication protocols on a cluster of SMPs
Journal of Parallel and Distributed Computing
TMA: a trap-based memory architecture
Proceedings of the 20th annual international conference on Supercomputing
A case for low-complexity MP architectures
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Hi-index | 0.01 |
This paper investigates hardware support for fine-grain distributed shared memory (DSM) in networks of workstations. To reduce design time and implementation cost relative to dedicated DSM systems, we decouple the functional hardware components of DSM support, allowing greater use of off-the-shelf devices.We present two decoupled systems, Typhoon-0 and Typhoon-1. Typhoon-0 uses an off-the-shelf protocol processor and network interface; a custom access control device is the only DSM-specific hardware. To demonstrate the feasibility and simplicity of this access control device, we designed and built an FPGA-based version in under one year. Typhoon-1 also uses an off-the-shelf protocol processor, but integrates the network interface and access control devices for higher performance.We compare the performance of the two decoupled systems with two integrated systems via simulation. For six benchmarks on 32 nodes, Typhoon-0 ranges from 30% to 309% slower than the best integrated system, while Typhoon-1 ranges from 13% to 132% slower. Four of the six benchmarks achieve speedups of 12 to 18 on Typhoon-0 and 15 to 26 on Typhoon-1, compared with 19 to 35 on the best integrated system. Two benchmarks are hampered by high communication overheads, but selectively replacing shared-memory operations with message passing provides speedups of at least 16 on both decoupled systems. These speedups indicate that decoupled designs can potentially provide a cost-effective alternative to complex high-end DSM systems.