Comparative performance evaluation of cache-coherent NUMA and COMA architectures
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Closing the window of vulnerability in multiphase memory transactions
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Evaluation of release consistent software distributed shared memory on emerging network technology
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Virtual memory mapped network interface for the SHRIMP multicomputer
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Software-extended coherent shared memory: performance and cost
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Fine-grain access control for distributed shared memory
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
COMA-F: a non-hierarchical cache only memory architecture
COMA-F: a non-hierarchical cache only memory architecture
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Efficient strategies for software-only protocols in shared-memory multiprocessors
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Shasta: a low overhead, software-only approach for supporting fine-grain shared memory
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Reactive NUMA: a design for unifying S-COMA and CC-NUMA
Proceedings of the 24th annual international symposium on Computer architecture
Tolerating late memory traps in ILP processors
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Clock rate versus IPC: the end of the road for conventional microarchitectures
Proceedings of the 27th annual international symposium on Computer architecture
Parallel Computer Architecture: A Hardware/Software Approach
Parallel Computer Architecture: A Hardware/Software Approach
Software cache coherence for large scale multiprocessors
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Reducing Remote Conflict Misses: NUMA with Remote Cache versus COMA
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Latency, Occupancy, and Bandwidth in DSM Multiprocessors: A Performance Evaluation
IEEE Transactions on Computers
Reflections on the memory wall
Proceedings of the 1st conference on Computing frontiers
CCGRID '05 Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05) - Volume 2 - Volume 02
Comparing latency-tolerance techniques for software DSM systems
IEEE Transactions on Parallel and Distributed Systems
Hi-index | 0.00 |
Distributed Shared-Memory (DSM) systems are shared-memory multiprocessor architectures in which each processor node contains a partition of the shared memory. In hybrid DSM systems coherence among caches is maintained by a software-implemented coherence protocol relying on some hardware support. Hardware support is provided to satisfy every node hit (the common case) and software is invoked only for accesses to remote nodes. In this paper we compare the design and performance of four hybrid distributed shared memory (DSM) organizations by detailed simulation of the same hardware platform. We have implemented the software protocol handlers for the four architectures. The handlers are written in C and assembly code. Coherence transactions are executed in trap and interrupt handlers. Together with the application, the handlers are executed in full detail in execution-driven simulations of six complete benchmarks with coarse-grain and fine-grain sharing. We relate our experience implementing and simulating the software protocols for the four architectures. Because the overhead of remote accesses is very high in hybrid systems, the system of choice is different than for purely hardware systems.