Implementation and performance of Munin
SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
The Stanford Dash Multiprocessor
Computer
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Fine-grain access control for distributed shared memory
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Simultaneous multithreading: maximizing on-chip parallelism
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Decoupled hardware support for distributed shared memory
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Informing memory operations: providing memory performance feedback in modern processors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
STiNG: a CC-NUMA computer system for the commercial marketplace
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
The case for a single-chip multiprocessor
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
The SGI Origin: a ccNUMA highly scalable server
Proceedings of the 24th annual international symposium on Computer architecture
The directory-based cache coherence protocol for the DASH multiprocessor
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
Speculative lock elision: enabling highly concurrent multithreaded execution
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Sequential Hardware Prefetching in Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
An Evaluation of a Commercial CC-NUMA Architecture: The CONVEX Exemplar SPP1200
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
The evolution of the HP/Convex Exemplar
COMPCON '97 Proceedings of the 42nd IEEE International Computer Conference
Design Alternatives for Shared Memory Multiprocessors
HIPC '98 Proceedings of the Fifth International Conference on High Performance Computing
WildFire: A Scalable Path for SMPs
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Proceedings of the 30th annual international symposium on Computer architecture
SMTp: An Architecture for Next-generation Scalable Multi-threading
Proceedings of the 31st annual international symposium on Computer architecture
TMA: a trap-based memory architecture
Proceedings of the 20th annual international conference on Supercomputing
POWER4 system microarchitecture
IBM Journal of Research and Development
Micro-benchmarks for cluster OpenMP implementations: memory consistency costs
IWOMP'08 Proceedings of the 4th international conference on OpenMP in a new era of parallelism
Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks
Proceedings of the 38th annual international symposium on Computer architecture
Hi-index | 0.00 |
Advances in semiconductor technology have driven shared-memory servers toward processors with multiple cores per die and multiple threads per core. This paper presents simple hardware primitives enabling flexible and low-complexity multi-chip designs supporting an efficient inter-node coherence protocol implemented in software. We argue that our primitives and the example design presented in this paper have lower hardware overhead, have easier (and later) verification requirements, and provide the opportunity for flexible coherence protocols and simpler protocol bug corrections than traditional designs. Our evaluation is based on detailed full-system simulations of modern chip-multiprocessors and both commercial and HPC workloads. We compare a low-complexity system based on the proposed primitives with aggressive hardware multi-chip shared-memory systems and show that the performance is competitive across a large design space.