A case for low-complexity MP architectures

Authors:
Håkan Zeffer; Erik Hagersten
Affiliations:
Uppsala University, Uppsala, Sweden (both authors)
Venue:
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Year:
2007



Abstract

Advances in semiconductor technology have driven shared-memory servers toward processors with multiple cores per die and multiple threads per core. This paper presents simple hardware primitives that enable flexible, low-complexity multi-chip designs in which an efficient inter-node coherence protocol is implemented in software. We argue that, compared with traditional designs, our primitives and the example design presented in this paper have lower hardware overhead, impose verification requirements that are easier to meet and can be met later, and allow for flexible coherence protocols and simpler correction of protocol bugs. Our evaluation is based on detailed full-system simulations of modern chip multiprocessors running both commercial and HPC workloads. We compare a low-complexity system based on the proposed primitives with aggressive hardware-based multi-chip shared-memory systems and show that its performance is competitive across a large design space.