An Architecture for High-Performance Scalable Shared-Memory Multiprocessors Exploiting On-Chip Integration

Authors:
Manuel E. Acacio;Jose Gonzalez;Jose M. Garcia;Jose Duato
Affiliations:
-;IEEE Computer Society;IEEE;IEEE
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2004

Citing 30
Cited 7

An evaluation of directory schemes for cache coherence

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
LimitLESS directories: A scalable cache coherence scheme

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The Stanford Dash Multiprocessor

Computer
SPLASH: Stanford parallel applications for shared-memory

ACM SIGARCH Computer Architecture News
Cache Invalidation Patterns in Shared-Memory Multiprocessors

IEEE Transactions on Computers
The Stanford FLASH multiprocessor

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
STiNG: a CC-NUMA computer system for the commercial marketplace

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
The SGI Origin: a ccNUMA highly scalable server

Proceedings of the 24th annual international symposium on Computer architecture
An empirical evaluation of two memory-efficient directory methods

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Selective, accurate, and timely self-invalidation using last-touch prediction

Proceedings of the 27th annual international symposium on Computer architecture
Piranha: a scalable architecture based on single-chip multiprocessing

Proceedings of the 27th annual international symposium on Computer architecture
Timestamp snooping: an approach for extending SMPs

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Parallel Computer Architecture: A Hardware/Software Approach

Parallel Computer Architecture: A Hardware/Software Approach
Boosting the Performance of Shared Memory Multiprocessors

Computer
Multiprocessors Should Support Simple Memory-Consistency Models

Computer
Starfire: Extending the SMP Envelope

IEEE Micro
The Stanford Hydra CMP

IEEE Micro
Introduction to the Special Section on High Performance Memory Systems

IEEE Transactions on Computers
A Novel Approach to Reduce L2 Miss Latency in Shared-Memory Multiprocessors

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Owner prediction for accelerating cache-to-cache transfer misses in a cc-NUMA architecture

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
An overview of the BlueGene/L Supercomputer

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Design and Performance of Directory Caches for Scalable Shared Memory Multiprocessors

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Improving CC-NUMA Performance Using Instruction-Based Prediction

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Switch Cache: A Framework for Improving the Remote Memory Access Latency of CC-NUMA Multiprocessors

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Using Switch Directories to Speed Up Cache-to-Cache Transfers in CC-NUMA Multiprocessors

IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors

Proceedings of the 30th annual international symposium on Computer architecture
A New Scalable Directory Architecture for Large-Scale Multiprocessors

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
High-throughput coherence control and hardware messaging in everest

IBM Journal of Research and Development
POWER4 system microarchitecture

IBM Journal of Research and Development

An efficient cache design for scalable glueless shared-memory multiprocessors

Proceedings of the 3rd conference on Computing frontiers
A consistency architecture for hierarchical shared caches

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Two proposals for the inclusion of directory information in the last-level private caches of glueless shared-memory multiprocessors

Journal of Parallel and Distributed Computing
Scalable directory architecture for distributed shared memory chip multiprocessors

ACM SIGARCH Computer Architecture News
An Efficient Lightweight Shared Cache Design for Chip Multiprocessors

APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
A Novel Cache Organization for Tiled Chip Multiprocessor

APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
A novel lightweight directory architecture for scalable shared-memory multiprocessors

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Recent technology improvements allow multiprocessor designers to put some key components inside the processor chip, such as the memory controller, the coherence hardware, and the network interface/router. In this paper, we exploit such integration scale, presenting a novel node architecture aimed at reducing the long L2 miss latencies and the memory overhead of using directories that characterize cc-NUMA machines and limit their scalability. Our proposal replaces the traditional directory with a novel three-level directory architecture, as well as it adds a small shared data cache to each of the nodes of a multiprocessor system. Due to their small size, the first-level directory and the shared data cache are integrated into the processor chip in every node, which enhances performance by saving accesses to the slower main memory. Scalability is guaranteed by having the second and third-level directories out of the processor chip and using compressed data structures. A taxonomy of the L2 misses, according to the actions performed by the directory to satisfy them, is also presented. Using execution-driven simulations, we show that significant latency reductions can be obtained by using the proposed node architecture, which translates into reductions of more than 30 percent in several cases in the application execution time.