WAYPOINT: scaling coherence to thousand-core architectures

Authors:
John H. Kelm;Matthew R. Johnson;Steven S. Lumetta;Sanjay J. Patel
Affiliations:
University of Illinois, Urbana, IL, USA;University of Illinois, Urbana, IL, USA;University of Illinois, Urbana, IL, USA;University of Illinois, Urbana, IL, USA
Venue:
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Year:
2010

Abstract

In this paper, we evaluate a set of coherence architectures in the context of a 1024-core chip multiprocessor (CMP) tailored to throughput-oriented parallel workloads. Based on our analysis, we develop and evaluate two techniques for scaling coherence to thousand-core CMPs. We find that a broadcast-based probe filtering scheme provides reasonable performance up to 128 cores for some benchmarks, but is not generally scalable. We propose a broadcast-collective network for accelerating probe filter misses, which extends scalability but falls short of supporting 1024 cores. We find that a sparse directory with an invalidate-on-evict policy can work well for many throughput-oriented workloads. However, the on-die structures required to achieve good performance carry a large area and power overhead. To achieve thousand-core scalability with smaller and less associative sparse directories, we introduce WayPoint, a mechanism that increases directory associativity and capacity dynamically. Using less than 3% of total die area, WayPoint achieves performance within 4% of an infinitely large on-die directory.
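
For readers unfamiliar with the term, the following is a minimal sketch of how a generic set-associative sparse directory with an invalidate-on-evict policy might behave. It is not the authors' implementation and does not model WayPoint itself. The class name `SparseDirectory`, the modulo set indexing, and the LRU replacement policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class SparseDirectory:
    """Toy set-associative sparse directory (illustrative assumption only)."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One LRU-ordered map per set: line address -> set of sharer core IDs.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        # Log of (line_addr, core_id) invalidation probes the directory sent.
        self.invalidations = []

    def record_sharer(self, line_addr, core_id):
        """Track that core_id now caches line_addr."""
        entries = self.sets[line_addr % self.num_sets]
        if line_addr in entries:
            entries.move_to_end(line_addr)  # refresh LRU position
        else:
            if len(entries) >= self.ways:
                # Invalidate-on-evict: the directory can no longer track the
                # victim line, so every cached copy of it must be invalidated.
                victim, sharers = entries.popitem(last=False)
                for sharer in sorted(sharers):
                    self.invalidations.append((victim, sharer))
            entries[line_addr] = set()
        entries[line_addr].add(core_id)

    def sharers(self, line_addr):
        """Return the set of cores currently tracked as caching line_addr."""
        return set(self.sets[line_addr % self.num_sets].get(line_addr, ()))


# Lines 0, 2, and 4 all map to set 0 of a 2-set, 2-way directory.
d = SparseDirectory(num_sets=2, ways=2)
d.record_sharer(0, core_id=1)
d.record_sharer(0, core_id=3)
d.record_sharer(2, core_id=5)
d.record_sharer(4, core_id=7)  # set 0 is full: line 0 is evicted
```

In this toy run the third distinct line forces line 0 out of the directory, so invalidations go to cores 1 and 3 even though their caches may still have had room. Conflict evictions of this kind are why a small, low-associativity directory can hurt performance, which is the pressure the abstract says WayPoint relieves by increasing associativity and capacity dynamically.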