XPoint cache: scaling existing bus-based coherence protocols for 2D and 3D many-core systems

Authors:
Ronald G. Dreslinski;Thomas Manville;Korey Sewell;Reetuparna Das;Nathaniel Pinckney;Sudhir Satpathy;David Blaauw;Dennis Sylvester;Trevor Mudge
Affiliations:
University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA
Venue:
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Year:
2012

Abstract

With multi-core processors now mainstream, the shift to many-core processors poses a new set of design challenges. In particular, the scalability of coherence protocols remains a significant challenge. While complex Network-on-Chip interconnect fabrics have been proposed and in some cases implemented, most of industry has slowly evolved existing coherence solutions to meet the needs of a growing number of cores. Industry's slow adoption of Network-on-Chip designs is in large part due to the significant effort needed to design and verify the system. However, simply scaling bus-based coherence is not straightforward either, because contention and latency on the bus increase at large core counts. This paper proposes a new architecture, XPoint, which scales to 64-core systems without modifying existing bus-based snooping coherence protocols. XPoint employs interleaved cache structures with detailed floorplanning and system analysis to reduce contention at high core counts. Results show that the XPoint system achieves, on average, 28x and 35x speedups over a single-core design on the Splash2 benchmarks for 32- and 64-core systems respectively (a 1.6x improvement over a 64-core conventional bus). XPoint is also evaluated as a 3D stacked system to further reduce bus latency. Results show 29x and 45x speedups for 32- and 64-core systems respectively (a 2.1x improvement over a 64-core conventional bus and within 8% of the speedup of a 64-core system with an ideal interconnect). Measurements also show that the XPoint system decreases bus contention of a 64-core system to only 13% higher than that of an 8-core design (a 29x improvement over a 64-core conventional bus).
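To put the reported speedups in perspective, the short sketch below converts them to parallel efficiency (speedup divided by core count). Only the four speedup figures come from the abstract; the efficiency metric, the function names, and the "2D"/"3D" labels are illustrative assumptions and are not part of the paper's own evaluation.

```python
# Speedups reported in the abstract (Splash2 average, relative to a single core).
# The "2D"/"3D" keys are labels chosen here for the planar and 3D stacked systems.
REPORTED_SPEEDUP = {
    ("2D", 32): 28.0,
    ("2D", 64): 35.0,
    ("3D", 32): 29.0,
    ("3D", 64): 45.0,
}


def parallel_efficiency(speedup: float, cores: int) -> float:
    """Return speedup divided by core count; 1.0 would be perfect linear scaling."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


def efficiency_table() -> dict:
    """Map each (layout, cores) configuration to its parallel efficiency."""
    return {
        key: parallel_efficiency(speedup, key[1])
        for key, speedup in REPORTED_SPEEDUP.items()
    }


if __name__ == "__main__":
    for (layout, cores), eff in efficiency_table().items():
        print(f"{layout} {cores}-core: {eff:.1%} parallel efficiency")
```

Under this simple metric the reported numbers correspond to efficiencies of about 87.5% and 54.7% for the 32- and 64-core planar systems, and about 90.6% and 70.3% for the 3D stacked ones, which suggests that most of the 3D benefit appears at 64 cores.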