An analysis of on-chip interconnection networks for large-scale chip multiprocessors

Authors:
Daniel Sanchez;George Michelogiannakis;Christos Kozyrakis
Affiliations:
Stanford University, Stanford, CA;Stanford University, Stanford, CA;Stanford University, Stanford, CA
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2010

Abstract

As the number of cores in chip multiprocessors (CMPs) grows rapidly with technology scaling, connecting the different components of a CMP in a scalable and efficient way becomes increasingly challenging. In this article, we explore the architectural-level implications of interconnection network design for CMPs with up to 128 fine-grain multithreaded cores. We evaluate and compare different network topologies using accurate simulation of the full chip, including the memory hierarchy and interconnect, and using a diverse set of scientific and engineering workloads. We find that the interconnect has a large impact on performance, as it is responsible for 60% to 75% of the miss latency. Latency, not bandwidth, is the primary performance constraint: even with many threads per core and workloads with high miss rates, networks with enough bandwidth can be implemented efficiently at the system scales we consider. Of the topologies we study, the flattened butterfly consistently outperforms the mesh and fat tree on all workloads, with performance advantages of up to 22%. We also show that considering the interconnect and memory hierarchy together when designing large-scale CMPs is crucial, and that neglecting either can lead to incorrect conclusions. Finally, the effect of the interconnect on overall performance becomes more important as the number of cores increases, making interconnection choices especially critical when scaling up.
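
To give a rough sense of why topology affects latency, the following sketch compares router-to-router hop counts in a k x k mesh and in a two-dimensional flattened butterfly. It is an illustration under simplifying assumptions, not the article's simulator or configuration: the function names are hypothetical, routing is assumed minimal, each flattened-butterfly router is assumed to link directly to every router in its row and column, and concentration, wire length, and router pipeline depth are ignored.

```python
from itertools import product


def mesh_hops(src, dst):
    """Minimal hops in a 2D mesh: the Manhattan distance between routers."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def flattened_butterfly_hops(src, dst):
    """Minimal hops in a 2D flattened butterfly (assumed layout).

    Every router is assumed to connect directly to all routers in its row
    and column, so a packet needs one hop per dimension that differs.
    """
    return int(src[0] != dst[0]) + int(src[1] != dst[1])


def average_hops(k, hop_fn):
    """Average hop count over all ordered pairs of distinct routers in a k x k grid."""
    nodes = list(product(range(k), repeat=2))
    pairs = [(s, d) for s in nodes for d in nodes if s != d]
    return sum(hop_fn(s, d) for s, d in pairs) / len(pairs)


if __name__ == "__main__":
    for k in (4, 8):
        mesh = average_hops(k, mesh_hops)
        fbfly = average_hops(k, flattened_butterfly_hops)
        print(f"k={k}: mesh avg hops={mesh:.2f}, flattened butterfly avg hops={fbfly:.2f}")
```

Under these assumptions the flattened butterfly never needs more than two router hops, while the mesh hop count grows with the grid size. Hop count is only one component of network latency, so the sketch does not reproduce the performance figures reported above.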