Interconnect design considerations for large NUCA caches

Authors:
Naveen Muralimanohar;Rajeev Balasubramonian
Affiliations:
University of Utah, Salt Lake City, UT;University of Utah, Salt Lake City, UT
Venue:
Proceedings of the 34th annual international symposium on Computer architecture
Year:
2007

Abstract

The ever-increasing size of on-chip caches and the growing dominance of wire delay necessitate significant changes to cache hierarchy design methodologies. Many recent proposals advocate splitting the cache into a large number of banks and employing a network-on-chip (NoC) to allow fast access to nearby banks, an organization referred to as Non-Uniform Cache Architecture (NUCA). Most studies of NUCA organizations have assumed a generic NoC and focused on logical policies for cache block placement, movement, and search. Since wire and router delay and power are major limiting factors in modern processors, this work focuses on interconnect design and its influence on NUCA performance and power. We extend the widely used CACTI cache modeling tool to take network design parameters into account. With these overheads appropriately accounted for, the optimal cache organization is typically very different from that assumed in prior NUCA studies. To alleviate the interconnect delay bottleneck, we propose novel cache access optimizations that introduce heterogeneity within the inter-bank network. Careful consideration of interconnect choices for a large cache results in a 51% performance improvement over a baseline generic NoC, and the introduction of heterogeneity within the network yields an additional 11-15% performance improvement.
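
As a rough illustration of why access latency in a banked cache is non-uniform, the sketch below assumes a 2D mesh of banks with the cache controller at one corner and a fixed cost per network hop. The function names, the grid layout, and every cycle count are hypothetical placeholders chosen for illustration; they are not taken from the paper or from CACTI.

```python
# Hypothetical toy model of NUCA access latency on a 2D mesh of cache banks.
# All parameter values below are illustrative assumptions, not measured data.

def bank_access_cycles(row, col, bank_cycles=3, router_cycles=3, link_cycles=1):
    """Round-trip latency (in cycles) to the bank at (row, col).

    Assumes the cache controller sits at grid position (0, 0) and that
    requests follow a shortest path on the mesh, so the hop count is the
    Manhattan distance. Each hop pays one router and one link traversal.
    """
    hops = row + col
    per_hop = router_cycles + link_cycles
    # Request travels to the bank, the bank is accessed, the reply travels back.
    return 2 * hops * per_hop + bank_cycles


def latency_map(rows, cols, **params):
    """Latency of every bank in a rows x cols grid, as a list of lists."""
    return [[bank_access_cycles(r, c, **params) for c in range(cols)]
            for r in range(rows)]


if __name__ == "__main__":
    for line in latency_map(4, 4):
        print(" ".join(f"{cycles:3d}" for cycles in line))
```

Under these assumed numbers the per-hop network cost quickly outweighs the bank access time for distant banks, which is the kind of overhead the abstract argues should be accounted for when choosing a cache organization.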