Locality-oblivious cache organization leveraging single-cycle multi-hop NoCs

Authors:
Woo-Cheol Kwon;Tushar Krishna;Li-Shiuan Peh
Affiliations:
MIT, Cambridge, MA, USA;Intel Corporation, Hudson, MA, USA;MIT, Cambridge, MA, USA
Venue:
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Year:
2014

Citing 34
Cited 0

The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
A Hierarchical Directory Scheme for Large-Scale Cache-Coherent Multipmcessors

IPPS '92 Proceedings of the 6th International Parallel Processing Symposium
Token coherence: decoupling performance and correctness

Proceedings of the 30th annual international symposium on Computer architecture
Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
TLC: Transmission Line Caches

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Improving Multiple-CMP Systems Using Token Coherence

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors

Proceedings of the 32nd annual international symposium on Computer Architecture
Optimizing Replication, Communication, and Capacity Allocation in CMPs

Proceedings of the 32nd annual international symposium on Computer Architecture
Microarchitecture of a High-Radix Router

Proceedings of the 32nd annual international symposium on Computer Architecture
A NUCA substrate for flexible CMP cache sharing

Proceedings of the 19th annual international conference on Supercomputing
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
The BlackWidow High-Radix Clos Network

Proceedings of the 33rd annual international symposium on Computer Architecture
Cooperative Caching for Chip Multiprocessors

Proceedings of the 33rd annual international symposium on Computer Architecture
ASR: Adaptive Selective Replication for CMP Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Virtual hierarchies to support server consolidation

Proceedings of the 34th annual international symposium on Computer architecture
Flattened butterfly: a cost-efficient topology for high-radix networks

Proceedings of the 34th annual international symposium on Computer architecture
On-Chip Interconnection Architecture of the Tile Processor

IEEE Micro
A 5-GHz Mesh Interconnect for a Teraflops Processor

IEEE Micro
Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture

IEEE Micro
The PARSEC benchmark suite: characterization and architectural implications

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Leveraging on-chip networks for data cache migration in chip multiprocessors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Characterization of Equalized and Repeated Interconnects for NoC Applications

IEEE Design & Test
Dynamic cache clustering for chip multiprocessors

Proceedings of the 23rd international conference on Supercomputing
Reactive NUCA: near-optimal block placement and replication in distributed caches

Proceedings of the 36th annual international symposium on Computer architecture
A communication characterisation of Splash-2 and Parsec

IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks

Proceedings of the 38th annual international symposium on Computer architecture
CloudCache: Expanding and shrinking private caches

HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Approaching the theoretical limits of a mesh NoC with a 16-node chip prototype in 45nm SOI

Proceedings of the 49th Annual Design Automation Conference
DSENT - A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling

NOCS '12 Proceedings of the 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip
Multi-Core Cache Hierarchies

Multi-Core Cache Hierarchies
SMART: a single-cycle reconfigurable NoC for SoC applications

Proceedings of the Conference on Design, Automation and Test in Europe
Breaking the on-chip latency barrier using SMART

HPCA '13 Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Locality has always been a critical factor in on-chip data placement on CMPs as accessing further-away caches has in the past been more costly than accessing nearby ones. Substantial research on locality-aware designs have thus focused on keeping a copy of the data private. However, this complicatesthe problem of data tracking and search/invalidation; tracking the state of a line at all on-chip caches at a directory or performing full-chip broadcasts are both non-scalable and extremely expensive solutions. In this paper, we make the case for Locality-Oblivious Cache Organization (LOCO), a CMP cache organization that leverages the on-chip network to create virtual single-cycle paths between distant caches, thus redefining the notion of locality. LOCO is a clustered cache organization, supporting both homogeneous and heterogeneous cluster sizes, and provides near single-cycle accesses to data anywhere within the cluster, just like a private cache. Globally, LOCO dynamically creates a virtual mesh connecting all the clusters, and performs an efficient global data search and migration over this virtual mesh, without having to resort to full-chip broadcasts or perform expensive directory lookups. Trace-driven and full system simulations running SPLASH-2 and PARSEC benchmarks show that LOCO improves application run time by up to 44.5% over baseline private and shared cache.