NoC-aware cache design for multithreaded execution on tiled chip multiprocessors

  • Authors:
  • Ahmed K. Abousamra; Alex K. Jones; Rami G. Melhem

  • Affiliations:
  • University of Pittsburgh, Pittsburgh, PA (all authors)

  • Venue:
  • Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
  • Year:
  • 2011

Abstract

In chip multiprocessors (CMPs), data access latency depends on the memory hierarchy organization, the on-chip interconnect (NoC), and the running workload. Reducing data access latency is vital to achieving performance improvements and scalability of threaded applications. Multithreaded applications generally exhibit sharing of data among the program threads, which generates coherence and data traffic on the NoC. Many NoC designs exploit communication locality to reduce communication latency by configuring special fast paths on which communication is faster than on the rest of the NoC. Communication patterns are directly affected by the cache organization. However, many cache organizations are designed in isolation from the underlying NoC or assume a simple NoC design, thus possibly missing optimization opportunities. In this work, we present a NoC-aware cache design that creates a symbiotic relationship between the NoC and the cache to reduce data access latency, improve utilization of cache capacity, and improve overall system performance. Specifically, considering a NoC designed to exploit communication locality, we design a Unique Private caching scheme that promotes locality in communication patterns. In turn, the NoC exploits this locality to allow fast access to remote data, thus reducing the need for data replication and allowing better utilization of cache capacity. The Unique Private cache stores the data mostly used by a processor core in its locally accessible cache bank, while leveraging dedicated high-speed circuits in the interconnect to provide remote cores with fast access to shared data. Simulations of a suite of scientific and commercial workloads show that our proposed design achieves speedups of 14% and 16% on a 16-core and a 64-core CMP, respectively, over a state-of-the-art NoC-cache co-designed system that also exploits communication locality in multithreaded applications.
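To illustrate the idea behind the Unique Private placement policy described above, here is a minimal, hypothetical sketch: each block is kept as a single copy in the bank of the core that accesses it most, and remote sharers reach it over a configured fast NoC circuit instead of replicating it locally. The class name, counters, and latency values are illustrative assumptions, not the paper's actual mechanism or numbers.

```python
from collections import Counter, defaultdict

# Illustrative latencies (assumed, not from the paper), in cycles:
LOCAL_LATENCY = 3      # hit in the core's own bank
CIRCUIT_LATENCY = 8    # remote hit over a pre-configured fast path
BASELINE_REMOTE = 20   # remote hit over the regular packet-switched NoC

class UniquePrivateCache:
    """Toy model: one copy per block, homed at its heaviest user's bank."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.access_counts = defaultdict(Counter)  # block -> {core: count}
        self.home = {}                             # block -> home bank (core id)

    def access(self, core, block):
        """Record an access and return its latency under this policy."""
        self.access_counts[block][core] += 1
        # Home the block at the core that uses it most ("unique private" bank).
        self.home[block] = self.access_counts[block].most_common(1)[0][0]
        if self.home[block] == core:
            return LOCAL_LATENCY
        # Single copy, no replication: remote sharers use the fast circuit.
        return CIRCUIT_LATENCY

cache = UniquePrivateCache(num_cores=4)
# Core 0 uses block 0x10 heavily; cores 1-3 then read it occasionally.
for _ in range(10):
    cache.access(0, 0x10)
remote_latencies = [cache.access(c, 0x10) for c in (1, 2, 3)]
```

In this sketch the block stays homed at core 0's bank, so core 0 keeps local-hit latency while cores 1-3 pay only the fast-circuit latency rather than either the baseline remote latency or the capacity cost of local replicas, which is the trade-off the abstract argues for.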