Introducing memory into the switch elements of multiprocessor interconnection networks
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
The Stanford Dash Multiprocessor
Computer
The DASH prototype: implementation and performance
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Generating representative Web workloads for network and server performance evaluation
SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Multicast snooping: a new coherence method using a multicast address network
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
CACHET: an adaptive cache coherence protocol for distributed shared-memory systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
A generic architecture for on-chip packet-switched interconnections
DATE '00 Proceedings of the conference on Design, automation and test in Europe
Parallel Computer Architecture: A Hardware/Software Approach
Parallel Computer Architecture: A Hardware/Software Approach
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Performance Evaluation of the Slotted Ring Multiprocessor
IEEE Transactions on Computers
Efficient synchronization for nonuniform communication architectures
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Token coherence: decoupling performance and correctness
Proceedings of the 30th annual international symposium on Computer architecture
A Network on Chip Architecture and Design Methodology
ISVLSI '02 Proceedings of the IEEE Computer Society Annual Symposium on VLSI
xpipes: a Latency Insensitive Parameterized Network-on-chip Architecture For Multi-Processor SoCs
ICCD '03 Proceedings of the 21st International Conference on Computer Design
The Nostrum Backbone - a Communication Protocol Stack for Networks on Chip
VLSID '04 Proceedings of the 17th International Conference on VLSI Design
On-chip networks: A scalable, communication-centric embedded system design paradigm
VLSID '04 Proceedings of the 17th International Conference on VLSI Design
QNoC: QoS architecture and design process for network on chip
Journal of Systems Architecture: the EUROMICRO Journal - Special issue: Networks on chip
DyAD: smart routing for networks-on-chip
Proceedings of the 41st annual Design Automation Conference
Managing Wire Delay in Large Chip-Multiprocessor Caches
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Speculative Incoherent Cache Protocols
IEEE Micro
NoC Synthesis Flow for Customized Domain Specific Multiprocessor Systems-on-Chip
IEEE Transactions on Parallel and Distributed Systems
Cost considerations in network on chip
Integration, the VLSI Journal - Special issue: Networks on chip and reconfigurable fabrics
HERMES: an infrastructure for low area overhead packet-switching networks on chip
Integration, the VLSI Journal - Special issue: Networks on chip and reconfigurable fabrics
Exploring the energy efficiency of cache coherence protocols in single-chip multi-processors
GLSVLSI '05 Proceedings of the 15th ACM Great Lakes symposium on VLSI
Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors
Proceedings of the 32nd annual international symposium on Computer Architecture
A NUCA substrate for flexible CMP cache sharing
Proceedings of the 19th annual international conference on Supercomputing
Æthereal Network on Chip: Concepts, Architectures, and Implementations
IEEE Design & Test
Interconnect-Aware Coherence Protocols for Chip Multiprocessors
Proceedings of the 33rd annual international symposium on Computer Architecture
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
ASR: Adaptive Selective Replication for CMP Caches
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Efficient synchronization for embedded on-chip multiprocessors
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
A consistency architecture for hierarchical shared caches
Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Best of both worlds: A bus enhanced NoC (BENoC)
NOCS '09 Proceedings of the 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip
81.6 GOPS object recognition processor based on a memory-centric NoC
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Application-aware prioritization mechanisms for on-chip networks
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Aérgia: exploiting packet latency slack in on-chip networks
Proceedings of the 37th annual international symposium on Computer architecture
Latency criticality aware on-chip communication
Proceedings of the Conference on Design, Automation and Test in Europe
NoC-aware cache design for multithreaded execution on tiled chip multiprocessors
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Energy-efficient cache coherence protocol for NoC-based MPSoCs
Proceedings of the 24th symposium on Integrated circuits and systems design
RAPA: reliability-aware priority arbitration strategy for network on chip
Proceedings of the great lakes symposium on VLSI
The Journal of Supercomputing
Hi-index | 0.00 |
The paper introduces Network-on-Chip (NoC) design methodology and low cost mechanisms for supporting efficient cache access and cache coherency in future high-performance Chip Multi Processors (CMPs). We address previously proposed CMP architectures based on Non Uniform Cache Architecture (NUCA) over NoC, analyze basic memory transactions and translate them into a set of network transactions. We first show how a simple, generic NoC which is equipped with needed module interface functionalities can provide infrastructure for the coherent access of both static and dynamic NUCA. Then we show how several low cost mechanisms incorporated into such a Vanilla NoC can facilitate CMP and boost performance of a cache coherent NUCA CMP. The basic mechanism is based on priority support embedded in the NoC, which differentiates between short control signals and long data messages to achieve a major reduction in cache access delay. The low cost Priority-based NoC is extremely useful for increasing performance of almost any other CMP transaction (i.e. uncached and cache-coherent R/W, search in DNUCA, isolating low priority traffic, synchronization and mutual exclusion support). Priority-based NoC along with the discussed NoC interfaces are evaluated in detail using CMP-NoC simulations across several SPLASH-2 benchmarks and static web content serving benchmarks showing substantial L2 cache access delay reduction and overall program speedup. For further system improvements, we introduce additional low cost NoC mechanisms that include: virtual invalidation rings, efficient store-and-forward multicast for short messages which is embedded within a wormhole NoC, and a cache-line search mechanism for the efficient operation of dynamic NUCA. These mechanisms can also expedite not only cache coherency but also other basic CMP transactions such as search and serialization primitives support.