An evaluation of directory schemes for cache coherence
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
The Wisconsin multicube: a new large-scale cache-coherent multiprocessor
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Introducing memory into the switch elements of multiprocessor interconnection networks
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
LimitLESS directories: A scalable cache coherence scheme
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The DASH prototype: implementation and performance
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
The GLOW cache coherence protocol extensions for widely shared data
ICS '96 Proceedings of the 10th international conference on Supercomputing
The SGI Origin: a ccNUMA highly scalable server
Proceedings of the 24th annual international symposium on Computer architecture
IEEE Transactions on Computers - Special issue on cache memory and related problems
Multicast snooping: a new coherence method using a multicast address network
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
CACHET: an adaptive cache coherence protocol for distributed shared-memory systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
Bochs: A Portable PC Emulator for Unix/X
Linux Journal
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
Two economical directory schemes for large-scale cache coherent multiprocessors
ACM SIGARCH Computer Architecture News
Performance Evaluation of the Slotted Ring Multiprocessor
IEEE Transactions on Computers
The Murphi Verification System
CAV '96 Proceedings of the 8th International Conference on Computer Aided Verification
Token coherence: decoupling performance and correctness
Proceedings of the 30th annual international symposium on Computer architecture
Computer Architecture: A Quantitative Approach
Computer Architecture: A Quantitative Approach
The Alpha 21364 Network Architecture
HOTI '01 Proceedings of the The Ninth Symposium on High Performance Interconnects
A New Scalable Directory Architecture for Large-Scale Multiprocessors
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
A Delay Model and Speculative Architecture for Pipelined Routers
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Principles and Practices of Interconnection Networks
Principles and Practices of Interconnection Networks
Speculative Incoherent Cache Protocols
IEEE Micro
Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors
Proceedings of the 32nd annual international symposium on Computer Architecture
Interconnect-Aware Coherence Protocols for Chip Multiprocessors
Proceedings of the 33rd annual international symposium on Computer Architecture
Proximity-aware directory-based coherence for multi-core processor architectures
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
A novel dimensionally-decomposed router for on-chip communication in 3D architectures
Proceedings of the 34th annual international symposium on Computer architecture
The Power of Priority: NoC Based Distributed Cache Coherency
NOCS '07 Proceedings of the First International Symposium on Networks-on-Chip
A consistency architecture for hierarchical shared caches
Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Comparison of memory write policies for NoC based multicore cache coherent systems
Proceedings of the conference on Design, automation and test in Europe
Extending CC-NUMA systems to support write update optimizations
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Improving support for locality and fine-grain sharing in chip multiprocessors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Leveraging on-chip networks for data cache migration in chip multiprocessors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
In-Network Caching for Chip Multiprocessors
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Push-assisted migration of real-time tasks in multi-core processors
Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
81.6 GOPS object recognition processor based on a memory-centric NoC
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
In-network coherence filtering: snoopy coherence without broadcasts
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
An analysis of on-chip interconnection networks for large-scale chip multiprocessors
ACM Transactions on Architecture and Code Optimization (TACO)
Subspace snooping: filtering snoops with operating system support
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Proximity coherence for chip multiprocessors
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
SWEL: hardware cache coherence protocols to map shared data onto shared caches
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Latency criticality aware on-chip communication
Proceedings of the Conference on Design, Automation and Test in Europe
An adaptive cache coherence protocol for chip multiprocessors
Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies
A workload-adaptive and reconfigurable bus architecture for multicore processors
International Journal of Reconfigurable Computing
Run-time energy management of manycore systems through reconfigurable interconnects
Proceedings of the 21st edition of the great lakes symposium on Great lakes symposium on VLSI
A composite and scalable cache coherence protocol for large scale CMPs
Proceedings of the international conference on Supercomputing
Inferring packet dependencies to improve trace based simulation of on-chip networks
NOCS '11 Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip
An optimized multicore cache coherence design for exploiting communication locality
Proceedings of the great lakes symposium on VLSI
Concerning with on-chip network features to improve cache coherence protocols for CMPs
ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Predicting Coherence Communication by Tracking Synchronization Points at Run Time
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Efficient multicast schemes for 3-D Networks-on-Chip
Journal of Systems Architecture: the EUROMICRO Journal
Bandwidth Adaptive Cache Coherence Optimizations for Chip Multiprocessors
International Journal of Parallel Programming
Hi-index | 0.00 |
With the trend towards increasing number of processor cores in future chip architectures, scalable directory-based protocols for maintaining cache coherence will be needed. However, directory-based protocols face well-known problems in delay and scalability. Most current protocol optimizations targeting these problems maintain a firm abstraction of the interconnection network fabric as a communication medium: protocol optimizations consist of endto- end messages between requestor, directory and sharer nodes, while network optimizations separately target lowering communication latency for coherence messages. In this paper, we propose an implementation of the cache coherence protocol within the network, embedding directories within each router node that manage and steer requests towards nearby data copies, enabling in-transit optimization of memory access delay. Simulation results across a range of SPLASH-2 benchmarks demonstrate significant performance improvement and good system scalability, with up to 44.5% and 56% savings in average memory access latency for 16 and 64-node systems, respectively, when compared against the baseline directory cache coherence protocol. Detailed microarchitecture and implementation characterization affirms the low area and delay impact of in-network coherence.