As the number of cores increases on chip multiprocessors (CMPs), coherence is fast becoming a central issue for multi-core performance. This is exacerbated by the fact that interconnection speeds are not scaling well with technology. This paper describes mechanisms to accelerate coherence for a multi-core architecture that has multiple private L2 caches and a scalable point-to-point interconnect between cores. These techniques exploit the differences in geometry between chip multiprocessors and traditional multiprocessor architectures. Directory-based protocols have been proposed as a scalable alternative to snoop-based protocols. In this paper, we discuss implementations of coherence for CMPs and propose and evaluate a novel directory-based coherence scheme to improve the performance of parallel programs on such processors. Proximity-aware coherence accelerates read and write misses by initiating cache-to-cache transfers from the spatially closest sharer. This has the dual benefit of eliminating unnecessary accesses to off-chip memory and minimizing the distance over which communicated data moves across the network. The proposed schemes result in speedups of up to 74.9% for our workloads.
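The idea of serving a miss from the spatially closest sharer can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a 2D mesh of tiles numbered row by row, hop count measured as Manhattan distance, and a directory entry that exposes its sharers as a set of tile IDs. The abstract says only that the interconnect is point-to-point, so the mesh layout, the tie-break rule, and all function names here are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def tile_coords(tile: int, mesh_width: int) -> Tuple[int, int]:
    """Map a tile ID to (x, y), assuming row-major numbering on a 2D mesh."""
    return tile % mesh_width, tile // mesh_width


def hop_distance(a: int, b: int, mesh_width: int) -> int:
    """Manhattan distance between two tiles (an assumed proxy for latency)."""
    ax, ay = tile_coords(a, mesh_width)
    bx, by = tile_coords(b, mesh_width)
    return abs(ax - bx) + abs(ay - by)


def closest_sharer(requester: int, sharers: Iterable[int],
                   mesh_width: int) -> Optional[int]:
    """Pick the sharer nearest to the requester for a cache-to-cache transfer.

    Returns None when no other tile holds the line, in which case the
    request would have to be satisfied elsewhere (for example, from memory).
    Ties are broken by lowest tile ID, an arbitrary choice for this sketch.
    """
    candidates = [s for s in sharers if s != requester]
    if not candidates:
        return None
    return min(candidates,
               key=lambda s: (hop_distance(requester, s, mesh_width), s))
```

For example, on a 4-wide mesh with requester tile 5 and sharers {0, 6, 15}, `closest_sharer(5, {0, 6, 15}, 4)` selects tile 6, which is one hop away, rather than tile 0 (two hops) or tile 15 (four hops).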