Improving support for locality and fine-grain sharing in chip multiprocessors

Authors:
Hemayet Hossain;Sandhya Dwarkadas;Michael C. Huang
Affiliations:
University of Rochester, Rochester, NY, USA;University of Rochester, Rochester, NY, USA;University of Rochester, Rochester, NY, USA
Venue:
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Year:
2008

Citing 27
Cited 5

Adaptive cache coherency for detecting migratory shared data

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
An adaptive cache coherence protocol optimized for migratory sharing

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Implementing global memory management in a workstation cluster

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Serverless network file systems

ACM Transactions on Computer Systems (TOCS) - Special issue on operating system principles
Generating representative Web workloads for network and server performance evaluation

SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Using prediction to accelerate coherence protocols

Proceedings of the 25th annual international symposium on Computer architecture
Memory sharing predictor: the key to a speculative coherent DSM

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
On the scale and performance of cooperative Web proxy caching

Proceedings of the seventeenth ACM symposium on Operating systems principles
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Simics: A Full System Simulation Platform

Computer
The MIPS R10000 Superscalar Microprocessor

IEEE Micro
Simulating a $2M Commercial Server on a $2K PC

Computer
Improving CC-NUMA Performance Using Instruction-Based Prediction

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors

Proceedings of the 30th annual international symposium on Computer architecture
Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Montecito: A Dual-Core, Dual-Thread Itanium Processor

IEEE Micro
Niagara: A 32-Way Multithreaded Sparc Processor

IEEE Micro
Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors

Proceedings of the 32nd annual international symposium on Computer Architecture
Optimizing Replication, Communication, and Capacity Allocation in CMPs

Proceedings of the 32nd annual international symposium on Computer Architecture
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Cooperative Caching for Chip Multiprocessors

Proceedings of the 33rd annual international symposium on Computer Architecture
Adaptive Parallel Graph Mining for CMP Architectures

ICDM '06 Proceedings of the Sixth International Conference on Data Mining
In-Network Cache Coherence

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
ASR: Adaptive Selective Replication for CMP Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing

HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Power/Performance/Thermal Design-Space Exploration for Multicore Architectures

IEEE Transactions on Parallel and Distributed Systems

Proximity coherence for chip multiprocessors

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Multi-CMP system with data communication on the fly

The Journal of Supercomputing
MAximum Multicore POwer (MAMPO): an automatic multithreaded synthetic power virus generation framework for multicore systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Predicting Coherence Communication by Tracking Synchronization Points at Run Time

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Building expressive, area-efficient coherence directories

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Quantified Score

Hi-index	0.00

Visualization

Abstract

Both commercial and scientific workloads benefit from concurrency and exhibit data sharing across threads/processes. The resulting sharing patterns are often fine-grain, with the modified cache lines still residing in the writer's primary cache when accessed. Chip multiprocessors present an opportunity to optimize for fine-grain sharing using direct access to remote processor components through low-latency on-chip interconnects. In this paper, we present Adaptive Replication, Migration, and producer-Consumer Optimization (ARMCO), a coherence protocol that, to the best of our knowledge, is the first to exploit direct access to the L1 caches of remote processors (rather than via coherence mechanisms) in order to support fine-grain sharing. Our goal is to provide support for tightly coupled sharing by recognizing and adapting to common sharing patterns such as migratory, producer-consumer, multiple-reader, and multiple read-write. The protocol places data close to where it is most needed and leverages direct access when following conventional coherence actions proves wasteful. Via targeted optimizations for each of these access patterns, our proposed protocol is able to reduce the average access latency and increase the effective cache capacity at the L1 level with on-chip storage overhead as low as 0.38%. Full-system simulations of 16-processor CMPs show an average (geometric mean) speedup of 1.13 (ranging from 1.04 to 2.26) for 12 commercial, scientific, and mining workloads, with an average of 1.18 if we include 2 microbenchmarks. ARMCO also reduces the on-chip bandwidth requirements and dynamic energy (power) consumption by an average of 33.3% and 31.2% (20.2%) respectively. By evaluating optimizations at both the L1 and the L2 level, we demonstrate that when considering performance, optimization at the L1 level is more effective at supporting fine-grain sharing than that at the L2 level.