The migration prefetcher: Anticipating data promotion in dynamic NUCA caches

Authors:
Javier Lira;Timothy M. Jones;Carlos Molina;Antonio González
Affiliations:
Universitat Politècnica de Catalunya, Spain;University of Cambridge, UK;Universitat Rovira i Virgili, Spain;Intel Barcelona Research Center, Intel Labs - UPC, Spain
Venue:
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Year:
2012

Citing 22
Cited 0

Prefetching using Markov predictors

Proceedings of the 24th annual international symposium on Computer architecture
Clock rate versus IPC: the end of the road for conventional microarchitectures

Proceedings of the 27th annual international symposium on Computer architecture
Cache Memories

ACM Computing Surveys (CSUR)
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Simics: A Full System Simulation Platform

Computer
Effective Hardware-Based Data Prefetching for High-Performance Processors

IEEE Transactions on Computers
Orion: a power-performance simulator for interconnection networks

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
TCP: Tag Correlating Prefetchers

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Best of Both Latency and Throughput

ICCD '04 Proceedings of the IEEE International Conference on Computer Design
Managing Wire Delay in Large Chip-Multiprocessor Caches

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Data Cache Prefetching Using a Global History Buffer

HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
SimFlex: Statistical Sampling of Computer System Simulation

IEEE Micro
ASR: Adaptive Selective Replication for CMP Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Analysis of static and dynamic energy consumption in NUCA caches: initial results

MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
A novel migration-based NUCA design for chip multiprocessors

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
The PARSEC benchmark suite: characterization and architectural implications

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Leveraging on-chip networks for data cache migration in chip multiprocessors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Reactive NUCA: near-optimal block placement and replication in distributed caches

Proceedings of the 36th annual international symposium on Computer architecture
Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System

PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques

Quantified Score

Hi-index	0.00

Visualization

Abstract

The exponential increase in multicore processor (CMP) cache sizes accompanied by growing on-chip wire delays make it difficult to implement traditional caches with a single, uniform access latency. Non-Uniform Cache Architecture (NUCA) designs have been proposed to address this problem. A NUCA divides the whole cache memory into smaller banks and allows banks nearer a processor core to have lower access latencies than those further away, thus mitigating the effects of the cache's internal wires. Determining the best placement for data in the NUCA cache at any particular moment during program execution is crucial for exploiting the benefits that this architecture provides. Dynamic NUCA (D-NUCA) allows data to be mapped to multiple banks within the NUCA cache, and then uses data migration to adapt data placement to the program's behavior. Although the standard migration scheme is effective in moving data to its optimal position within the cache, half the hits still occur within non-optimal banks. This paper reduces this number by anticipating data migrations and moving data to the optimal banks in advance of being required. We introduce a prefetcher component to the NUCA cache that predicts the next memory request based on the past. We develop a realistic implementation of this prefetcher and, furthermore, experiment with a perfect prefetcher that always knows where the data resides, in order to evaluate the limits of this approach. We show that using our realistic data prefetching to anticipate data migrations in the NUCA cache can reduce the access latency by 15% on average and achieve performance improvements of up to 17%.