Stream chaining: exploiting multiple levels of correlation in data prefetching

Authors:
Pedro Diaz;Marcelo Cintra
Affiliations:
University of Edinburgh, Edinburgh, United Kingdom;University of Edinburgh, Edinburgh, United Kingdom
Venue:
Proceedings of the 36th annual international symposium on Computer architecture
Year:
2009

Citing 20
Cited 11

An effective on-chip preloading scheme to reduce data access penalty

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Stride directed prefetching in scalar processors

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Evaluating stream buffers as a secondary cache replacement

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Prefetching using Markov predictors

Proceedings of the 24th annual international symposium on Computer architecture
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Dead-block prediction & dead-block correlating prefetchers

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Using a user-level memory thread for correlation prefetching

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Going the distance for TLB prefetching: an application-driven study

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Timekeeping in the memory system: predicting and optimizing memory behavior

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
TCP: Tag Correlating Prefetchers

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Branch History Guided Instruction Prefetching

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
AC/DC: An Adaptive Data Cache Prefetcher

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
MicroLib: A Case for the Quantitative Comparison of Micro-Architecture Mechanisms

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Temporal Streaming of Shared Memory

Proceedings of the 32nd annual international symposium on Computer Architecture
Data Cache Prefetching Using a Global History Buffer

HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework

Proceedings of the International Symposium on Code Generation and Optimization
Spatial Memory Streaming

Proceedings of the 33rd annual international symposium on Computer Architecture
BioBench: A Benchmark Suite of Bioinformatics Applications

ISPASS '05 Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2005
IBM POWER6 microarchitecture

IBM Journal of Research and Development
Spatio-temporal memory streaming

Proceedings of the 36th annual international symposium on Computer architecture

Spatio-temporal memory streaming

Proceedings of the 36th annual international symposium on Computer architecture
Timing local streams: improving timeliness in data prefetching

Proceedings of the 24th ACM International Conference on Supercomputing
Global-aware and multi-order context-based prefetching for high-performance processors

International Journal of High Performance Computing Applications
Unified memory optimizing architecture: memory subsystem control with a unified predictor

Proceedings of the 26th ACM international conference on Supercomputing
Algorithm-level Feedback-controlled Adaptive data prefetcher: Accelerating data access for high-performance processors

Parallel Computing
Prefetching and cache management using task lifetimes

Proceedings of the 27th international ACM conference on International conference on supercomputing
S/DC: a storage and energy efficient data prefetcher

DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
A Prefetching Scheme Exploiting both Data Layout and Access History on Disk

ACM Transactions on Storage (TOS)
APOGEE: adaptive prefetching on GPUs for energy efficiency

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Linearizing irregular memory accesses for improved correlated prefetching

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
The case of using multiple streams in streaming

International Journal of Automation and Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Data prefetching has long been an important technique to amortize the effects of the memory wall, and is likely to remain so in the current era of multi-core systems. Most prefetchers operate by identifying patterns and correlations in the miss address stream. Separating streams according to the memory access instruction that generates the misses is an effective way of filtering out spurious addresses from predictable streams. On the other hand, by localizing streams based on the memory access instructions, such prefetchers both lose the complete time sequence information of misses and can only issue prefetches for a single memory access instruction at a time. This paper proposes a novel class of prefetchers based on the idea of linking various localized streams into predictable chains of missing memory access instructions such that the prefetcher can issue prefetches along multiple streams. In this way the prefetcher is not limited to prefetching deeply for a single missing memory access instruction but can instead adaptively prefetch for other memory access instructions closer in time. Experimental results show that the proposed prefetcher consistently achieves better performance than a state-of-the-art prefetcher -- 10% on average, being only outperformed in very few cases and then by only 2%, and outperforming that prefetcher by as much as 55% -- while consuming the same amount of memory bandwidth.