Effective stream-based and execution-based data prefetching

Authors:
Sorin Iacobovici;Lawrence Spracklen;Sudarshan Kadambi;Yuan Chou;Santosh G. Abraham
Affiliations:
Sun Microsystems Inc., Sunnyvale, CA;Sun Microsystems Inc., Sunnyvale, CA;Sun Microsystems Inc., Sunnyvale, CA;Sun Microsystems Inc., Sunnyvale, CA;Sun Microsystems Inc., Sunnyvale, CA
Venue:
Proceedings of the 18th annual international conference on Supercomputing
Year:
2004


Abstract

As processor speeds continue to outpace the memory subsystem, cache-missing memory operations are becoming increasingly important to application performance. In response to this trend, most modern processors now include hardware (HW) prefetchers, which reduce the number of cache-missing loads observed by an application. This paper analyzes the behavior of cache-missing loads in SPEC CPU2000 and highlights the inability of unit-stride and single non-unit-stride prefetchers to prefetch correctly for some commonly occurring streams. In response to this analysis, a novel multi-stride prefetcher that supports streams with up to four distinct strides is proposed. Performance analysis for SPEC CPU2000 shows that the proposed multi-stride prefetcher can outperform current stride prefetchers on several benchmarks, most notably mcf, lucas, and facerec, where it achieves an additional performance gain of up to 57%. The performance of the strided HW prefetchers is also contrasted with another recently proposed prefetch scheme, runahead execution (RAE), and the synergy between the schemes is investigated.
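
To make the limitation of single-stride prefetching concrete, the following is a minimal software sketch, not the hardware design proposed in the paper. It assumes that a multi-stride stream can be modeled as a repeating cycle of at most four address deltas; the names `MAX_STRIDES`, `detect_stride_pattern`, and `predict_next` are hypothetical and chosen only for illustration.

```python
from typing import List, Optional

# Assumption: model "up to four distinct strides" as a repeating
# cycle of at most four address deltas. This is an illustration only.
MAX_STRIDES = 4


def detect_stride_pattern(addrs: List[int],
                          max_period: int = MAX_STRIDES) -> Optional[List[int]]:
    """Return the shortest repeating stride cycle (length <= max_period), or None."""
    deltas = [b - a for a, b in zip(addrs, addrs[1:])]
    for period in range(1, max_period + 1):
        # Require two full repetitions of the cycle before trusting it.
        if len(deltas) < 2 * period:
            break
        if all(deltas[i] == deltas[i - period]
               for i in range(period, len(deltas))):
            return deltas[:period]
    return None


def predict_next(addrs: List[int], pattern: List[int], count: int) -> List[int]:
    """Extend the observed address stream by `count` predicted addresses."""
    deltas_seen = len(addrs) - 1  # position within the stride cycle
    addr = addrs[-1]
    predicted = []
    for k in range(count):
        addr += pattern[(deltas_seen + k) % len(pattern)]
        predicted.append(addr)
    return predicted


# A stream whose deltas repeat as 8, 8, 48 (e.g. touching two adjacent
# fields and then jumping to the next record).
misses = [0, 8, 16, 64, 72, 80, 128]
pattern = detect_stride_pattern(misses)
if pattern is not None:
    print(pattern, predict_next(misses, pattern, 3))
```

For the example stream, whose deltas repeat as 8, 8, 48, a detector restricted to one stride (`max_period=1`) never finds a stable pattern, whereas the sketch recovers the cycle `[8, 8, 48]` and can keep predicting along it. A real hardware prefetcher would also have to handle training tables, confidence, and prefetch distance, none of which are modeled here.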