PFetch: software prefetching exploiting temporal predictability of memory access streams

  • Authors:
  • Jaydeep Marathe; Frank Mueller

  • Affiliations:
  • North Carolina State University, Raleigh, NC; North Carolina State University, Raleigh, NC

  • Venue:
  • Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
  • Year:
  • 2008


Abstract

CPU speeds have increased faster than memory access latencies have improved in recent years. As a result, for programs that suffer excessive cache misses, the CPU is increasingly stalled waiting for the memory system to deliver the requested memory line. Prefetching is a latency-hiding technique that tackles this problem: if the address of a memory line that misses in the cache can be predicted sufficiently far in advance, the line can be prefetched into the cache before it is accessed, reducing the effective latency of that access. In this paper, we develop a novel software-only data prefetching scheme that works at the instruction level and exploits predictability in the access stream to prefetch memory lines accessed in the future. Working at the instruction level gives us a global view of memory access patterns across function, module and library boundaries. Conceptually, our scheme monitors the memory locations accessed by loads and stores, as well as their contents, looking for instances of predictability such that the address of a load miss can be pre-determined from a limited number of past accesses. We make the following contributions in this work. First, we present a novel prefetching strategy that unifies and generalizes a number of past approaches, each of which targets a specific source of address predictability: next-line prefetching, self-stride prefetching, "intra-iteration" stride prefetching and same-object prefetching. In addition, it extends and generalizes the SPAID scheme for pointer- and call-intensive programs. Second, we present a new threshold-based approach that addresses prefetch accuracy, prefetch timeliness and prefetch redundancy. Third, we assess our scheme both with a cache simulator and on a real machine, where we evaluate it with hardware performance counters. Overall, we demonstrate that our approach achieves a significant reduction in L1 cache misses for several benchmarks on a real machine.
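To make the mechanism concrete, below is a minimal sketch in C of one component described in the abstract: self-stride detection with a threshold-based issue policy. This is not the authors' implementation; it assumes an instrumentation layer that invokes a callback with the program counter and effective address of each monitored load or store, and all identifiers (rpt_entry, pfetch_observe, RPT_SIZE, CONF_THRESHOLD, DEGREE) are hypothetical.

```c
/*
 * Illustrative sketch only: a per-instruction reference prediction table
 * with a saturating confidence counter, in the spirit of the self-stride
 * component and the threshold-based policy described in the abstract.
 */
#include <stdint.h>

#define RPT_SIZE        1024   /* entries, indexed by hashed instruction PC */
#define CONF_THRESHOLD  3      /* stride repeats required before issuing    */
#define DEGREE          2      /* prefetch this many strides ahead          */

struct rpt_entry {
    uintptr_t last_addr;  /* last effective address seen at this PC */
    intptr_t  stride;     /* last observed address delta            */
    int       conf;       /* saturating confidence counter          */
};

static struct rpt_entry rpt[RPT_SIZE];

/* Conceptually called on every monitored load/store with its PC and
 * effective address; issues a prefetch once the stride is stable.   */
static void pfetch_observe(uintptr_t pc, uintptr_t addr)
{
    struct rpt_entry *e = &rpt[(pc >> 2) % RPT_SIZE];
    intptr_t delta = (intptr_t)(addr - e->last_addr);

    if (delta == e->stride && delta != 0) {
        if (e->conf < CONF_THRESHOLD)
            e->conf++;                /* pattern holds: gain confidence */
    } else {
        e->stride = delta;
        e->conf = 0;                  /* pattern broke: start over      */
    }
    e->last_addr = addr;

    /* Threshold gate: prefetch only after the stride has repeated
     * CONF_THRESHOLD times, trading coverage for accuracy and avoiding
     * redundant prefetches; targeting DEGREE strides ahead addresses
     * timeliness.                                                     */
    if (e->conf >= CONF_THRESHOLD)
        __builtin_prefetch((const void *)(addr + DEGREE * e->stride),
                           0 /* read */, 1 /* low temporal locality */);
}
```

Under these assumptions, the confidence counter plays the role of the threshold-based approach named in the abstract: it governs accuracy and redundancy (no prefetch until the pattern is trusted), while the prefetch degree governs timeliness (how far ahead of the demand access the line arrives).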