PFetch: software prefetching exploiting temporal predictability of memory access streams

  • Authors:
  • Jaydeep Marathe; Frank Mueller

  • Affiliations:
  • North Carolina State University, Raleigh, NC; North Carolina State University, Raleigh, NC

  • Venue:
  • Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
  • Year:
  • 2008


Abstract

CPU speeds have increased faster than memory access latencies have improved in recent years. As a result, for programs that suffer excessive cache misses, the CPU is increasingly stalled waiting for the memory system to deliver the requested memory line. Prefetching is a latency-hiding technique that tackles this problem: if the address of a memory line that misses in the cache can be predicted sufficiently far in advance, the line can be prefetched into the cache before it is accessed, reducing the effective latency of that access. In this paper, we develop a novel software-only data prefetching scheme that works at the instruction level and exploits predictability in the access stream to prefetch memory lines accessed in the future. Working at the instruction level gives us a global view of memory access patterns across function, module and library boundaries. Conceptually, our scheme monitors the memory locations accessed by loads and stores, as well as their contents, looking for instances of predictability such that the address of a load miss can be pre-determined from a limited number of past accesses. We make the following contributions in this work. First, we present a novel prefetching strategy that unifies and generalizes a number of past approaches, each of which targets a specific source of address predictability: next-line prefetching, self-stride prefetching, "intra-iteration" stride prefetching and same-object prefetching. In addition, it extends and generalizes the SPAID scheme for pointer- and call-intensive programs. Second, we present a new threshold-based approach that addresses prefetch accuracy, prefetch timeliness and prefetch redundancy. Third, we assess our scheme both with a cache simulator and on a real machine, where we evaluate it with hardware performance counters. Overall, we demonstrate that our approach achieves a significant reduction in L1 cache misses for several benchmarks on a real machine.
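To make the mechanism concrete, below is a minimal sketch in C of one component described in the abstract: self-stride detection with a threshold-based issue policy. This is not the authors' implementation; it assumes an instrumentation layer that invokes a callback with the program counter and effective address of each monitored load or store, and all identifiers (rpt_entry, pfetch_observe, RPT_SIZE, CONF_THRESHOLD, DEGREE) are hypothetical.

```c
/*
 * Illustrative sketch only: a per-instruction reference prediction table
 * with a saturating confidence counter, in the spirit of the self-stride
 * component and the threshold-based policy described in the abstract.
 */
#include <stdint.h>

#define RPT_SIZE        1024   /* entries, indexed by hashed instruction PC */
#define CONF_THRESHOLD  3      /* stride repeats required before issuing    */
#define DEGREE          2      /* prefetch this many strides ahead          */

struct rpt_entry {
    uintptr_t last_addr;  /* last effective address seen at this PC */
    intptr_t  stride;     /* last observed address delta            */
    int       conf;       /* saturating confidence counter          */
};

static struct rpt_entry rpt[RPT_SIZE];

/* Conceptually called on every monitored load/store with its PC and
 * effective address; issues a prefetch once the stride is stable.   */
static void pfetch_observe(uintptr_t pc, uintptr_t addr)
{
    struct rpt_entry *e = &rpt[(pc >> 2) % RPT_SIZE];
    intptr_t delta = (intptr_t)(addr - e->last_addr);

    if (delta == e->stride && delta != 0) {
        if (e->conf < CONF_THRESHOLD)
            e->conf++;                /* pattern holds: gain confidence */
    } else {
        e->stride = delta;
        e->conf = 0;                  /* pattern broke: start over      */
    }
    e->last_addr = addr;

    /* Threshold gate: prefetch only after the stride has repeated
     * CONF_THRESHOLD times, trading coverage for accuracy and avoiding
     * redundant prefetches; targeting DEGREE strides ahead addresses
     * timeliness.                                                     */
    if (e->conf >= CONF_THRESHOLD)
        __builtin_prefetch((const void *)(addr + DEGREE * e->stride),
                           0 /* read */, 1 /* low temporal locality */);
}
```

Under these assumptions, the confidence counter plays the role of the threshold-based approach named in the abstract: it governs accuracy and redundancy (no prefetch until the pattern is trusted), while the prefetch degree governs timeliness (how far ahead of the demand access the line arrives).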