Temporal instruction fetch streaming

  • Authors and affiliations:
  • Michael Ferdman, Computer Architecture Lab (CALCM), Carnegie Mellon University, Pittsburgh, PA, USA
  • Thomas F. Wenisch, Advanced Computer Architecture Lab (ACAL), University of Michigan, Ann Arbor, USA
  • Anastasia Ailamaki, Computer Architecture Lab (CALCM), Carnegie Mellon University, Pittsburgh, PA, USA
  • Babak Falsafi, Parallel Systems Architecture Lab (PARSA), École Polytechnique Fédérale de Lausanne, Switzerland
  • Andreas Moshovos, Department of ECE, University of Toronto, Canada

  • Venue: Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
  • Year: 2008

Abstract

L1 instruction-cache misses pose a critical performance bottleneck in commercial server workloads. Cache access latency constraints preclude L1 instruction caches large enough to capture the application, library, and OS instruction working sets of these workloads. To cope with capacity constraints, researchers have proposed instruction prefetchers that use branch predictors to explore future control flow. However, such prefetchers suffer from several fundamental flaws: their lookahead is limited by branch prediction bandwidth, their accuracy suffers from geometrically-compounding branch misprediction probability, and they are ignorant of the cache contents, frequently predicting blocks already present in L1. Hence, L1 instruction misses remain a bottleneck. We propose Temporal Instruction Fetch Streaming (TIFS)—a mechanism for prefetching temporally-correlated instruction streams from lower-level caches. Rather than explore a program’s control flow graph, TIFS predicts future instruction-cache misses directly, through recording and replaying recurring L1 instruction miss sequences. In this paper, we first present an information-theoretic offline trace analysis of instruction-miss repetition to show that 94% of L1 instruction misses occur in long, recurring sequences. Then, we describe a practical mechanism to record these recurring sequences in the L2 cache and leverage them for instruction-cache prefetching. Our TIFS design requires less than 5% storage overhead over the baseline L2 cache and improves performance by 11% on average and 24% at best in a suite of commercial server workloads.
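To make "geometrically-compounding" concrete: if each branch prediction is 95% accurate, a prefetcher that must predict 20 branches ahead stays on the correct path only about 0.95^20 ≈ 36% of the time, which is why branch-directed lookahead degrades quickly.

The sketch below illustrates the record-and-replay idea behind TIFS in Python. It is a minimal software analogy, not the paper's hardware design: the history log, index table, and the history_len and lookahead parameters are illustrative assumptions. On each L1 instruction miss, the prefetcher looks up the previous occurrence of the missing block in a log of past misses and issues prefetches for the blocks that followed it last time.

    from collections import deque

    class TemporalStreamPrefetcher:
        """Record-and-replay prefetching sketch: replay the miss
        sequence that followed the last occurrence of each miss."""

        def __init__(self, history_len=4096, lookahead=4):
            self.history = deque(maxlen=history_len)  # log of past miss addresses
            self.index = {}        # miss address -> log position of last occurrence
            self.next_pos = 0      # total misses recorded so far
            self.lookahead = lookahead

        def on_l1i_miss(self, block_addr):
            """Record the miss; return the addresses to prefetch."""
            prefetches = []
            prev = self.index.get(block_addr)
            if prev is not None:
                # Map log positions to deque indices; the oldest position
                # still retained is next_pos - len(history).
                start = prev + 1 - (self.next_pos - len(self.history))
                stop = min(start + self.lookahead, len(self.history))
                for i in range(start, stop):
                    if i >= 0:  # skip positions already evicted from the log
                        prefetches.append(self.history[i])
            # Record this miss and remember where it landed in the log.
            self.index[block_addr] = self.next_pos
            self.history.append(block_addr)
            self.next_pos += 1
            return prefetches

For example, replaying a recurring miss stream a second time triggers prefetches for its continuation:

    pf = TemporalStreamPrefetcher()
    for addr in [0x100, 0x140, 0x180, 0x1c0, 0x100, 0x140]:
        print(hex(addr), "->", [hex(a) for a in pf.on_l1i_miss(addr)])
    # The second miss on 0x100 yields prefetches [0x140, 0x180, 0x1c0].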