Temporal instruction fetch streaming

Authors:
Michael Ferdman;Thomas F. Wenisch;Anastasia Ailamaki;Babak Falsafi;Andreas Moshovos
Affiliations:
Computer Architecture Lab (CALCM), Carnegie Mellon University, Pittsburgh, PA, USA;Advanced Computer Architecture Lab (ACAL), University of Michigan, Ann Arbor, USA;Computer Architecture Lab (CALCM), Carnegie Mellon University, Pittsburgh, PA, USA;Parallel Systems Architecture Lab (PARSA), Ecole Polytechnique Fédérale de Lausanne, Switzerland;Department of ECE, University of Toronto, Canada
Venue:
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Year:
2008

Citing 34
Cited 7

Trace cache: a low latency approach to high bandwidth instruction fetching

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Performance characterization of a Quad Pentium Pro SMP using OLTP workloads

Proceedings of the 25th annual international symposium on Computer architecture
An analysis of database workload performance on simultaneous multithreaded processors

Proceedings of the 25th annual international symposium on Computer architecture
Cooperative prefetching: compiler and hardware support for effective instruction prefetching in modern processors

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Whole program paths

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Fetch directed instruction prefetching

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Predictor-directed stream buffers

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Optimizations Enabled by a Decoupled Front-End Architecture

IEEE Transactions on Computers
Efficient representations and abstractions for quantifying and exploiting data reference locality

Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Execution-based prediction using speculative slices

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Slipstream processors: improving both performance and fault tolerance

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Code layout optimizations for transaction processing workloads

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Using a user-level memory thread for correlation prefetching

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
DBMSs on a Modern Processor: Where Does Time Go?

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Fetching instruction streams

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Instruction prefetching using branch prediction information

ICCD '97 Proceedings of the 1997 International Conference on Computer Design (ICCD '97)
Branch History Guided Instruction Prefetching

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Call graph prefetching for database applications

ACM Transactions on Computer Systems (TOCS)
Buffering databse operations for enhanced instruction cache performance

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Temporal Streaming of Shared Memory

Proceedings of the 32nd annual international symposium on Computer Architecture
Data Cache Prefetching Using a Global History Buffer

HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
SimFlex: Statistical Sampling of Computer System Simulation

IEEE Micro
Sequential Program Prefetching in Memory Hierarchies

Computer
Trace Scheduling: A Technique for Global Microcode Compaction

IEEE Transactions on Computers
Enlarging Instruction Streams

IEEE Transactions on Computers
Steps towards cache-resident transaction processing

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Predictor virtualization

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Temporal memory streaming

Temporal memory streaming
Runahead Execution: An Effective Alternative to Large Instruction Windows

IEEE Micro
Identifying hierarchical structure in sequences: a linear-time algorithm

Journal of Artificial Intelligence Research

Spatio-temporal memory streaming

Proceedings of the 36th annual international symposium on Computer architecture
Proactive instruction fetch

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Reducing OLTP instruction misses with thread migration

DaMoN '12 Proceedings of the Eighth International Workshop on Data Management on New Hardware
SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
STREX: boosting instruction cache reuse in OLTP workloads through stratified transaction execution

Proceedings of the 40th Annual International Symposium on Computer Architecture
RDIP: return-address-stack directed instruction prefetching

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
SHIFT: shared history instruction fetch for lean-core server processors

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

Quantified Score

Hi-index	0.01

Visualization

Abstract

L1 instruction-cache misses pose a critical performance bottleneck in commercial server workloads. Cache access latency constraints preclude L1 instruction caches large enough to capture the application, library, and OS instruction working sets of these workloads. To cope with capacity constraints, researchers have proposed instruction prefetchers that use branch predictors to explore future control flow. However, such prefetchers suffer from several fundamental flaws: their lookahead is limited by branch prediction bandwidth, their accuracy suffers from geometrically-compounding branch misprediction probability, and they are ignorant of the cache contents, frequently predicting blocks already present in L1. Hence, L1 instruction misses remain a bottleneck. We propose Temporal Instruction Fetch Streaming (TIFS)—a mechanism for prefetching temporally-correlated instruction streams from lower-level caches. Rather than explore a program’s control flow graph, TIFS predicts future instruction-cache misses directly, through recording and replaying recurring L1 instruction miss sequences. In this paper, we first present an information-theoretic offline trace analysis of instruction-miss repetition to show that 94% of L1 instruction misses occur in long, recurring sequences. Then, we describe a practical mechanism to record these recurring sequences in the L2 cache and leverage them for instruction-cache prefetching. Our TIFS design requires less than 5% storage overhead over the baseline L2 cache and improves performance by 11% on average and 24% at best in a suite of commercial server workloads.