Proactive instruction fetch

Authors:
Michael Ferdman;Cansu Kaynak;Babak Falsafi
Affiliations:
Carnegie Mellon University, Pittsburgh, PA, and Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland;Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland;Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Venue:
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2011

Citing 33
Cited 7

Stride directed prefetching in scalar processors

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Wrong-path instruction prefetching

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Performance characterization of a Quad Pentium Pro SMP using OLTP workloads

Proceedings of the 25th annual international symposium on Computer architecture
An analysis of database workload performance on simultaneous multithreaded processors

Proceedings of the 25th annual international symposium on Computer architecture
Cooperative prefetching: compiler and hardware support for effective instruction prefetching in modern processors

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Fetch directed instruction prefetching

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Optimizations Enabled by a Decoupled Front-End Architecture

IEEE Transactions on Computers
Execution-based prediction using speculative slices

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Slipstream processors: improving both performance and fault tolerance

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Code layout optimizations for transaction processing workloads

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
DBMSs on a Modern Processor: Where Does Time Go?

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Instruction Cache Prefetching Using Multilevel Branch Prediction

ISHPC '97 Proceedings of the International Symposium on High Performance Computing
Fetching instruction streams

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Instruction prefetching using branch prediction information

ICCD '97 Proceedings of the 1997 International Conference on Computer Design (ICCD '97)
Branch History Guided Instruction Prefetching

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
In-Line Interrupt Handling for Software-Managed TLBs

ICCD '01 Proceedings of the International Conference on Computer Design: VLSI in Computers & Processors
Call graph prefetching for database applications

ACM Transactions on Computer Systems (TOCS)
Buffering databse operations for enhanced instruction cache performance

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Temporal Streaming of Shared Memory

Proceedings of the 32nd annual international symposium on Computer Architecture
Data Cache Prefetching Using a Global History Buffer

HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
SimFlex: Statistical Sampling of Computer System Simulation

IEEE Micro
Sequential Program Prefetching in Memory Hierarchies

Computer
Enlarging Instruction Streams

IEEE Transactions on Computers
Steps towards cache-resident transaction processing

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Predictor virtualization

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Runahead Execution: An Effective Alternative to Large Instruction Windows

IEEE Micro
Temporal instruction fetch streaming

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Spatio-temporal memory streaming

Proceedings of the 36th annual international symposium on Computer architecture
Reactive NUCA: near-optimal block placement and replication in distributed caches

Proceedings of the 36th annual international symposium on Computer architecture
The IBM system/360 model 91: machine philosophy and instruction-handling

IBM Journal of Research and Development

Reducing OLTP instruction misses with thread migration

DaMoN '12 Proceedings of the Eighth International Workshop on Data Management on New Hardware
SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
OLTP in wonderland: where do cache misses come from in major OLTP components?

Proceedings of the Ninth International Workshop on Data Management on New Hardware
STREX: boosting instruction cache reuse in OLTP workloads through stratified transaction execution

Proceedings of the 40th Annual International Symposium on Computer Architecture
RDIP: return-address-stack directed instruction prefetching

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
SHIFT: shared history instruction fetch for lean-core server processors

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Large-reach memory management unit caches

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

Fast access requirements preclude building L1 instruction caches large enough to capture the working set of server workloads. Efforts exist to mitigate limited L1 instruction cache capacity by relying on the stability and repetitiveness of the instruction stream to predict and prefetch future instruction blocks prior to their use. However, dynamic variation in cache miss sequences prevents correct and timely prediction, leaving many instruction-fetch stalls exposed, resulting in a key performance bottleneck for servers. We observe that, while the vast majority of application instruction references are amenable to prediction, even minor control-flow variations are amplified by microarchitectural components, resulting in a major source of instability and randomness that significantly limit prefetcher utility. Control-flow variation disturbs the L1 instruction cache replacement order and branch predictor state, causing the L1 instruction cache to randomly filter the instruction stream while the branch predictor and spontaneous hardware interrupts inject the stream with unpredictable noise. Based on this observation, we show that an instruction prefetcher, previously plagued by microarchitectural instability, becomes nearly perfect when modified to operate on the correct-path, retire-order instruction stream. We propose Proactive Instruction Fetch, an instruction prefetch mechanism that achieves higher than 99.5% instruction-cache hit rate, improving server throughput by 27% and nearly matching the performance of a perfect L1 instruction cache that never misses.