Stride directed prefetching in scalar processors
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Wrong-path instruction prefetching
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Performance characterization of a Quad Pentium Pro SMP using OLTP workloads
Proceedings of the 25th annual international symposium on Computer architecture
An analysis of database workload performance on simultaneous multithreaded processors
Proceedings of the 25th annual international symposium on Computer architecture
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Fetch directed instruction prefetching
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Optimizations Enabled by a Decoupled Front-End Architecture
IEEE Transactions on Computers
Execution-based prediction using speculative slices
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Slipstream processors: improving both performance and fault tolerance
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Code layout optimizations for transaction processing workloads
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
DBMSs on a Modern Processor: Where Does Time Go?
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Instruction Cache Prefetching Using Multilevel Branch Prediction
ISHPC '97 Proceedings of the International Symposium on High Performance Computing
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Instruction prefetching using branch prediction information
ICCD '97 Proceedings of the 1997 International Conference on Computer Design (ICCD '97)
Branch History Guided Instruction Prefetching
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
In-Line Interrupt Handling for Software-Managed TLBs
ICCD '01 Proceedings of the International Conference on Computer Design: VLSI in Computers & Processors
Call graph prefetching for database applications
ACM Transactions on Computer Systems (TOCS)
Buffering databse operations for enhanced instruction cache performance
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Temporal Streaming of Shared Memory
Proceedings of the 32nd annual international symposium on Computer Architecture
Data Cache Prefetching Using a Global History Buffer
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
IEEE Transactions on Computers
Steps towards cache-resident transaction processing
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Temporal instruction fetch streaming
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Spatio-temporal memory streaming
Proceedings of the 36th annual international symposium on Computer architecture
Reactive NUCA: near-optimal block placement and replication in distributed caches
Proceedings of the 36th annual international symposium on Computer architecture
The IBM system/360 model 91: machine philosophy and instruction-handling
IBM Journal of Research and Development
Reducing OLTP instruction misses with thread migration
DaMoN '12 Proceedings of the Eighth International Workshop on Data Management on New Hardware
SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
OLTP in wonderland: where do cache misses come from in major OLTP components?
Proceedings of the Ninth International Workshop on Data Management on New Hardware
STREX: boosting instruction cache reuse in OLTP workloads through stratified transaction execution
Proceedings of the 40th Annual International Symposium on Computer Architecture
RDIP: return-address-stack directed instruction prefetching
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
SHIFT: shared history instruction fetch for lean-core server processors
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Large-reach memory management unit caches
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
Fast access requirements preclude building L1 instruction caches large enough to capture the working set of server workloads. Efforts exist to mitigate limited L1 instruction cache capacity by relying on the stability and repetitiveness of the instruction stream to predict and prefetch future instruction blocks prior to their use. However, dynamic variation in cache miss sequences prevents correct and timely prediction, leaving many instruction-fetch stalls exposed, resulting in a key performance bottleneck for servers. We observe that, while the vast majority of application instruction references are amenable to prediction, even minor control-flow variations are amplified by microarchitectural components, resulting in a major source of instability and randomness that significantly limit prefetcher utility. Control-flow variation disturbs the L1 instruction cache replacement order and branch predictor state, causing the L1 instruction cache to randomly filter the instruction stream while the branch predictor and spontaneous hardware interrupts inject the stream with unpredictable noise. Based on this observation, we show that an instruction prefetcher, previously plagued by microarchitectural instability, becomes nearly perfect when modified to operate on the correct-path, retire-order instruction stream. We propose Proactive Instruction Fetch, an instruction prefetch mechanism that achieves higher than 99.5% instruction-cache hit rate, improving server throughput by 27% and nearly matching the performance of a perfect L1 instruction cache that never misses.