Dynamic path-based branch correlation
Proceedings of the 28th annual international symposium on Microarchitecture
Wrong-path instruction prefetching
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Fetch directed instruction prefetching
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Hardware-only stream prefetching and dynamic access ordering
Proceedings of the 14th international conference on Supercomputing
Selective, accurate, and timely self-invalidation using last-touch prediction
Proceedings of the 27th annual international symposium on Computer architecture
Architectural and compiler support for effective instruction prefetching: a cooperative approach
ACM Transactions on Computer Systems (TOCS)
Optimizations Enabled by a Decoupled Front-End Architecture
IEEE Transactions on Computers
Execution-based prediction using speculative slices
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Slipstream processors: improving both performance and fault tolerance
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Dead-block prediction & dead-block correlating prefetchers
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Code layout optimizations for transaction processing workloads
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Instruction Cache Prefetching Using Multilevel Branch Prediction
ISHPC '97 Proceedings of the International Symposium on High Performance Computing
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Instruction prefetching using branch prediction information
ICCD '97 Proceedings of the 1997 International Conference on Computer Design (ICCD '97)
Branch History Guided Instruction Prefetching
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Call graph prefetching for database applications
ACM Transactions on Computer Systems (TOCS)
Buffering databse operations for enhanced instruction cache performance
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
IEEE Transactions on Computers
Steps towards cache-resident transaction processing
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Temporal instruction fetch streaming
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Reactive NUCA: near-optimal block placement and replication in distributed caches
Proceedings of the 36th annual international symposium on Computer architecture
The IBM system/360 model 91: machine philosophy and instruction-handling
IBM Journal of Research and Development
ACM SIGARCH Computer Architecture News
Clearing the clouds: a study of emerging scale-out workloads on modern hardware
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Workload analysis of a large-scale key-value store
Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
Proceedings of the 39th Annual International Symposium on Computer Architecture
Full-system analysis and characterization of interactive smartphone applications
IISWC '11 Proceedings of the 2011 IEEE International Symposium on Workload Characterization
Thin servers with smart pipes: designing SoC accelerators for memcached
Proceedings of the 40th Annual International Symposium on Computer Architecture
SHIFT: shared history instruction fetch for lean-core server processors
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
SHIFT: shared history instruction fetch for lean-core server processors
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
L1 instruction fetch misses remain a critical performance bottleneck, accounting for up to 40% slowdowns in server applications. Whereas instruction footprints typically fit within last-level caches, they overwhelm L1 caches, whose capacity is limited by latency constraints. Past work has shown that server application instruction miss sequences are highly repetitive. By recording, indexing, and prefetching according to these sequences, nearly all L1 instruction misses can be eliminated. However, existing schemes require impractical storage and considerable complexity to correct for minor control-flow variations that disrupt sequences. In this work, we simplify and reduce the energy requirements of accurate instruction prefetching via two observations: (1) program context as captured in the call stack correlates strongly with L1 instruction misses, and (2) the return address stack (RAS), already present in all high performance processors, succinctly summarizes program context. We propose RAS-Directed Instruction Prefetching (RDIP), which associates prefetch operations with signatures formed from the contents of the RAS. RDIP achieves 70% of the potential speedup of an ideal L1 cache, outperforms a prefetcherless baseline by 11.5% and reduces energy and complexity relative to sequence-based prefetching. RDIP's performance is within 2% of the state-of-the-art Proactive Instruction Fetch, with nearly 3X reduction in storage and 1.9X reduction in energy overheads.