Stride directed prefetching in scalar processors
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
ATOM: a system for building customized program analysis tools
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Evaluating stream buffers as a secondary cache replacement
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Value locality and load value prediction
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Improving data cache performance by pre-executing instructions under a cache miss
ICS '97 Proceedings of the 11th international conference on Supercomputing
Prefetching using Markov predictors
Proceedings of the 24th annual international symposium on Computer architecture
The predictability of data values
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Dependence based prefetching for linked data structures
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Slice-processors: an implementation of operation-based prediction
ICS '01 Proceedings of the 15th international conference on Supercomputing
Execution-based prediction using speculative slices
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Dead-block prediction & dead-block correlating prefetchers
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Dynamic speculative precomputation
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Automatically characterizing large scale program behavior
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
A stateless, content-directed data prefetching mechanism
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
A quantitative framework for automated pre-execution thread selection
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Enhancing memory level parallelism via recovery-free value prediction
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Slipstream Execution Mode for CMP-Based Multiprocessors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
TCP: Tag Correlating Prefetchers
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Speculative Data-Driven Multithreading
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Differential FCM: Increasing Value Prediction Accuracy by Improving Table Usage Efficiency
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
A Minimal Dual-Core Speculative Multi-Threading Architecture
ICCD '04 Proceedings of the IEEE International Conference on Computer Design
Checkpointed Early Load Retirement
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
CAVA: Hiding L2 Misses with Checkpoint-Assisted Value Prediction
IEEE Computer Architecture Letters
Emerging Models of Computation: Directions in Molecular Computing
Software-Intensive Systems and New Computing Paradigms
Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Leveraging Strength-Based Dynamic Information Flow Analysis to Enhance Data Value Prediction
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
This paper describes future execution (FE), a simple hardware-only technique to accelerate individual program threads running on multicore microprocessors. Our approach uses available idle cores to prefetch important data for the threads executing on the active cores. FE is based on the observation that many cache misses are caused by loads that execute repeatedly and whose address-generating program slices do not change (much) between consecutive executions. To exploit this property, FE dynamically creates a prefetching thread for each active core by simply sending a copy of all committed, register-writing instructions to an otherwise idle core. The key innovation is that on the way to the second core, a value predictor replaces each predictable instruction in the prefetching thread with a load immediate instruction, where the immediate is the predicted result that the instruction is likely to produce during its nth next dynamic execution. Executing this modified instruction stream (i.e., the prefetching thread) on another core allows to compute the future results of the instructions that are not directly predictable, issue prefetches into the shared memory hierarchy, and thus reduce the primary threads' memory access time. We demonstrate the viability and effectiveness of future execution by performing cycle-accurate simulations of a two-way CMP running the single-threaded SPECcpu2000 benchmark suite. Our mechanism improves program performance by 12%, on average, over a baseline that already includes an optimized hardware stream prefetcher. We further show that FE is complementary to runahead execution and that the combination of these two techniques raises the average speedup to 20% above the performance of the baseline processor with the aggressive stream prefetcher.