Reducing memory latency via non-blocking and prefetching caches
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Stride directed prefetching in scalar processors
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Compiler-based prefetching for recursive data structures
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Prefetching using Markov predictors
Proceedings of the 24th annual international symposium on Computer architecture
Dependence based prefetching for linked data structures
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Push vs. pull: data movement for linked data structures
Proceedings of the 14th international conference on Supercomputing
ACM Computing Surveys (CSUR)
Using a user-level memory thread for correlation prefetching
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
A stateless, content-directed data prefetching mechanism
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
The Alpha 21264 Microprocessor
IEEE Micro
Data Flow Analysis for Software Prefetching Linked Data Structures in Java
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
Pointer cache assisted prefetching
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Guided region prefetching: a cooperative hardware/software approach
Proceedings of the 30th annual international symposium on Computer architecture
Reducing DRAM Latencies with an Integrated Memory Hierarchy Design
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Microarchitecture Optimizations for Exploiting Memory-Level Parallelism
Proceedings of the 31st annual international symposium on Computer architecture
Checkpointed Early Load Retirement
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Chip Multithreading: Opportunities and Challenges
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Tolerating memory latency through push prefetching for pointer-intensive applications
ACM Transactions on Architecture and Code Optimization (TACO)
Memory-side prefetching for linked data structures for processor-in-memory systems
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
Modern out-of-order processors with non-blocking caches exploit Memory-Level Parallelism (MLP) by overlapping cache misses in a wide instruction window. The exploitation of MLP, however, can be limited due to long-latency operations in producing the base address of a cache miss load. When the parent instruction is also a cache miss load, a serialization of the two loads must be enforced to satisfy the load-load data dependence.In this paper, we propose a mechanism that dynamically captures the load-load data dependences at runtime. A special Preload is issued in place of the dependent load without waiting for the parent load, thus effectively overlapping the two loads. The Preload provides necessary information for the memory controller to calculate the correct memory address upon the availability of the parent's data to eliminate any interconnect delay between the two loads. Performance evaluations based on SPEC2000 and Olden applications show that significant speedups up to 40% with an average of 16% are achievable using the Preload. In conjunction with other aggressive MLP exploitation methods, such as runahead execution, the Preload can make more significant improvement with an average of 22%.