High-bandwidth data memory systems for superscalar processors
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Data prefetching in multiprocessor vector cache memories
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Two-level adaptive training branch prediction
MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
An effective on-chip preloading scheme to reduce data access penalty
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Stride directed prefetching in scalar processors
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Evaluating stream buffers as a secondary cache replacement
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Compiler techniques for data prefetching on the PowerPC
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Effective Hardware-Based Data Prefetching for High-Performance Processors
IEEE Transactions on Computers
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Using value prediction to increase the power of speculative execution hardware
ACM Transactions on Computer Systems (TOCS)
Dependence based prefetching for linked data structures
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Speculation techniques for improving load related instruction scheduling
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Push vs. pull: data movement for linked data structures
Proceedings of the 14th international conference on Supercomputing
Hardware spatial forwarding for widely shared data
Proceedings of the 14th international conference on Supercomputing
Optimizing Overall Loop Schedules Using Prefetching and Partitioning
IEEE Transactions on Parallel and Distributed Systems
DSTRIDE: data-cache miss-address-based stride prefetching scheme for multimedia processors
ACSAC '01 Proceedings of the 6th Australasian conference on Computer systems architecture
Cost-Effective Compiler Directed Memory Prefetching and Bypassing
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Content-Based Prefetching: Initial Results
IMS '00 Revised Papers from the Second International Workshop on Intelligent Memory Systems
Memory-side prefetching for linked data structures for processor-in-memory systems
Journal of Parallel and Distributed Computing
Program Counter-Based Prediction Techniques for Dynamic Power Management
IEEE Transactions on Computers
Program-counter-based pattern classification in buffer caching
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Iterational retiming with partitioning: Loop scheduling with complete memory latency hiding
ACM Transactions on Embedded Computing Systems (TECS)
Hi-index | 0.00 |
We present a new hardware-based data prefetching mechanism for enhancing instruction level parallelism and improving the performance of superscalar processors. The emphasis in our scheme is on the effective utilization of slack time and hardware resources not used for the main computation. The scheme suggests a new hardware construct, the program progress graph (PPG), as a simple extension to the branch target buffer (BTB). We use the PPG for implementing a fast pre-program counter pre-PC, that travels only through memory reference instructions (rather than scanning all the instructions sequentially). In a single clock cycle the pre-PC extracts all the predicted memory references in some future block of instructions, to obtain early data prefetching. In addition, the PPG can be used for implementing a pre-processor and for instruction prefetching. The prefetch requests are scheduled to "range" with the core requests from the data cache, by using only free time slots on the existing data cache tag ports. Employing special methods for removing prefetch requests that are already in the cache (without utilizing the cache-tag ports bandwidth) and a simple optimization on the cache LRU mechanism reduce the number of prefetch requests sent to the core-cache bus and to the memory (second level) bus. Simulation results on the SPEC92 benchmark for the base line architecture (32 K-byte data cache and 12 cycles fetch latency) show an average speedup of 1.36 (CPI ratio).