Tango: a hardware-based data prefetching technique for superscalar processors

Authors:
Shlomit S. Pinter;Adi Yoaz
Affiliations:
IBM Science and Technology, MATAM Advanced Technology Center, Haifa 31905, Israel;Intel Israel (74), MATAM Advanced Technology Center, Haifa 31905, Israel
Venue:
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Year:
1996

Abstract

We present a new hardware-based data prefetching mechanism for enhancing instruction-level parallelism and improving the performance of superscalar processors. The emphasis in our scheme is on the effective utilization of slack time and of hardware resources not used for the main computation. The scheme introduces a new hardware construct, the program progress graph (PPG), as a simple extension to the branch target buffer (BTB). We use the PPG to implement a fast pre-program counter (pre-PC) that travels only through memory reference instructions, rather than scanning all the instructions sequentially. In a single clock cycle, the pre-PC extracts all the predicted memory references in a future block of instructions, so that data can be prefetched early. In addition, the PPG can be used for implementing a pre-processor and for instruction prefetching. The prefetch requests are scheduled to "tango" with the core requests from the data cache, by using only free time slots on the existing data cache tag ports. Special methods for removing prefetch requests for data that is already in the cache (without using cache-tag port bandwidth), together with a simple optimization of the cache LRU mechanism, reduce the number of prefetch requests sent to the core-cache bus and to the memory (second-level) bus. Simulation results on the SPEC92 benchmarks for the baseline architecture (32-Kbyte data cache and 12-cycle fetch latency) show an average speedup of 1.36, measured as a CPI ratio.
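
The interplay the abstract describes (a pre-PC that advances one block at a time through the PPG, a filter for data already cached, and prefetches issued only in free tag-port slots) can be illustrated with a toy software model. This is a minimal sketch under stated assumptions, not the paper's hardware design: the `PPG` dictionary layout, the `run_pre_pc` function, the one-prefetch-per-free-cycle policy, the 32-byte line size, and the use of a Python set as the "already in cache" filter are all hypothetical choices made for illustration.

```python
from collections import deque

# Hypothetical PPG: each node is a block keyed by its start address.
# "mem_refs" holds (instruction address, predicted data address) pairs and
# "next" is the predicted successor block, as a BTB-like structure might supply.
PPG = {
    0x100: {"mem_refs": [(0x104, 0x8000), (0x10C, 0x8040)], "next": 0x200},
    0x200: {"mem_refs": [(0x208, 0x9000)], "next": 0x300},
    0x300: {"mem_refs": [], "next": 0x100},
}


def run_pre_pc(ppg, start_block, cached_lines, port_busy, line_size=32):
    """Walk the PPG one block per cycle and issue prefetches in free slots.

    port_busy[c] is True when the core uses the cache tag port in cycle c.
    Returns the list of (cycle, line_address) prefetches that were issued.
    """
    pending = deque()          # prefetch requests waiting for a free slot
    issued = []
    seen = set(cached_lines)   # assumed stand-in for the paper's request filter
    block = start_block
    for cycle, busy in enumerate(port_busy):
        # pre-PC step: collect every predicted reference of one future block
        for _pc, addr in ppg[block]["mem_refs"]:
            line = addr // line_size * line_size
            if line not in seen:   # drop requests already cached or queued
                seen.add(line)
                pending.append(line)
        block = ppg[block]["next"]
        # issue at most one prefetch, and only when the core leaves the port idle
        if not busy and pending:
            issued.append((cycle, pending.popleft()))
    return issued
```

For example, `run_pre_pc(PPG, 0x100, {0x8040}, [True, False, False, True, False])` never issues a request for line `0x8040`, because the filter treats it as already cached, and it defers the request for `0x8000` from cycle 0 to cycle 1, the first cycle in which the modeled tag port is free.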