Execution History Guided Instruction Prefetching

Authors:
Yi Zhang;Steve Haga;Rajeev Barua
Affiliations:
Actuate Corporation, South San Francisco, CA 94080, USA;Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA;Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA
Venue:
The Journal of Supercomputing
Year:
2004

Citing 16
Cited 0

Contrasting characteristics and cache performance of technical and multi-user commercial workloads. ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Wrong-path instruction prefetching. Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
On high-bandwidth data cache design for multi-issue processors. MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
A Performance Study of Instruction Cache Prefetching Methods. IEEE Transactions on Computers
CPU Cache Prefetching: Timing Evaluation of Hardware Implementations. IEEE Transactions on Computers
Computer architecture (2nd ed.): a quantitative approach
Cooperative prefetching: compiler and hardware support for effective instruction prefetching in modern processors. MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Improving prediction for procedure returns with return-address-stack repair mechanisms. MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Prefetching Using Markov Predictors. IEEE Transactions on Computers - Special issue on cache memory and related problems
Fetch directed instruction prefetching. Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Circuits for wide-window superscalar processors. Proceedings of the 27th annual international symposium on Computer architecture
Cache Memories. ACM Computing Surveys (CSUR)
One Billion Transistors, One Uniprocessor, One Chip. Computer
UltraSPARC-III: Designing Third-Generation 64-Bit Performance. IEEE Micro
DBMSs on a Modern Processor: Where Does Time Go? VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Branch History Guided Instruction Prefetching. HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture

Abstract

The increasing gap in performance between processors and main memory has made effective instruction prefetching techniques more important than ever. A major deficiency of existing prefetching methods is that most of them require an extra port to the I-cache. A recent study by Rivers et al. [19] shows that this factor alone explains why most modern microprocessors do not use such hardware-based I-cache prefetch schemes. The contribution of this paper is two-fold. First, we present a method that does not require an extra port to the I-cache. Second, the performance improvement of our method is greater than that of the best competing method, BHGP [23], even disregarding the improvement from not having an extra port. The three key features of our method that prevent the above deficiencies are as follows. First, late prefetching is prevented by correlating misses to dynamically preceding instructions. For example, if the I-cache miss latency is 12 cycles, then the instruction that was fetched 12 cycles prior to the miss is used as the prefetch trigger. Second, the miss history table is kept to a reasonable size by grouping contiguous cache misses together and associating them with one preceding instruction and, therefore, one table entry. Third, the extra I-cache port is avoided through efficient prefetch filtering methods. Experiments show that for our benchmarks, chosen for their poor I-cache performance, an average improvement of 9.2% in runtime is achieved versus the BHGP method [23], while the hardware cost is also reduced. The improvement will be greater if the runtime impact of avoiding an extra port is considered. When compared to the original machine without prefetching, our method improves performance by about 35% for our benchmarks.
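
The trigger selection and miss grouping described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the paper: it assumes one instruction fetch per cycle, a 32-byte cache block, and an unbounded dictionary in place of a fixed-size miss history table. The function names (`train_miss_history`, `prefetches_for`) and the "skip blocks already cached" check are hypothetical; the abstract does not specify how the prefetch filtering actually works.

```python
from collections import deque

MISS_LATENCY = 12  # cycles; the value used in the abstract's example
BLOCK_SIZE = 32    # bytes per I-cache block (assumed for illustration)


def block_of(addr, block_size=BLOCK_SIZE):
    """Return the cache block number that contains addr."""
    return addr // block_size


def train_miss_history(trace, miss_latency=MISS_LATENCY):
    """Build a table: trigger address -> (first missed block, block count).

    trace is a list of (address, missed) pairs, assumed to be one fetch
    per cycle, so "N cycles earlier" means "N fetches earlier".
    """
    table = {}
    # After an append, recent[0] is the fetch made miss_latency cycles
    # before the newest one (once the window is full).
    recent = deque(maxlen=miss_latency + 1)
    last_trigger = None
    last_miss_block = None
    for addr, missed in trace:
        recent.append(addr)
        if not missed:
            continue
        blk = block_of(addr)
        if last_trigger is not None and blk == last_miss_block + 1:
            # Contiguous with the previous miss: extend the same entry
            # instead of allocating a new one.
            first, count = table[last_trigger]
            table[last_trigger] = (first, count + 1)
        elif len(recent) == miss_latency + 1:
            # New miss group: correlate it with the earlier instruction.
            last_trigger = recent[0]
            table[last_trigger] = (blk, 1)
        else:
            # Not enough fetch history yet to pick a trigger.
            last_trigger = None
        last_miss_block = blk
    return table


def prefetches_for(addr, table, cached_blocks):
    """Blocks to prefetch when addr is fetched.

    Blocks already in cached_blocks are dropped; this is a stand-in
    filter, assumed here, not the paper's filtering mechanism.
    """
    entry = table.get(addr)
    if entry is None:
        return []
    first, count = entry
    return [b for b in range(first, first + count) if b not in cached_blocks]
```

In this model, two back-to-back misses in adjacent blocks produce a single table entry keyed by the instruction fetched `miss_latency` cycles before the first miss, which is the size-saving effect the abstract attributes to grouping contiguous misses.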