Instruction cache miss latency is becoming an increasingly important performance bottleneck, especially for commercial applications. Although instruction prefetching is an attractive technique for tolerating this latency, we find that existing prefetching schemes are insufficient for modern superscalar processors, since they fail to issue prefetches early enough (particularly for nonsequential accesses). To overcome these limitations, we propose a new instruction prefetching technique whereby the hardware and software cooperate to hide the latency as follows. The hardware performs aggressive sequential prefetching combined with a novel prefetch filtering mechanism that allows it to get far ahead without polluting the cache. To hide the latency of nonsequential accesses, we propose and implement a novel compiler algorithm which automatically inserts instruction prefetches for the targets of control transfers far enough in advance. Our experimental results demonstrate that this new approach hides 50% or more of the latency remaining with the best previous techniques, while at the same time reducing the number of useless prefetches by a factor of six. We find that both the prefetch filtering and compiler-inserted prefetching components of our design are essential and complementary, and that the compiler can limit the code expansion to only 9% on average. In addition, we show that the performance of our technique can be further increased by using profiling information to help reduce cache conflicts and unnecessary prefetches. From an architectural perspective, these performance advantages are sustained over a range of common miss latencies and bandwidths. Finally, our technique is cost effective as well, since it delivers performance comparable to (or even better than) that of larger caches, but requires a much smaller hardware budget.
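The compiler component described above can be illustrated with a minimal sketch. The idea is that for each basic block ending in a control transfer whose target is known at compile time, the compiler places an instruction prefetch of that target some fixed number of instructions before the transfer, so the prefetch has time to complete. Everything here (the `BasicBlock` representation, the `iprefetch` mnemonic, the fixed `distance` heuristic, and the fallback of placing the prefetch at the block entry when the block is too short) is an illustrative assumption, not the paper's actual algorithm:

```python
# Sketch of compiler-inserted instruction prefetching for nonsequential
# control-transfer targets. All names and heuristics are illustrative
# assumptions, not the algorithm from the paper.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BasicBlock:
    label: str
    instrs: List[str]             # straight-line instructions in the block
    target: Optional[str] = None  # nonsequential successor (branch/call target)

def insert_prefetches(blocks: List[BasicBlock], distance: int = 8) -> List[BasicBlock]:
    """For each block ending in a control transfer, insert a prefetch of the
    target `distance` instructions before the end of the block. If the block
    is shorter than `distance`, place the prefetch at the block entry (a real
    compiler would instead hoist it into predecessor blocks)."""
    for b in blocks:
        if b.target is None:
            continue  # falls through sequentially; hardware prefetcher covers it
        pos = max(0, len(b.instrs) - distance)
        b.instrs.insert(pos, f"iprefetch {b.target}")
    return blocks
```

For example, a 12-instruction block branching to `L9` would get `iprefetch L9` inserted 8 instructions before its end, giving the prefetch roughly 8 instructions' worth of slack to overlap with useful work.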