Instruction cache miss latency is becoming an increasingly important performance bottleneck, especially for commercial applications. Although instruction prefetching is an attractive technique for tolerating this latency, we find that existing prefetching schemes are insufficient for modern superscalar processors, since they fail to issue prefetches early enough (particularly for nonsequential accesses). To overcome these limitations, we propose a new instruction prefetching technique whereby the hardware and software cooperate to hide the latency as follows. The hardware performs aggressive sequential prefetching combined with a novel prefetch filtering mechanism that allows it to get far ahead without polluting the cache. To hide the latency of nonsequential accesses, we propose and implement a novel compiler algorithm which automatically inserts instruction prefetches for the targets of control transfers far enough in advance. Our experimental results demonstrate that this new approach hides 50% or more of the latency remaining with the best previous techniques, while at the same time reducing the number of useless prefetches by a factor of six. We find that both the prefetch filtering and compiler-inserted prefetching components of our design are essential and complementary, and that the compiler can limit the code expansion to only 9% on average. In addition, we show that the performance of our technique can be further increased by using profiling information to help reduce cache conflicts and unnecessary prefetches. From an architectural perspective, these performance advantages are sustained over a range of common miss latencies and bandwidths. Finally, our technique is cost effective as well, since it delivers performance comparable to (or even better than) that of larger caches, but requires a much smaller hardware budget.
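The compiler component described above can be illustrated with a minimal sketch. The idea is that for each basic block ending in a control transfer whose target is known at compile time, the compiler places an instruction prefetch of that target some fixed number of instructions before the transfer, so the prefetch has time to complete. Everything here (the `BasicBlock` representation, the `iprefetch` mnemonic, the fixed `distance` heuristic, and the fallback of placing the prefetch at the block entry when the block is too short) is an illustrative assumption, not the paper's actual algorithm:

```python
# Sketch of compiler-inserted instruction prefetching for nonsequential
# control-transfer targets. All names and heuristics are illustrative
# assumptions, not the algorithm from the paper.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BasicBlock:
    label: str
    instrs: List[str]             # straight-line instructions in the block
    target: Optional[str] = None  # nonsequential successor (branch/call target)

def insert_prefetches(blocks: List[BasicBlock], distance: int = 8) -> List[BasicBlock]:
    """For each block ending in a control transfer, insert a prefetch of the
    target `distance` instructions before the end of the block. If the block
    is shorter than `distance`, place the prefetch at the block entry (a real
    compiler would instead hoist it into predecessor blocks)."""
    for b in blocks:
        if b.target is None:
            continue  # falls through sequentially; hardware prefetcher covers it
        pos = max(0, len(b.instrs) - distance)
        b.instrs.insert(pos, f"iprefetch {b.target}")
    return blocks
```

For example, a 12-instruction block branching to `L9` would get `iprefetch L9` inserted 8 instructions before its end, giving the prefetch roughly 8 instructions' worth of slack to overlap with useful work.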