Techniques to reduce or tolerate large memory latencies are critical for achieving high processor performance. Hardware data prefetching is one of the most heavily studied solutions, but it is usually applied to first-level caches, where it can severely disrupt processor behavior by delaying normal cache requests, polluting the cache, and occupying the heavily used bus to the second-level cache. In this article, we show that applying hardware data prefetching to the second-level cache retains most of the benefits of first-level cache prefetching with almost none of its drawbacks. Moreover, we show that second-level hardware data prefetching is particularly well suited to out-of-order (OoO) processors: it can hide the long memory latencies caused by second-level cache misses, while OoO execution of memory instructions can hide the shorter latencies of first-level cache misses that hit in the second-level cache. Finally, we show that when the full memory system is taken into account, especially bus traffic, first-level cache prefetching can actually degrade overall processor performance, while second-level cache prefetching consistently improves it. Our experimental results show that the instructions per cycle (IPC) of floating-point programs (SPEC95) increases by 20% on average with second-level cache hardware data prefetching, while it decreases by 5% on average with first-level cache hardware data prefetching.
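To make the mechanism concrete, the following is a minimal, illustrative sketch of the kind of stride-directed hardware prefetcher the abstract discusses, attached here to a toy second-level cache model. It is not the paper's exact design: the class names (`L2Cache`, `StridePrefetcher`), the fully associative LRU organization, the per-PC stride table, and the one-block-ahead prefetch degree are all simplifying assumptions made for this example.

```python
class L2Cache:
    """Toy fully associative, LRU, block-granular second-level cache.
    Tracks demand misses and hits on prefetched blocks. (Illustrative
    model only; real L2 caches are set-associative.)"""

    def __init__(self, num_blocks, block_size=64):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.blocks = []          # LRU order: front = least recently used
        self.misses = 0
        self.prefetch_hits = 0    # demand hits on prefetched blocks
        self.prefetched = set()   # blocks brought in by the prefetcher

    def access(self, addr):
        """Demand access; returns True on hit."""
        blk = addr // self.block_size
        if blk in self.blocks:
            if blk in self.prefetched:       # first demand use of a
                self.prefetch_hits += 1      # prefetched block: a miss
                self.prefetched.discard(blk) # that prefetching hid
            self.blocks.remove(blk)
            self.blocks.append(blk)          # move to MRU position
            return True
        self.misses += 1
        self._insert(blk)
        return False

    def _insert(self, blk):
        if blk in self.blocks:
            return
        if len(self.blocks) >= self.num_blocks:
            evicted = self.blocks.pop(0)     # evict LRU block
            self.prefetched.discard(evicted)
        self.blocks.append(blk)

    def prefetch(self, addr):
        """Bring a block in speculatively, without counting a miss."""
        blk = addr // self.block_size
        if blk not in self.blocks:
            self._insert(blk)
            self.prefetched.add(blk)


class StridePrefetcher:
    """Per-PC stride table: once the same stride is seen twice in a row
    for a load instruction, prefetch one block ahead into the L2."""

    def __init__(self, l2):
        self.l2 = l2
        self.table = {}  # pc -> (last_addr, last_stride)

    def observe(self, pc, addr):
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, 0)
            return
        last_addr, last_stride = entry
        stride = addr - last_addr
        if stride != 0 and stride == last_stride:
            self.l2.prefetch(addr + stride)  # stride confirmed
        self.table[pc] = (addr, stride)


# Usage: a streaming load (one 64-byte block per iteration) trains the
# prefetcher after two accesses; nearly all later accesses hit on
# prefetched blocks, leaving only the startup misses.
l2 = L2Cache(num_blocks=128)
pf = StridePrefetcher(l2)
for i in range(100):
    addr = i * 64
    l2.access(addr)        # demand access from the first-level cache
    pf.observe(0x400, addr)  # prefetcher observes the L2 access stream
print(l2.misses, l2.prefetch_hits)  # 3 startup misses, 97 hidden misses
```

Because the prefetcher trains on the second-level access stream, its speculative fills contend neither with first-level cache lookups nor with the L1-L2 bus, which is the property the abstract argues makes L2 prefetching safer than L1 prefetching.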