Bandwidth constrained coordinated HW/SW prefetching for multicores

Authors:
Sai Prashanth Muralidhara;Mahmut Kandemir;Yuanrui Zhang
Affiliations:
Department of Computer Science and Engineering, Pennsylvania State University, PA;Department of Computer Science and Engineering, Pennsylvania State University, PA;Department of Computer Science and Engineering, Pennsylvania State University, PA
Venue:
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Year:
2011

Citing 21
Cited 0

Tolerating latency through software-controlled prefetching in shared-memory multiprocessors

Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
An effective on-chip preloading scheme to reduce data access penalty

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Evaluating stream buffers as a secondary cache replacement

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Profetching and memory system behavior of the SPEC95 benchmark suite

IBM Journal of Research and Development - Special issue: performance analysis and its impact on design
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Memory access scheduling

Proceedings of the 27th annual international symposium on Computer architecture
Data prefetch mechanisms

ACM Computing Surveys (CSUR)
Simics: A Full System Simulation Platform

Computer
Sequential Hardware Prefetching in Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Fair Queuing Memory Systems

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Memory Prefetching Using Adaptive Stream Detection

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors

ICPP '93 Proceedings of the 1993 International Conference on Parallel Processing - Volume 01
Effective Management of DRAM Bandwidth in Multicore Processors

PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Self-Optimizing Memory Controllers: A Reinforcement Learning Approach

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Prefetch-Aware DRAM Controllers

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Coordinated control of multiple prefetchers in multi-core systems

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Prefetching is a highly effective latency hiding technique that can greatly improve application performance. However, aggressive prefetching can potentially stress the off-chip bandwidth. The resulting bandwidth stalls can potentially negate the performance gain due to prefetching. In this paper, focusing on a multicore environment, we first study the comparative benefits of hardware and software prefetching and analyze if the two are complimentary or redundant. This analysis also evaluates different aggressiveness levels of hardware prefetching. Secondly, we weigh the positive performance benefits of prefetching against the negative performance effects of bandwidth stalls. Thirdly, we propose a hierarchical prefetch management scheme for multicores that controls the prefetch levels such that the overall performance gain is improved. Lastly, we show that our proposed off-chip bandwidth aware prefetch management scheme is very effective in practice, leading to performance gains of upto about 10% in system throughput over a bandwidth agnostic prefetching scheme.