Studying the impact of hardware prefetching and bandwidth partitioning in chip-multiprocessors

Authors:
Fang Liu;Yan Solihin
Affiliations:
North Carolina State University, Raleigh, NC, USA;North Carolina State University, Raleigh, NC, USA
Venue:
Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Year:
2011

Citing 26
Cited 4

Understanding some simple processor-performance limits

IBM Journal of Research and Development - Special issue: performance analysis and its impact on design
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Memory access scheduling

Proceedings of the 27th annual international symposium on Computer architecture
Symbiotic jobscheduling for a simultaneous multithreaded processor

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Simics: A Full System Simulation Platform

Computer
Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
POWER5 System microarchitecture

IBM Journal of Research and Development - POWER5 and packaging
Reducing Cache Pollution via Dynamic Data Prefetch Filtering

IEEE Transactions on Computers
Fair Queuing Memory Systems

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Effective Management of DRAM Bandwidth in Multicore Processors

PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
IBM POWER6 microarchitecture

IBM Journal of Research and Development
Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Self-Optimizing Memory Controllers: A Reinforcement Learning Approach

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Characterizing and modeling the behavior of context switch misses

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Prefetch-Aware DRAM Controllers

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Scaling the bandwidth wall: challenges in and avenues for CMP scaling

Proceedings of the 36th annual international symposium on Computer architecture
Architecture Support for Improving Bulk Memory Copying and Initialization Performance

PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Coordinated control of multiple prefetchers in multi-core systems

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Understanding the behavior and implications of context switch misses

ACM Transactions on Architecture and Code Optimization (TACO)
CHOP: Integrating DRAM Caches for CMP Server Platforms

IEEE Micro
Architectural framework for supporting operating system survivability

HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
ACCESS: Smart scheduling for asymmetric cache CMPs

HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
SRP: symbiotic resource partitioning of the memory hierarchy in CMPs

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers

Cuanta: quantifying effects of shared on-chip resource interference for consolidated virtual machines

Proceedings of the 2nd ACM Symposium on Cloud Computing
Making data prefetch smarter: adaptive prefetching on POWER7

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
To hardware prefetch or not to prefetch?: a virtualized environment study and core binding approach

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
A performance-aware quality of service-driven scheduler for multicore processors

ACM SIGBED Review - Special Issue on the 3rd Embedded Operating System Workshop (EWiLi 2013)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Modern high performance microprocessors widely employ hardware prefetching technique to hide long memory access latency. While very useful, hardware prefetching tends to aggravate the bandwidth wall, a problem where system performance is increasingly limited by the availability of the off-chip pin bandwidth in Chip Multi-Processors (CMPs). In this paper, we propose an analytical model-based study to investigate how hardware prefetching and memory bandwidth partitioning impact CMP system performance and how they interact. The model includes a composite prefetching metric that can help determine under which conditions prefetching can improve system performance, a bandwidth partitioning model that takes into account prefetching effects, and a derivation of the weighted speedup optimum bandwidth partition sizes for different cores. Through model-driven case studies, we find several interesting observations that can be valuable for future CMP system design and optimization. We also explore simulation-based empirical evaluation to validate the observations and show that maximum system performance can be achieved by selective prefetching, guided by the composite prefetching metric, coupled with dynamic bandwidth partitioning.