Performance driven data cache prefetching in a dynamic software optimization system
Proceedings of the 21st annual international conference on Supercomputing
Focused prefetching: performance oriented prefetching based on commit stalls
Proceedings of the 22nd annual international conference on Supercomputing
Analyzing memory access intensity in parallel programs on multicore
Proceedings of the 22nd annual international conference on Supercomputing
Low-Cost Adaptive Data Prefetching
Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Adaptive insertion policies for managing shared caches
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Prefetch-Aware DRAM Controllers
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Coordinated control of multiple prefetchers in multi-core systems
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Improving memory bank-level parallelism in the presence of prefetching
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Timing local streams: improving timeliness in data prefetching
Proceedings of the 24th ACM International Conference on Supercomputing
High performance cache replacement using re-reference interval prediction (RRIP)
Proceedings of the 37th annual international symposium on Computer architecture
An Adaptive Data Prefetcher for High-Performance Processors
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Extended histories: improving regularity and performance in correlation prefetchers
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Cache equalizer: a placement mechanism for chip multiprocessor distributed shared caches
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Resolving a L2-prefetch-caused parallel nonscaling on Intel Core microarchitecture
Journal of Parallel and Distributed Computing
Studying the impact of hardware prefetching and bandwidth partitioning in chip-multiprocessors
Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Prefetch-aware shared resource management for multi-core systems
Proceedings of the 38th annual international symposium on Computer architecture
Studying the impact of hardware prefetching and bandwidth partitioning in chip-multiprocessors
ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Understanding stencil code performance on multicore architectures
Proceedings of the 8th ACM International Conference on Computing Frontiers
Bandwidth constrained coordinated HW/SW prefetching for multicores
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Using runtime activity to dynamically filter out inefficient data prefetches
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Global-aware and multi-order context-based prefetching for high-performance processors
International Journal of High Performance Computing Applications
ABS: A low-cost adaptive controller for prefetching in a banked shared last-level cache
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
When Prefetching Works, When It Doesn’t, and Why
ACM Transactions on Architecture and Code Optimization (TACO)
CRUISE: cache replacement and utility-aware scheduling
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
REEact: a customizable virtual execution manager for multicore platforms
VEE '12 Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments
SHiP: signature-based hit predictor for high performance caching
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
PACMan: prefetch-aware cache management for high performance caching
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
ACM Transactions on Computer Systems (TOCS)
Improving writeback efficiency with decoupled last-write prediction
Proceedings of the 39th Annual International Symposium on Computer Architecture
Base-delta-immediate compression: practical data compression for on-chip caches
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Application-aware prefetch prioritization in on-chip networks
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Hardware prefetchers for emerging parallel applications
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
To hardware prefetch or not to prefetch?: a virtualized environment study and core binding approach
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Diagnosis and optimization of application prefetching performance
Proceedings of the 27th international ACM conference on International conference on supercomputing
Coordinating prefetching and STT-RAM based last-level cache management for multicore systems
Proceedings of the 23rd ACM international conference on Great lakes symposium on VLSI
Improving memory scheduling via processor-side load criticality information
Proceedings of the 40th Annual International Symposium on Computer Architecture
Orchestrated scheduling and prefetching for GPGPUs
Proceedings of the 40th Annual International Symposium on Computer Architecture
APOGEE: adaptive prefetching on GPUs for energy efficiency
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Meeting midway: improving CMP performance with memory-side prefetching
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Linearly compressed pages: a low-complexity, low-latency main memory compression framework
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
The reuse cache: downsizing the shared last-level cache
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Practical models for energy-efficient prefetching in mobile embedded systems
Microprocessors & Microsystems
Mortar: filling the gaps in data center memory
Proceedings of the 10th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Hi-index | 0.00 |
High performance processors employ hardware data prefetching to reduce the negative performance impact of large main memory latencies. While prefetching improves performance substantially on many programs, it can significantly reduce performance on others. Also, prefetching can significantly increase memory bandwidth requirements. This paper proposes a mechanism that incorporates dynamic feedback into the design of the prefetcher to increase the performance improvement provided by prefetching as well as to reduce the negative performance and bandwidth impact of prefetching. Our mechanism estimates prefetcher accuracy, prefetcher timeliness, and prefetcher-caused cache pollution to adjust the aggressiveness of the data prefetcher dynamically. We introduce a new method to track cache pollution caused by the prefetcher at run-time. We also introduce a mechanism that dynamically decides where in the LRU stack to insert the prefetched blocks in the cache based on the cache pollution caused by the prefetcher. Using the proposed dynamic mechanism improves average performance by 6.5% on 17 memory-intensive benchmarks in the SPEC CPU2000 suite compared to the best-performing conventional stream-based data prefetcher configuration, while it consumes 18.7% less memory bandwidth. Compared to a conventional stream-based data prefetcher configuration that consumes similar amount of memory bandwidth, feedback directed prefetching provides 13.6% higher performance. Our results show that feedback-directed prefetching eliminates the large negative performance impact incurred on some benchmarks due to prefetching, and it is applicable to stream-based prefetchers, global-history-buffer based delta correlation prefetchers, and PC-based stride prefetchers.