Guided region prefetching: a cooperative hardware/software approach

Authors:
Zhenlin Wang; Doug Burger; Kathryn S. McKinley; Steven K. Reinhardt; Charles C. Weems
Affiliations:
Univ. of Massachusetts, Amherst; The University of Texas at Austin; The University of Texas at Austin; University of Michigan; Univ. of Massachusetts, Amherst
Venue:
Proceedings of the 30th annual international symposium on Computer architecture
Year:
2003

Abstract

Despite large caches, main-memory access latencies still cause significant performance losses in many applications. Numerous hardware and software prefetching schemes have been proposed to tolerate these latencies. Software prefetching typically provides better prefetch accuracy than hardware prefetching, but is limited by prefetch instruction overheads and the compiler's limited ability to schedule prefetches sufficiently far in advance to cover level-two cache miss latencies. Hardware prefetching can be effective at hiding these large latencies, but generates many useless prefetches and consumes considerable memory bandwidth. In this paper, we propose a cooperative hardware-software prefetching scheme called Guided Region Prefetching (GRP), which uses compiler-generated hints encoded in load instructions to regulate an aggressive hardware prefetching engine. We compare GRP against a sophisticated pure hardware stride prefetcher and a scheduled region prefetching (SRP) engine. SRP and GRP show the best performance, with respective 22% and 21% gains over no prefetching, but SRP incurs 180% extra memory traffic, nearly tripling bandwidth requirements. GRP achieves performance close to SRP, but with a mere eighth of the extra prefetching traffic, a 23% increase over no prefetching. The GRP hardware-software collaboration thus combines the accuracy of compiler-based program analysis with the performance potential of aggressive hardware prefetching, bringing the performance gap versus a perfect L2 cache under 20%.
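
The traffic figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, assuming that "extra traffic" is expressed as a fraction of a no-prefetching baseline normalized to 1.0; the function and variable names are illustrative and do not come from the paper.

```python
def total_traffic(extra_fraction, baseline=1.0):
    """Total memory traffic, given extra prefetch traffic as a fraction of baseline."""
    return baseline * (1.0 + extra_fraction)


# Figures taken from the abstract; the normalization to 1.0 is an assumption.
SRP_EXTRA = 1.80           # SRP: 180% extra memory traffic
GRP_EXTRA = SRP_EXTRA / 8  # GRP: "a mere eighth" of SRP's extra traffic

srp_total = total_traffic(SRP_EXTRA)  # total SRP traffic relative to baseline
grp_total = total_traffic(GRP_EXTRA)  # total GRP traffic relative to baseline

print(f"SRP total traffic: {srp_total:.2f}x baseline")
print(f"GRP extra traffic: {GRP_EXTRA:.1%} of baseline")
print(f"GRP total traffic: {grp_total:.3f}x baseline")
```

Under this reading, SRP's total traffic comes to 2.8 times the baseline, which matches "nearly tripling", and an eighth of 180% is 22.5%, which is consistent with the reported 23% increase for GRP after rounding.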