Effective Hardware-Based Data Prefetching for High-Performance Processors

Authors:
Jean-Loup Baer;Tien-Fu Chen
Affiliations:
-;-
Venue:
IEEE Transactions on Computers
Year:
1995

Abstract

Memory latency and bandwidth are progressing at a much slower pace than processor performance. In this paper, we describe and evaluate the performance of three variations of a hardware function unit whose goal is to assist a data cache in prefetching data accesses so that memory latency is hidden as often as possible. The basic idea of the prefetching scheme is to keep track of data access patterns in a Reference Prediction Table (RPT) organized as an instruction cache. The three designs differ mostly in the timing of the prefetching. In the simplest scheme (basic), prefetches can be generated one iteration ahead of actual use. The lookahead variation takes advantage of a lookahead program counter that ideally stays one memory latency time ahead of the real program counter and that is used as the control mechanism to generate the prefetches. Finally, the correlated scheme uses a more sophisticated design to detect patterns across loop levels.

These designs are evaluated by simulating the ten SPEC benchmarks on a cycle-by-cycle basis. The results show that 1) the three hardware prefetching schemes all yield significant reductions in the data access penalty when compared with regular caches, 2) the benefits are greater when the hardware assist augments small on-chip caches, and 3) the lookahead scheme is the preferred one in terms of cost-performance.
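The idea behind the basic scheme, a table indexed by instruction address that remembers each memory instruction's previous data address and predicts the next one, can be illustrated with a small software model. This is only a sketch under stated assumptions, not the design evaluated in the paper: the abstract does not give the table's fields or its update rules, so the entry layout, the "same non-zero stride twice in a row" confirmation rule, the unbounded dictionary standing in for a cache-like table, and the names `RptEntry`, `ReferencePredictionTable`, and `demo` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RptEntry:
    prev_addr: int        # last data address issued by this instruction
    stride: int = 0       # last observed difference between addresses
    steady: bool = False  # True once the same non-zero stride repeats


class ReferencePredictionTable:
    """Toy stride predictor keyed by the PC of a load or store.

    Assumption: an unbounded dict replaces the finite, cache-like
    table described in the abstract.
    """

    def __init__(self) -> None:
        self.entries: Dict[int, RptEntry] = {}

    def access(self, pc: int, addr: int) -> Optional[int]:
        """Record one access; return an address to prefetch, or None."""
        entry = self.entries.get(pc)
        if entry is None:
            # First time this instruction is seen: nothing to predict yet.
            self.entries[pc] = RptEntry(prev_addr=addr)
            return None
        new_stride = addr - entry.prev_addr
        # Assumed rule: predict only after the same non-zero stride
        # has been observed on two consecutive accesses.
        entry.steady = new_stride == entry.stride and new_stride != 0
        entry.stride = new_stride
        entry.prev_addr = addr
        # Prefetch the address expected on the next iteration.
        return addr + new_stride if entry.steady else None


def demo() -> List[Optional[int]]:
    rpt = ReferencePredictionTable()
    # One load at a hypothetical PC walking an array of 8-byte elements.
    return [rpt.access(pc=0x400, addr=0x1000 + 8 * i) for i in range(5)]


if __name__ == "__main__":
    print(demo())
```

In this model the first two accesses only train the entry. From the third access on, each call returns the address expected one iteration later, which matches the "one iteration ahead" timing the abstract attributes to the basic scheme. The lookahead and correlated variations are not modeled here.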