Dynamic hot data stream prefetching for general-purpose programs

Authors:
Trishul M. Chilimbi; Martin Hirzel
Affiliations:
Microsoft Research, Redmond, WA; University of Colorado, Boulder, CO
Venue:
PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Year:
2002

Abstract

Prefetching data ahead of use has the potential to tolerate the growing processor-memory performance gap by overlapping long-latency memory accesses with useful computation. While sophisticated prefetching techniques have been automated for limited domains, such as scientific codes that access dense arrays in loop nests, a similar level of success has eluded general-purpose programs, especially pointer-chasing codes written in languages such as C and C++. We address this problem by describing, implementing, and evaluating a dynamic prefetching scheme. Our technique runs on stock hardware, is completely automatic, and works for general-purpose programs, including pointer-chasing codes written in weakly typed languages such as C and C++. It operates in three phases. First, the profiling phase gathers a temporal data reference profile from a running program with low overhead. Next, in the analysis and optimization phase, profiling is turned off and a fast analysis algorithm extracts hot data streams, which are data reference sequences that frequently repeat in the same order, from the temporal profile; the system then dynamically injects code at appropriate program points to detect and prefetch these hot data streams. Finally, the process enters the hibernation phase, in which no profiling or analysis is performed and the program continues to execute with the added prefetch instructions. At the end of the hibernation phase, the program is de-optimized to remove the inserted checks and prefetch instructions, and control returns to the profiling phase. For long-running programs, this profile, analyze-and-optimize, hibernate cycle repeats multiple times. Our initial results from applying dynamic prefetching are promising, indicating overall execution time improvements of 5-19% for several memory-performance-limited SPECint2000 benchmarks running their largest (ref) inputs.
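
As a rough illustration of the idea in the abstract, the Python sketch below finds reference sequences that repeat in the same order and then uses the head of such a sequence to trigger a prefetch of its tail. It is not the paper's analysis algorithm or its injected code: plain fixed-length window counting is swapped in for the fast analysis, and the names `hot_data_streams`, `PrefixPrefetcher`, `length`, `min_count`, and `head_len` are hypothetical, chosen only for this example.

```python
from collections import Counter


def hot_data_streams(trace, length=4, min_count=2):
    """Return every window of `length` references that occurs at least
    `min_count` times in `trace` (a simple stand-in for hot-stream analysis)."""
    counts = Counter(
        tuple(trace[i:i + length]) for i in range(len(trace) - length + 1)
    )
    return [stream for stream, n in counts.items() if n >= min_count]


class PrefixPrefetcher:
    """Watch the reference stream; when the last `head_len` references match
    the head of a hot stream, report the tail as addresses to prefetch."""

    def __init__(self, streams, head_len=2):
        self.head_len = head_len
        # Assumption: if two streams share a head, the later one wins.
        self.tails = {s[:head_len]: s[head_len:] for s in streams}
        self.recent = ()

    def observe(self, ref):
        # Keep only the most recent `head_len` references.
        self.recent = (self.recent + (ref,))[-self.head_len:]
        return self.tails.get(self.recent, ())


if __name__ == "__main__":
    # Hypothetical trace of data references (symbolic addresses).
    trace = ["a", "b", "c", "d", "x", "a", "b", "c", "d", "y"]
    prefetcher = PrefixPrefetcher(hot_data_streams(trace))
    for ref in ["q", "a", "b"]:
        print(ref, "->", prefetcher.observe(ref))
```

In a real system the returned tail would be turned into prefetch instructions for those addresses; here it is only returned so the matching logic can be inspected.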