Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors

Authors:
Ilya Ganusov;Martin Burtscher
Affiliations:
Computer Systems Laboratory, Cornell University;Computer Systems Laboratory, Cornell University
Venue:
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Year:
2005

Abstract

This paper proposes a new hardware technique for using one core of a CMP to prefetch data for a thread running on another core. Our approach simply executes a copy of all non-control instructions in the prefetching core after they have executed in the primary core. On the way to the second core, each instruction's output is replaced by a prediction of the likely output that the nth future instance of this instruction will produce. Speculatively executing the resulting instruction stream on the second core issues load requests that the main program will probably reference in the future. Unlike previously proposed thread-based prefetching approaches, our technique does not need any thread spawning points, features an adjustable lookahead distance, does not require complicated analyzers to extract prefetching threads, is recovery-free, and necessitates no storage for the prefetching threads. We demonstrate that for the SPECcpu2000 benchmark suite, our mechanism significantly increases the prefetching coverage and improves the primary core's performance by 10% on average over a baseline that already includes an aggressive hardware stream prefetcher. We further show that our approach works well in combination with runahead execution.
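
To make the central step of the abstract concrete (replacing each instruction's output with a prediction of what its nth future instance will produce), here is a minimal software sketch. It is an illustration under stated assumptions, not the authors' hardware design: the abstract does not say which value predictor is used, so a simple per-instruction stride predictor is assumed here, and the names `FutureValuePredictor`, `StrideEntry`, `update_and_predict`, and `lookahead` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StrideEntry:
    """Per-instruction history: last committed value and last observed stride."""
    last_value: int
    stride: int = 0


class FutureValuePredictor:
    """Hypothetical stride predictor (an assumption, not taken from the paper)."""

    def __init__(self, lookahead: int):
        # lookahead is the adjustable distance n: how many dynamic instances
        # of the same instruction to predict ahead.
        self.lookahead = lookahead
        self.table = {}  # maps instruction address (pc) -> StrideEntry

    def update_and_predict(self, pc: int, value: int) -> int:
        """Record a result committed by the primary core and return the value
        predicted for the nth future instance of the same instruction."""
        entry = self.table.get(pc)
        if entry is None:
            # First time this instruction is seen: no stride is known yet.
            entry = StrideEntry(last_value=value)
            self.table[pc] = entry
        else:
            entry.stride = value - entry.last_value
            entry.last_value = value
        return value + self.lookahead * entry.stride


def future_addresses(committed_values, pc=0x400, lookahead=4):
    """Feed a sequence of committed results for one instruction through the
    predictor and collect the values forwarded to the prefetching core."""
    predictor = FutureValuePredictor(lookahead)
    return [predictor.update_and_predict(pc, v) for v in committed_values]
```

For an address-producing instruction whose results advance by 8 bytes per iteration, `future_addresses([1000, 1008, 1016])` forwards the values 1000, 1040, and 1048. In this sketch the second core would then execute dependent loads on addresses about four iterations ahead of the primary core. Raising `lookahead` increases the distance, which corresponds to the adjustable lookahead the abstract mentions.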