Dynamic speculative precomputation

Authors:
Jamison D. Collins;Dean M. Tullsen;Hong Wang;John P. Shen
Affiliations:
University of California, San Diego, La Jolla, CA;University of California, San Diego, La Jolla, CA;Intel Corporation, Santa Clara, CA;Intel Corporation, Santa Clara, CA
Venue:
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Year:
2001

Citing 21
Cited 70

Compilers: principles, techniques, and tools

Compilers: principles, techniques, and tools
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors

Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
Simultaneous multithreading: maximizing on-chip parallelism

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
An effective programmable prefetch engine for on-chip caches

Proceedings of the 28th annual international symposium on Microarchitecture
Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Olden: parallelizing programs with dynamic data structures on distributed-memory machines

Olden: parallelizing programs with dynamic data structures on distributed-memory machines
Trace cache: a low latency approach to high bandwidth instruction fetching

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Prefetching using Markov predictors

Proceedings of the 24th annual international symposium on Computer architecture
Putting the fill unit to work: dynamic optimizations for trace cache microprocessors

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Dependence based prefetching for linked data structures

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Instruction path coprocessors

Proceedings of the 27th annual international symposium on Computer architecture
Slice-processors: an implementation of operation-based prediction

ICS '01 Proceedings of the 15th international conference on Supercomputing
Execution-based prediction using speculative slices

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Slipstream processors: improving both performance and fault tolerance

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Speculative precomputation: long-range prefetching of delinquent loads

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Data prefetching by dependence graph precomputation

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Handling long-latency loads in a simultaneous multithreading processor

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Speculative Data-Driven Multithreading

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Dynamic Prediction of Critical Path Instructions

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture

Post-pass binary adaptation for software-based speculative precomputation

PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Difficult-path branch prediction using subordinate microthreads

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Design and evaluation of compiler algorithms for pre-execution

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
A Decoupled Predictor-Directed Stream Prefetching Architecture

IEEE Transactions on Computers
Quantifying Instruction Criticality

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Transparent Threads: Resource Sharing in SMT Processors for High Single-Thread Performance

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
A Programmable Memory Hierarchy for Prefetching Linked Data Structures

ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Pointer cache assisted prefetching

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Microarchitectural support for precomputation microthreads

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
A quantitative framework for automated pre-execution thread selection

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
A framework for modeling and optimization of prescient instruction prefetch

SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Recycling waste: exploiting wrong-path execution to improve branch prediction

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Slipstream Execution Mode for CMP-Based Multiprocessors

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Beating in-order stalls with "flea-flicker" two-pass pipelining

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
A Simple Mechanism for Detecting Ineffectual Instructions in Slipstream Processors

IEEE Transactions on Computers
A first glance at Kilo-instruction based multiprocessors

Proceedings of the 1st conference on Computing frontiers
Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
A general framework for prefetch scheduling in linked data structures and its application to multi-chain prefetching

ACM Transactions on Computer Systems (TOCS)
Data forwarding through in-memory precomputation threads

Proceedings of the 18th annual international conference on Supercomputing
SMTp: An Architecture for Next-generation Scalable Multi-threading

Proceedings of the 31st annual international symposium on Computer architecture
A study of source-level compiler algorithms for automatic construction of pre-execution code

ACM Transactions on Computer Systems (TOCS)
Helper threads via virtual multithreading on an experimental itanium® 2 processor-based platform

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Compiler orchestrated prefetching via speculation and predication

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Decoupled Software Pipelining with the Synchronization Array

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Area and System Clock Effects on SMT/CMP Throughput

IEEE Transactions on Computers
Dynamic Strands: Collapsing Speculative Dependence Chains for Reducing Pipeline Communication

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Helper Threads via Virtual Multithreading

IEEE Micro
Toward kilo-instruction processors

ACM Transactions on Architecture and Code Optimization (TACO)
Runtime support for integrating precomputation and thread-level parallelism on simultaneous multithreaded processors

LCR '04 Proceedings of the 7th workshop on Workshop on languages, compilers, and run-time support for scalable systems
Temporal Streaming of Shared Memory

Proceedings of the 32nd annual international symposium on Computer Architecture
Energy-Effectiveness of Pre-Execution and Energy-Aware P-Thread Selection

Proceedings of the 32nd annual international symposium on Computer Architecture
Techniques for Efficient Processing in Runahead Execution Engines

Proceedings of the 32nd annual international symposium on Computer Architecture
High-Performance Throughput Computing

IEEE Micro
Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Speculative execution for hiding memory latency

MEDEA '04 Proceedings of the 2004 workshop on MEmory performance: DEaling with Applications , systems and architecture
Beating In-Order Stalls with "Flea-Flicker" Two-Pass Pipelining

IEEE Transactions on Computers
Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance

IEEE Micro
A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework

Proceedings of the International Symposium on Code Generation and Optimization
Kilo-instruction processors, runahead and prefetching

Proceedings of the 3rd conference on Computing frontiers
Program Demultiplexing: Data-flow based Speculative Parallelization of Methods in Sequential Programs

Proceedings of the 33rd annual international symposium on Computer Architecture
HeapMon: a helper-thread approach to programmable, automatic, and low-overhead memory bug detection

IBM Journal of Research and Development
Efficient emulation of hardware prefetchers via event-driven helper threading

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Data-Driven Multithreading Using Conventional Microprocessors

IEEE Transactions on Parallel and Distributed Systems
Speculative pre-execution assisted by compiler (SPEAR)

Journal of Parallel and Distributed Computing - Special issue on parallel bioinspired algorithms
Design and evaluation of a hierarchical decoupled architecture

The Journal of Supercomputing
SlicK: slice-based locality exploitation for efficient redundant multithreading

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Future execution: A prefetching mechanism that uses multiple cores to speed up single threads

ACM Transactions on Architecture and Code Optimization (TACO)
Accelerating sequential programs on Chip Multiprocessors via Dynamic Prefetching Thread

Microprocessors & Microsystems
Ubiquitous memory introspection

Proceedings of the International Symposium on Code Generation and Optimization
Prefetching irregular references for software cache on cell

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Server-based data push architecture for multi-processor environments

Journal of Computer Science and Technology
Hiding cache miss penalty using priority-based execution for embedded processors

Proceedings of the conference on Design, automation and test in Europe
A low-complexity microprocessor design with speculative pre-execution

Journal of Systems Architecture: the EUROMICRO Journal
Just-In-Time Locality and Percolation for Optimizing Irregular Applications on a Manycore Architecture

Languages and Compilers for Parallel Computing
Dynamic performance tuning for speculative threads

Proceedings of the 36th annual international symposium on Computer architecture
Reducing register file size through instruction pre-execution enhanced by value prediction

ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
Software data spreading: leveraging distributed caches to improve single thread performance

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Dynamic warp subdivision for integrated branch and memory divergence tolerance

Proceedings of the 37th annual international symposium on Computer architecture
Speculative-aware execution: a simple and efficient technique for utilizing multi-cores to improve single-thread performance

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Helper thread prefetching for loosely-coupled multiprocessor systems

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Inter-core prefetching for multicore processors using migrating helper threads

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Analysis and performance results of computing betweenness centrality on IBM Cyclops64

The Journal of Supercomputing
Sprint: speculative prefetching of remote data

Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
When Prefetching Works, When It Doesn’t, and Why

ACM Transactions on Architecture and Code Optimization (TACO)
A hybrid hardware/software generated prefetching thread mechanism on chip multiprocessors

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
smt-SPRINTS: software precomputation with intelligent streaming for resource-constrained SMTs

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Dynamically dispatching speculative threads to improve sequential execution

ACM Transactions on Architecture and Code Optimization (TACO)
Coalition threading: combining traditional andnon-traditional parallelism to maximize scalability

Proceedings of the 21st international conference on Parallel architectures and compilation techniques

Quantified Score

Hi-index	0.01

Visualization

Abstract

A large number of memory accesses in memory-bound applications are irregular, such as pointer dereferences, and can be effectively targeted by thread-based prefetching techniques like Speculative Precomputation. These techniques execute instructions, for example on an available SMT thread context, that have been extracted directly from the program they are trying to accelerate. Proposed techniques typically require manual user intervention to extract and optimize instruction sequences. This paper proposes Dynamic Speculative Precomputation, which performs all necessary instruction analysis, extraction, and optimization through the use of back-end instruction analysis hardware, located off the processor's critical path. For a set of memory limited benchmarks an average speedup of 14% is achieved when constructing simple p-slices, and this gain grows to 33% when making use of aggressive optimizations.