Speculative precomputation: long-range prefetching of delinquent loads

Authors:
Jamison D. Collins;Hong Wang;Dean M. Tullsen;Christopher Hughes;Yong-Fong Lee;Dan Lavery;John P. Shen
Affiliations:
Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA;Microprocessor Research Lab, Intel Corporation, Santa Clara, CA;Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA;Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL;Microcomputer Software Lab, Intel Corporation, Santa Clara, CA;Microcomputer Software Lab, Intel Corporation, Santa Clara, CA;Microprocessor Research Lab, Intel Corporation, Santa Clara, CA
Venue:
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Year:
2001

Citing 17
Cited 94

Implementing stack simulation for highly-associative memories

SIGMETRICS '91 Proceedings of the 1991 ACM SIGMETRICS conference on Measurement and modeling of computer systems
A comparative analysis of schemes for correlated branch prediction

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Simultaneous multithreading: maximizing on-chip parallelism

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Olden: parallelizing programs with dynamic data structures on distributed-memory machines

Olden: parallelizing programs with dynamic data structures on distributed-memory machines
Prefetching using Markov predictors

Proceedings of the 24th annual international symposium on Computer architecture
Threaded multiple path execution

Proceedings of the 25th annual international symposium on Computer architecture
Dependence based prefetching for linked data structures

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Simultaneous subordinate microthreading (SSMT)

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Understanding the backward slices of performance degrading instructions

Proceedings of the 27th annual international symposium on Computer architecture
Slipstream processors: improving both performance and fault tolerance

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Introducing the IA-64 Architecture

IEEE Micro
Itanium Processor Microarchitecture

IEEE Micro
The Intel IA-64 Compiler Code Generator

IEEE Micro
An Advanced Optimizer for the IA-64 Architecture

IEEE Micro
Speculative Data-Driven Multithreading

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Aspects of cache memory and instruction buffer performance

Aspects of cache memory and instruction buffer performance

Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Data prefetching by dependence graph precomputation

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Post-pass binary adaptation for software-based speculative precomputation

PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching

PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Difficult-path branch prediction using subordinate microthreads

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Dynamic speculative precomputation

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Handling long-latency loads in a simultaneous multithreading processor

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Design and evaluation of compiler algorithms for pre-execution

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
A Decoupled Predictor-Directed Stream Prefetching Architecture

IEEE Transactions on Computers
Quantifying Instruction Criticality

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Transparent Threads: Resource Sharing in SMT Processors for High Single-Thread Performance

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
A Programmable Memory Hierarchy for Prefetching Linked Data Structures

ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions

Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
Value-Profile Guided Stride Prefetching for Irregular Code

CC '02 Proceedings of the 11th International Conference on Compiler Construction
Pointer cache assisted prefetching

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Microarchitectural support for precomputation microthreads

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
A framework for modeling and optimization of prescient instruction prefetch

SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Recycling waste: exploiting wrong-path execution to improve branch prediction

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Enhancing memory level parallelism via recovery-free value prediction

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Slipstream Execution Mode for CMP-Based Multiprocessors

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Guided region prefetching: a cooperative hardware/software approach

Proceedings of the 30th annual international symposium on Computer architecture
The Performance of Runtime Data Cache Prefetching in a Dynamic Optimization System

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Beating in-order stalls with "flea-flicker" two-pass pipelining

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Fighting the memory wall with assisted execution

Proceedings of the 1st conference on Computing frontiers
Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Static Identification of Delinquent Loads

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
A general framework for prefetch scheduling in linked data structures and its application to multi-chain prefetching

ACM Transactions on Computer Systems (TOCS)
Prefetch injection based on hardware monitoring and object metadata

Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Microarchitecture Optimizations for Exploiting Memory-Level Parallelism

Proceedings of the 31st annual international symposium on Computer architecture
A study of source-level compiler algorithms for automatic construction of pre-execution code

ACM Transactions on Computer Systems (TOCS)
Interaction cost and shotgun profiling

ACM Transactions on Architecture and Code Optimization (TACO)
Helper threads via virtual multithreading on an experimental itanium® 2 processor-based platform

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Compiler orchestrated prefetching via speculation and predication

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Decoupled Software Pipelining with the Synchronization Array

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Helper Threads via Virtual Multithreading

IEEE Micro
Tolerating memory latency through push prefetching for pointer-intensive applications

ACM Transactions on Architecture and Code Optimization (TACO)
The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture

IEEE Transactions on Parallel and Distributed Systems
Threads cannot be implemented as a library

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Mitosis compiler: an infrastructure for speculative threading based on pre-computation slices

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Energy-Effectiveness of Pre-Execution and Energy-Aware P-Thread Selection

Proceedings of the 32nd annual international symposium on Computer Architecture
Enhancing Memory-Level Parallelism via Recovery-Free Value Prediction

IEEE Transactions on Computers
Improving database performance on simultaneous multithreading processors

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Multigrain parallel Delaunay Mesh generation: challenges and opportunities for multithreaded architectures

Proceedings of the 19th annual international conference on Supercomputing
Store-Ordered Streaming of Shared Memory

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Design and Implementation of a Compiler Framework for Helper Threading on Multi-core Processors

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
"Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Stream Programming on General-Purpose Processors

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Beating In-Order Stalls with "Flea-Flicker" Two-Pass Pipelining

IEEE Transactions on Computers
A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework

Proceedings of the International Symposium on Code Generation and Optimization
Speculative pre-execution assisted by compiler (SPEAR)

Journal of Parallel and Distributed Computing - Special issue on parallel bioinspired algorithms
Design and evaluation of a hierarchical decoupled architecture

The Journal of Supercomputing
An efficient implementation of a 3D wavelet transform based encoder on hyper-threading technology

Parallel Computing
Accelerating sequential programs on Chip Multiprocessors via Dynamic Prefetching Thread

Microprocessors & Microsystems
Hybrid multi-core architecture for boosting single-threaded performance

ACM SIGARCH Computer Architecture News
Hardware support for software controlled multithreading

ACM SIGARCH Computer Architecture News
Function level parallelism driven by data dependencies

ACM SIGARCH Computer Architecture News
Ubiquitous memory introspection

Proceedings of the International Symposium on Code Generation and Optimization
Optimization of data prefetch helper threads with path-expression based statistical modeling

Proceedings of the 21st annual international conference on Supercomputing
Data prefetching and address pre-calculation through instruction pre-execution with two-step physical register deallocation

MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
Exploring the performance limits of simultaneous multithreading for memory intensive applications

The Journal of Supercomputing
Accurate branch prediction for short threads

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Prefetching irregular references for software cache on cell

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Focused prefetching: performance oriented prefetching based on commit stalls

Proceedings of the 22nd annual international conference on Supercomputing
Performance scalability of decoupled software pipelining

ACM Transactions on Architecture and Code Optimization (TACO)
Distilling the essence of proprietary workloads into miniature benchmarks

ACM Transactions on Architecture and Code Optimization (TACO)
An Operating System Architecture for Organic Computing in Embedded Real-Time Systems

ATC '08 Proceedings of the 5th international conference on Autonomic and Trusted Computing
A low-complexity microprocessor design with speculative pre-execution

Journal of Systems Architecture: the EUROMICRO Journal
Memory-level parallelism aware fetch policies for simultaneous multithreading processors

ACM Transactions on Architecture and Code Optimization (TACO)
A performance-correctness explicitly-decoupled architecture

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Exploiting Speculative TLP in Recursive Programs by Dynamic Thread Prediction

CC '09 Proceedings of the 18th International Conference on Compiler Construction: Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2009
A complexity-effective microprocessor design with decoupled dispatch queues and prefetching

Parallel Computing
Combining thread level speculation helper threads and runahead execution

Proceedings of the 23rd international conference on Supercomputing
Algorithm, software, and hardware optimizations for Delaunay mesh generation on simultaneous multithreaded architectures

Journal of Parallel and Distributed Computing
Boosting single-thread performance in multi-core systems through fine-grain multi-threading

Proceedings of the 36th annual international symposium on Computer architecture
COMPASS: a programmable data prefetcher using idle GPU shaders

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Chameleon: Virtualizing idle acceleration cores of a heterogeneous multicore processor for caching and prefetching

ACM Transactions on Architecture and Code Optimization (TACO)
Reducing misspeculation penalty in trace-level speculative multithreaded architectures

ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
Reducing register file size through instruction pre-execution enhanced by value prediction

ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
Software data spreading: leveraging distributed caches to improve single thread performance

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Speculative-aware execution: a simple and efficient technique for utilizing multi-cores to improve single-thread performance

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Efficient runahead threads

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Inter-core prefetching for multicore processors using migrating helper threads

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Analysis and performance results of computing betweenness centrality on IBM Cyclops64

The Journal of Supercomputing
Exploring the capacity of a modern SMT architecture to deliver high scientific application performance

HPCC'06 Proceedings of the Second international conference on High Performance Computing and Communications
A low-complexity issue queue design with speculative pre-execution

HiPC'05 Proceedings of the 12th international conference on High Performance Computing
Design and effectiveness of small-sized decoupled dispatch queues

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
A hybrid hardware/software generated prefetching thread mechanism on chip multiprocessors

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
smt-SPRINTS: software precomputation with intelligent streaming for resource-constrained SMTs

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Targeted data prefetching

ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
Mixed speculative multithreaded execution models

ACM Transactions on Architecture and Code Optimization (TACO)
Coalition threading: combining traditional andnon-traditional parallelism to maximize scalability

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Automatic Skeleton-Driven Memory Affinity for Transactional Worklist Applications

International Journal of Parallel Programming

Quantified Score

Hi-index	0.01

Visualization

Abstract

This paper explores Speculative Precomputation, a technique that uses idle thread context in a multithreaded architecture to improve performance of single-threaded applications. It attacks program stalls from data cache misses by pre-computing future memory accesses in available thread contexts, and prefetching these data. This technique is evaluated by simulating the performance of a research processor based on the Itanium™ ISA supporting Simultaneous Multithreading. Two primary forms of Speculative Precomputation are evaluated. If only the non-speculative thread spawns speculative threads, performance gains of up to 30% are achieved when assuming ideal hardware. However, this speedup drops considerably with more realistic hardware assumptions. Permitting speculative threads to directly spawn additional speculative threads reduces the overhead associated with spawning threads and enables significantly more aggressive speculation, overcoming this limitation. Even with realistic costs for spawning threads, speedups as high as 169% are achieved, with an average speedup of 76%.