Beating in-order stalls with "flea-flicker" two-pass pipelining

Authors:
Ronald D. Barnes;Erik M. Nystrom;John W. Sias;Sanjay J. Patel;Nacho Navarro;Wen-mei W. Hwu
Affiliations:
Center for Reliable and High-Performance Computing, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign;Center for Reliable and High-Performance Computing, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign;Center for Reliable and High-Performance Computing, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign;Center for Reliable and High-Performance Computing, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign;Center for Reliable and High-Performance Computing, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign;Center for Reliable and High-Performance Computing, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
Venue:
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2003

Citing 17
Cited 18

Checkpoint repair for out-of-order execution machines

ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture
Sentinel scheduling: a model for compiler-controlled speculative execution

ACM Transactions on Computer Systems (TOCS)
Dynamic memory disambiguation using the memory conflict buffer

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Improving data cache performance by pre-executing instructions under a cache miss

ICS '97 Proceedings of the 11th international conference on Supercomputing
Integrated predicated and speculative execution in the IMPACT EPIC architecture

Proceedings of the 25th annual international symposium on Computer architecture
Simultaneous subordinate microthreading (SSMT)

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
PIPE: a VLSI decoupled architecture

ISCA '85 Proceedings of the 12th annual international symposium on Computer architecture
A study of slipstream processors

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
OS and compiler considerations in the design of the IA-64 architecture

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Speculative precomputation: long-range prefetching of delinquent loads

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Data prefetching by dependence graph precomputation

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Dynamic speculative precomputation

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
The Alpha 21264 Microprocessor

IEEE Micro
Master/slave speculative parallelization

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Memory Latency-Tolerance Approaches for Itanium Processors: Out-of-Order Execution vs.Speculative Precomputation

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research

IEEE Computer Architecture Letters

Field-testing IMPACT EPIC research results in Itanium 2

Proceedings of the 31st annual international symposium on Computer architecture
Decoupled Software Pipelining with the Synchronization Array

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Continuous Optimization

Proceedings of the 32nd annual international symposium on Computer Architecture
Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Automatic Thread Extraction with Decoupled Software Pipelining

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
"Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Beating In-Order Stalls with "Flea-Flicker" Two-Pass Pipelining

IEEE Transactions on Computers
Performance scalability of decoupled software pipelining

ACM Transactions on Architecture and Code Optimization (TACO)
Skewed redundancy

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
A performance-correctness explicitly-decoupled architecture

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Combining thread level speculation helper threads and runahead execution

Proceedings of the 23rd international conference on Supercomputing
Simultaneous speculative threading: a novel pipeline architecture implemented in sun's rock processor

Proceedings of the 36th annual international symposium on Computer architecture
Increasing memory miss tolerance for SIMD cores

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Necromancer: enhancing system throughput by animating dead cores

Proceedings of the 37th annual international symposium on Computer architecture
Analysis of execution efficiency in the microthreaded processor UTLEON3

ARCS'11 Proceedings of the 24th international conference on Architecture of computing systems
Mixed speculative multithreaded execution models

ACM Transactions on Architecture and Code Optimization (TACO)
Tuning the continual flow pipeline architecture

Proceedings of the 27th international ACM conference on International conference on supercomputing
Tuning the continual flow pipeline architecture with virtual register renaming

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Accommodating the uncertain latency of load instructionsis one of the most vexing problems in in-order microarchitecturedesign and compiler development. Compilers cangenerate schedules with a high degree of instruction-levelparallelism but cannot effectively accommodate unanticipatedlatencies; incorporating traditional out-of-order executioninto the microarchitecture hides some of this latencybut redundantly performs work done by the compiler andadds additional pipeline stages. Although effective techniques,such as prefetching and threading, have been proposedto deal with anticipable, long-latency misses, theshorter, more diffuse stalls due to difficult-to-anticipate,first- or second-level misses are less easily hidden on in-orderarchitectures. This paper addresses this problemby proposing a microarchitectural technique, referred toas two-pass pipelining, wherein the program executes ontwo in-order back-end pipelines coupled by a queue. The"advance" pipeline executes instructions greedily, withoutstalling on unanticipated latency dependences (executingindependent instructions while otherwise blocking instructionsare deferred). The "backup" pipeline allows concurrentresolution of instructions that were deferred in theother pipeline, resulting in the absorption of shorter missesand the overlap of longer ones. This paper argues that thisdesign is both achievable and a good use of transistor resourcesand shows results indicating that it can deliver significantspeedups for in-order processor designs.