"Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense

Authors:
Ronald D. Barnes;Shane Ryoo;Wen-mei W. Hwu
Affiliations:
George Mason University;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign
Venue:
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2005

Citing 22
Cited 5

Improving data cache performance by pre-executing instructions under a cache miss

ICS '97 Proceedings of the 11th international conference on Supercomputing
Complexity-effective superscalar processors

Proceedings of the 24th annual international symposium on Computer architecture
Power considerations in the design of the Alpha 21264 microprocessor

DAC '98 Proceedings of the 35th annual Design Automation Conference
Simultaneous subordinate microthreading (SSMT)

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Dynamic memory disambiguation in the presence of out-of-order store issuing

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Wattch: a framework for architectural-level power analysis and optimizations

Proceedings of the 27th annual international symposium on Computer architecture
Execution-based prediction using speculative slices

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Speculative precomputation: long-range prefetching of delinquent loads

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Locality vs. criticality

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
The Alpha 21264 Microprocessor

IEEE Micro
Itanium 2 Processor Microarchitecture

IEEE Micro
The AMD Opteron Processor for Multiprocessor Servers

IEEE Micro
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling

Proceedings of the 30th annual international symposium on Computer architecture
Register Renaming and Scheduling for Dynamic Execution of Predicated Code

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Speculative Data-Driven Multithreading

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Beating in-order stalls with "flea-flicker" two-pass pipelining

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Fighting the memory wall with assisted execution

Proceedings of the 1st conference on Computing frontiers
Microarchitecture Optimizations for Exploiting Memory-Level Parallelism

Proceedings of the 31st annual international symposium on Computer architecture
Field-testing IMPACT EPIC research results in Itanium 2

Proceedings of the 31st annual international symposium on Computer architecture
Continual flow pipelines

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems

Tolerating Cache-Miss Latency with Multipass Pipelines

IEEE Micro
Using fine grain multithreading for energy efficient computing

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Performance scalability of decoupled software pipelining

ACM Transactions on Architecture and Code Optimization (TACO)
OUTRIDER: efficient memory latency tolerance with decoupled strands

Proceedings of the 38th annual international symposium on Computer architecture
Tuning the continual flow pipeline architecture with virtual register renaming

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

As microprocessor designs become increasingly powerand complexity-conscious, future microarchitectures must decrease their reliance on expensive dynamic scheduling structures. While compilers have generally proven adept at planning useful static instruction-level parallelism, relying solely on the compiler驴s instruction execution arrangement performs poorly when cache misses occur, because variable latency is not well tolerated. This paper proposes a new microarchitectural model, multipass pipelining, that exploits meticulous compile-time scheduling on simple in-order hardware while achieving excellent cache miss tolerance through persistent advance preexecution beyond otherwise stalled instructions. The pipeline systematically makes multiple passes through instructions that follow a stalled instruction. Each pass increases the speed and energy efficiency of the subsequent ones by preserving computed results. The concept of multiple passes and successive improvement of efficiency across passes in a single pipeline distinguishes multipass pipelining from other runahead schemes. Simulation results show that the multipass technique achieves 77% of the cycle reduction of aggressive out-of-order execution relative to in-order execution. In addition, microarchitectural-level power simulation indicates that benefits of multipass are achieved at a fraction of the power overhead of full dynamic scheduling.