Continual flow pipelines

Authors:
Srikanth T. Srinivasan;Ravi Rajwar;Haitham Akkary;Amit Gandhi;Mike Upton
Affiliations:
Intel Corporation;Intel Corporation;Intel Corporation;Intel Corporation;Intel Corporation
Venue:
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Year:
2004

Citing 18
Cited 70

Multiscalar processors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Register renaming and dynamic speculation: an alternative approach

MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Improving data cache performance by pre-executing instructions under a cache miss

ICS '97 Proceedings of the 11th international conference on Supercomputing
DataScalar architectures

Proceedings of the 24th annual international symposium on Computer architecture
Memory dependence prediction using store sets

Proceedings of the 25th annual international symposium on Computer architecture
A dynamic multithreading processor

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Simultaneous subordinate microthreading (SSMT)

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Multiple-banked register file architectures

Proceedings of the 27th annual international symposium on Computer architecture
Execution-based prediction using speculative slices

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Dynamically allocating processor resources between nearby and distant ILP

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
A large, fast instruction window for tolerating cache misses

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Reducing the complexity of the register file in dynamic superscalar processors

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Cherry: checkpointed early resource recycling in out-of-order microprocessors

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture

Proceedings of the 30th annual international symposium on Computer architecture
Speculative Data-Driven Multithreading

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Out-of-Order Commit Processors

HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture

Continual Flow Pipelines: Achieving Resource-Efficient Latency Tolerance

IEEE Micro
Toward kilo-instruction processors

ACM Transactions on Architecture and Code Optimization (TACO)
Scalable Load and Store Processing in Latency Tolerant Processors

Proceedings of the 32nd annual international symposium on Computer Architecture
Instruction packing: reducing power and delay of the dynamic scheduling logic

ISLPED '05 Proceedings of the 2005 international symposium on Low power electronics and design
Kilo-Instruction Processors: Overcoming the Memory Wall

IEEE Micro
Store-Ordered Streaming of Shared Memory

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
A Simple Divide-and-Conquer Approach for Neural-Class Branch Prediction

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Power-Efficient Wakeup Tag Broadcast

ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Scalable Store-Load Forwarding via Store Queue Index Prediction

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Cherry-MP: Correctly Integrating Checkpointed Early Resource Recycling in Chip Multiprocessors

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
"Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Scalable Load and Store Processing in Latency-Tolerant Processors

IEEE Micro
Kilo-instruction processors, runahead and prefetching

Proceedings of the 3rd conference on Computing frontiers
Instruction packing: Toward fast and energy-efficient instruction scheduling

ACM Transactions on Architecture and Code Optimization (TACO)
CAVA: Using checkpoint-assisted value prediction to hide L2 misses

ACM Transactions on Architecture and Code Optimization (TACO)
Adaptive reorder buffers for SMT processors

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
SPARTAN: speculative avoidance of register allocations to transient values for performance and energy efficiency

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
A simple speculative load control mechanism for energy saving

MEDEA '06 Proceedings of the 2006 workshop on MEmory performance: DEaling with Applications, systems and architectures
Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
NoSQ: Store-Load Communication without a Store Queue

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Scalable Cache Miss Handling for High Memory-Level Parallelism

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Distributed Microarchitectural Protocols in the TRIPS Prototype Processor

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
A comparison of two policies for issuing instructions speculatively

Journal of Systems Architecture: the EUROMICRO Journal
BulkSC: bulk enforcement of sequential consistency

Proceedings of the 34th annual international symposium on Computer architecture
Matrix scheduler reloaded

Proceedings of the 34th annual international symposium on Computer architecture
Late-binding: enabling unordered load-store queues

Proceedings of the 34th annual international symposium on Computer architecture
Ginger: control independence using tag rewriting

Proceedings of the 34th annual international symposium on Computer architecture
Transparent control independence (TCI)

Proceedings of the 34th annual international symposium on Computer architecture
NoSQ: Store-Load Communication without a Store Queue

IEEE Micro
An L2-miss-driven early register deallocation for SMT processors

Proceedings of the 21st annual international conference on Supercomputing
On reducing energy-consumption by late-inserting instructions into the issue queue

ISLPED '07 Proceedings of the 2007 international symposium on Low power electronics and design
Data prefetching and address pre-calculation through instruction pre-execution with two-step physical register deallocation

MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
Energy saving through a simple load control mechanism

ACM SIGARCH Computer Architecture News
Hiding the misprediction penalty of a resource-efficient high-performance processor

ACM Transactions on Architecture and Code Optimization (TACO)
Predicting and Exploiting Transient Values for Reducing Register File Pressure and Energy Consumption

IEEE Transactions on Computers
Improving single-thread performance with fine-grain state maintenance

Proceedings of the 5th conference on Computing frontiers
A Two-Level Load/Store Queue Based on Execution Locality

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Streamlining long latency instructions for seamlessly combined out-of-order and in-order execution

Microprocessors & Microsystems
Skewed redundancy

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Reducing register pressure in SMT processors through L2-miss-driven early register release

ACM Transactions on Architecture and Code Optimization (TACO)
On the potential of latency tolerant execution in speculative multithreading

IFMT '08 Proceedings of the 1st international forum on Next-generation multicore/manycore technologies
Memory-level parallelism aware fetch policies for simultaneous multithreading processors

ACM Transactions on Architecture and Code Optimization (TACO)
A distributed processor state management architecture for large-window processors

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
A mechanistic performance model for superscalar out-of-order processors

ACM Transactions on Computer Systems (TOCS)
Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors

Proceedings of the 36th annual international symposium on Computer architecture
Simultaneous speculative threading: a novel pipeline architecture implemented in sun's rock processor

Proceedings of the 36th annual international symposium on Computer architecture
Checkpoint allocation and release

ACM Transactions on Architecture and Code Optimization (TACO)
Design and optimization of the store vectors memory dependence predictor

ACM Transactions on Architecture and Code Optimization (TACO)
Increasing memory miss tolerance for SIMD cores

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Folding active list for high performance and low power

ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
Exploiting execution locality with a decoupled Kilo-instruction processor

ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
Reducing register file size through instruction pre-execution enhanced by value prediction

ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
Forwardflow: a scalable core for power-constrained CMPs

Proceedings of the 37th annual international symposium on Computer architecture
RETCON: transactional repair without replay

Proceedings of the 37th annual international symposium on Computer architecture
Speculative-aware execution: a simple and efficient technique for utilizing multi-cores to improve single-thread performance

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Reducing delay and power consumption of the wakeup logic through instruction packing and tag memoization

PACS'04 Proceedings of the 4th international conference on Power-Aware Computer Systems
Distributed replay protocol for distributed uniprocessors

Proceedings of the 26th ACM international conference on Supercomputing
Disjoint out-of-order execution processor

ACM Transactions on Architecture and Code Optimization (TACO)
Implicit transactional memory in kilo-instruction multiprocessors

ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Virtual register renaming

ARCS'13 Proceedings of the 26th international conference on Architecture of Computing Systems
Discerning the dominant out-of-order performance advantage: is it speculation or dynamism?

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Control-Flow Decoupling

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Tuning the continual flow pipeline architecture

Proceedings of the 27th international ACM conference on International conference on supercomputing
Virtual register renaming: energy efficient substrate for continual flow pipelines

Proceedings of the 23rd ACM international conference on Great lakes symposium on VLSI
A unified view of non-monotonic core selection and application steering in heterogeneous chip multiprocessors

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
MLP-aware dynamic instruction window resizing for adaptively exploiting both ILP and MLP

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Tuning the continual flow pipeline architecture with virtual register renaming

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Increased integration in the form of multiple processor cores on a single die, relatively constant die sizes, shrinking power envelopes, and emerging applications create a new challenge for processor architects. How to build a processor that provides high single-thread performance and enables multiple of these to be placed on the same die for high throughput while dynamically adapting for future applications? Conventional approaches for high single-thread performance rely on large and complex cores to sustain a large instruction window for memory tolerance, making them unsuitable for multi-core chips. We present Continual Flow Pipelines (CFP) as a new non-blocking processor pipeline architecture that achieves the performance of a large instruction window without requiring cycle-critical structures such as the scheduler and register file to be large. We show that to achieve benefits of a large instruction window, inefficiencies in management of both the scheduler and register file must be addressed, and we propose a unified solution. The non-blocking property of CFP keeps key processor structures affecting cycle time and power (scheduler, register file), and die size (second level cache) small. The memory latency-tolerant CFP core allows multiple cores on a single die while outperforming current processor cores for single-thread applications.