A large, fast instruction window for tolerating cache misses

Authors:
Alvin R. Lebeck;Jinson Koppanalil;Tong Li;Jaidev Patwardhan;Eric Rotenberg
Affiliations:
Duke University, Durham, NC;North Carolina State University, Raleigh, NC;Duke University, Durham, NC;Duke University, Durham, NC;North Carolina State University, Raleigh, NC
Venue:
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Year:
2002

Citing 27
Cited 82

Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers

IEEE Transactions on Computers
Multiscalar processors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Complexity-effective superscalar processors

Proceedings of the 24th annual international symposium on Computer architecture
Trace processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
A dynamic multithreading processor

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Improving prediction for procedure returns with return-address-stack repair mechanisms

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Branch Prediction, Instruction-Window Size, and Cache Size: Performance Trade-Offs and Simulation Techniques

IEEE Transactions on Computers
A low-complexity issue logic

Proceedings of the 14th international conference on Supercomputing
Understanding the backward slices of performance degrading instructions

Proceedings of the 27th annual international symposium on Computer architecture
Circuits for wide-window superscalar processors

Proceedings of the 27th annual international symposium on Computer architecture
Clock rate versus IPC: the end of the road for conventional microarchitectures

Proceedings of the 27th annual international symposium on Computer architecture
Multiple-banked register file architectures

Proceedings of the 27th annual international symposium on Computer architecture
Two-level hierarchical register file organization for VLIW processors

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Reducing the complexity of the issue logic

ICS '01 Proceedings of the 15th international conference on Supercomputing
Dynamically allocating processor resources between nearby and distant ILP

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Focusing processor policies via critical-path prediction

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Power and energy reduction via pipeline balancing

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Energy-effective issue logic

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
A design space evaluation of grid processor architectures

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Select-free instruction scheduling logic

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Reducing the complexity of the register file in dynamic superscalar processors

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
SPEC CPU2000: Measuring CPU Performance in the New Millennium

Computer
The Alpha 21264 Microprocessor

IEEE Micro
Early Experiences with Olden

Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Data-Flow Prescheduling for Large Instruction Windows in Out-of-Order Processors

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Characterizing and removing branch mispredictions

Characterizing and removing branch mispredictions
POWER4 system microarchitecture

IBM Journal of Research and Development

Cherry: checkpointed early resource recycling in out-of-order microprocessors

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Hierarchical Scheduling Windows

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Power-aware issue queue design for speculative instructions

Proceedings of the 40th annual Design Automation Conference
Enhancing memory level parallelism via recovery-free value prediction

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Dynamic Data Dependence Tracking and its Application to Branch Prediction

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Half-price architecture

Proceedings of the 30th annual international symposium on Computer architecture
Cyclone: a broadcast-free dynamic instruction scheduler with selective replay

Proceedings of the 30th annual international symposium on Computer architecture
Scalable Hardware Memory Disambiguation for High ILP Processors

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Reducing Design Complexity of the Load/Store Queue

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Macro-op Scheduling: Relaxing Scheduling Loop Constraints

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Scaling the issue window with look-ahead latency prediction

Proceedings of the 18th annual international conference on Supercomputing
Physical Register Inlining

Proceedings of the 31st annual international symposium on Computer architecture
A low-power in-order/out-of-order issue queue

ACM Transactions on Architecture and Code Optimization (TACO)
Late Allocation and Early Release of Physical Registers

IEEE Transactions on Computers
A case for resource-conscious out-of-order processors: towards kilo-instruction in-flight processors

MEDEA '03 Proceedings of the 2003 workshop on MEmory performance: DEaling with Applications , systems and architecture
Continual flow pipelines

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Continual Flow Pipelines: Achieving Resource-Efficient Latency Tolerance

IEEE Micro
Scalable Hardware Memory Disambiguation for High-ILP Processors

IEEE Micro
Effects of speculation on performance and issue queue design

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Toward kilo-instruction processors

ACM Transactions on Architecture and Code Optimization (TACO)
An analysis of a resource efficient checkpoint architecture

ACM Transactions on Architecture and Code Optimization (TACO)
Scalable Load and Store Processing in Latency Tolerant Processors

Proceedings of the 32nd annual international symposium on Computer Architecture
Store Buffer Design in First-Level Multibanked Data Caches

Proceedings of the 32nd annual international symposium on Computer Architecture
Enhancing Memory-Level Parallelism via Recovery-Free Value Prediction

IEEE Transactions on Computers
Instruction packing: reducing power and delay of the dynamic scheduling logic

ISLPED '05 Proceedings of the 2005 international symposium on Low power electronics and design
Kilo-Instruction Processors: Overcoming the Memory Wall

IEEE Micro
Tornado warning: the perils of selective replay in multithreaded processors

Proceedings of the 19th annual international conference on Supercomputing
Low-power, low-complexity instruction issue using compiler assistance

Proceedings of the 19th annual international conference on Supercomputing
Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Memory State Compressors for Giga-Scale Checkpoint/Restore

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Reducing the Energy of Speculative Instruction Schedulers

ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Power-Efficient Wakeup Tag Broadcast

ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Speculative execution for hiding memory latency

MEDEA '04 Proceedings of the 2004 workshop on MEmory performance: DEaling with Applications , systems and architecture
Scalable Load and Store Processing in Latency-Tolerant Processors

IEEE Micro
Kilo-instruction processors, runahead and prefetching

Proceedings of the 3rd conference on Computing frontiers
Instruction packing: Toward fast and energy-efficient instruction scheduling

ACM Transactions on Architecture and Code Optimization (TACO)
CAVA: Using checkpoint-assisted value prediction to hide L2 misses

ACM Transactions on Architecture and Code Optimization (TACO)
SEED: scalable, efficient enforcement of dependences

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Energy-efficient dynamic instruction scheduling logic through instruction grouping

Proceedings of the 2006 international symposium on Low power electronics and design
Exploiting Operand Availability for Efficient Simultaneous Multithreading

IEEE Transactions on Computers
Register port complexity reduction in wide-issue processors with selective instruction execution

Microprocessors & Microsystems
Unified microprocessor core storage

Proceedings of the 4th international conference on Computing frontiers
By-passing the out-of-order execution pipeline to increase energy-efficiency

Proceedings of the 4th international conference on Computing frontiers
Matrix scheduler reloaded

Proceedings of the 34th annual international symposium on Computer architecture
Ginger: control independence using tag rewriting

Proceedings of the 34th annual international symposium on Computer architecture
Transparent control independence (TCI)

Proceedings of the 34th annual international symposium on Computer architecture
On reducing energy-consumption by late-inserting instructions into the issue queue

ISLPED '07 Proceedings of the 2007 international symposium on Low power electronics and design
Scalable Dynamic Instruction Scheduler through Wake-Up Spatial Locality

IEEE Transactions on Computers
Building a large instruction window through ROB compression

MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
Future ILP processors

International Journal of High Performance Computing and Networking
A Two-Level Load/Store Queue Based on Execution Locality

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Hiding cache miss penalty using priority-based execution for embedded processors

Proceedings of the conference on Design, automation and test in Europe
Streamlining long latency instructions for seamlessly combined out-of-order and in-order execution

Microprocessors & Microsystems
Skewed redundancy

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
A low-complexity microprocessor design with speculative pre-execution

Journal of Systems Architecture: the EUROMICRO Journal
On the potential of latency tolerant execution in speculative multithreading

IFMT '08 Proceedings of the 1st international forum on Next-generation multicore/manycore technologies
HeDGE: Hybrid Dataflow Graph Execution in the Issue Logic

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
A performance-correctness explicitly-decoupled architecture

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
A complexity-effective microprocessor design with decoupled dispatch queues and prefetching

Parallel Computing
Accurate Instruction Pre-scheduling in Dynamically Scheduled Processors

Transactions on High-Performance Embedded Architectures and Compilers II
An energy-efficient instruction scheduler design with two-level shelving and adaptive banking

Journal of Computer Science and Technology
Simultaneous speculative threading: a novel pipeline architecture implemented in sun's rock processor

Proceedings of the 36th annual international symposium on Computer architecture
An energy-efficient checkpointing mechanism for out of order commit processor

Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design
Design and optimization of the store vectors memory dependence predictor

ACM Transactions on Architecture and Code Optimization (TACO)
Folding active list for high performance and low power

ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
Exploiting execution locality with a decoupled Kilo-instruction processor

ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
Forwardflow: a scalable core for power-constrained CMPs

Proceedings of the 37th annual international symposium on Computer architecture
Energy-efficient dynamic instruction scheduling logic through instruction grouping

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
CROB: implementing a large instruction window through compression

Transactions on high-performance embedded architectures and compilers III
Energy-efficient mechanisms for managing thread context in throughput processors

Proceedings of the 38th annual international symposium on Computer architecture
Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Non-uniform instruction scheduling

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Instruction recirculation: eliminating counting logic in wakeup-free schedulers

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Reducing delay and power consumption of the wakeup logic through instruction packing and tag memoization

PACS'04 Proceedings of the 4th international conference on Power-Aware Computer Systems
A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

ACM Transactions on Computer Systems (TOCS)
Disjoint out-of-order execution processor

ACM Transactions on Architecture and Code Optimization (TACO)
Tuning the continual flow pipeline architecture

Proceedings of the 27th international ACM conference on International conference on supercomputing
Virtual register renaming: energy efficient substrate for continual flow pipelines

Proceedings of the 23rd ACM international conference on Great lakes symposium on VLSI
MLP-aware dynamic instruction window resizing for adaptively exploiting both ILP and MLP

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Tuning the continual flow pipeline architecture with virtual register renaming

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.01

Visualization

Abstract

Instruction window size is an important design parameter for many modern processors. Large instruction windows offer the potential advantage of exposing large amounts of instruction level parallelism. Unfortunately naively scaling conventional window designs can significantly degrade clock cycle time, undermining the benefits of increased parallelism.This paper presents a new instruction window design targeted at achieving the latency tolerance of large windows with the clock cycle time of small windows. The key observation is that instructions dependent on a long latency operation (e.g., cache miss) cannot execute until that source operation completes. These instructions are moved out of the conventional, small, issue queue to a much larger waiting instruction buffer (WIB). When the long latency operation completes, the instructions are reinserted into the issue queue. In this paper, we focus specifically on load cache misses and their dependent instructions. Simulations reveal that, for an 8-way processor, a 2K-entry WIB with a 32-entry issue queue can achieve speedups of 20%, 84%, and 50% over a conventional 32-entry issue queue for a subset of the SPEC CINT2000, SPEC CFP2000, and Olden benchmarks, respectively.