Loose Loops Sink Chips

Authors:
Eric Borch;Srilatha Manne;Joel Emer;Eric Tune
Affiliations:
-;-;-;-
Venue:
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Year:
2002

Citing 0
Cited 71

Low-complexity reorder buffer architecture

ICS '02 Proceedings of the 16th international conference on Supercomputing
Implementing optimizations at decode time

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Dynamic addressing memory arrays with physical locality

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Reducing register ports for higher speed and lower energy

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Reducing register ports using delayed write-back queues and operand pre-fetch

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Predicate prediction for efficient out-of-order execution

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Banked multiported register files for high-frequency superscalar microprocessors

Proceedings of the 30th annual international symposium on Computer architecture
Reducing reorder buffer complexity through selective operand caching

Proceedings of the 2003 international symposium on Low power electronics and design
The microarchitecture of a low power register file

Proceedings of the 2003 international symposium on Low power electronics and design
Using Interaction Costs for Microarchitectural Bottleneck Analysis

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Macro-op Scheduling: Relaxing Scheduling Loop Constraints

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Complexity-Effective Reorder Buffer Designs for Superscalar Processors

IEEE Transactions on Computers
Isolating Short-Lived Operands for Energy Reduction

IEEE Transactions on Computers
Wire Delay is Not a Problem for SMT (In the Near Future)

Proceedings of the 31st annual international symposium on Computer architecture
Use-Based Register Caching with Decoupled Indexing

Proceedings of the 31st annual international symposium on Computer architecture
Physical Register Inlining

Proceedings of the 31st annual international symposium on Computer architecture
Late Allocation and Early Release of Physical Registers

IEEE Transactions on Computers
Interaction cost and shotgun profiling

ACM Transactions on Architecture and Code Optimization (TACO)
Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Balanced Multithreading: Increasing Throughput via a Low Cost Multithreading Hierarchy

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Register Packing: Exploiting Narrow-Width Operands for Reducing Register File Pressure

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Effects of speculation on performance and issue queue design

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
A Speculative Control Scheme for an Energy-Efficient Banked Register File

IEEE Transactions on Computers
RENO: A Rename-Based Instruction Optimizer

Proceedings of the 32nd annual international symposium on Computer Architecture
Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization

Proceedings of the 32nd annual international symposium on Computer Architecture
Store Buffer Design in First-Level Multibanked Data Caches

Proceedings of the 32nd annual international symposium on Computer Architecture
An asymmetric clustered processor based on value content

Proceedings of the 19th annual international conference on Supercomputing
Dynamically configurable shared CMP helper engines for improved performance

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
An automated design flow for 3D microarchitecture evaluation

ASP-DAC '06 Proceedings of the 2006 Asia and South Pacific Design Automation Conference
Microarchitecture evaluation with floorplanning and interconnect pipelining

Proceedings of the 2005 Asia and South Pacific Design Automation Conference
SPARTAN: speculative avoidance of register allocations to transient values for performance and energy efficiency

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Early Register Deallocation Mechanisms Using Checkpointed Register Files

IEEE Transactions on Computers
Selective writeback: exploiting transient values for energy-efficiency and performance

Proceedings of the 2006 international symposium on Low power electronics and design
Register file caching for energy efficiency

Proceedings of the 2006 international symposium on Low power electronics and design
Register port complexity reduction in wide-issue processors with selective instruction execution

Microprocessors & Microsystems
Reducing non-deterministic loads in low-power caches via early cache set resolution

Microprocessors & Microsystems
ReCycle: pipeline adaptation to tolerate process variation

Proceedings of the 34th annual international symposium on Computer architecture
Matrix scheduler reloaded

Proceedings of the 34th annual international symposium on Computer architecture
Late-binding: enabling unordered load-store queues

Proceedings of the 34th annual international symposium on Computer architecture
An L2-miss-driven early register deallocation for SMT processors

Proceedings of the 21st annual international conference on Supercomputing
Power-aware operand delivery

ISLPED '07 Proceedings of the 2007 international symposium on Low power electronics and design
Predicting and Exploiting Transient Values for Reducing Register File Pressure and Energy Consumption

IEEE Transactions on Computers
The revolution inside the box

Communications of the ACM - Web science
Exploiting multilevel parallelism using OpenMP on a massive multithreaded architecture

Journal of Embedded Computing - Issues in embedded single-chip multicore architectures
Asymmetrically banked value-aware register files for low-energy and high-performance

Microprocessors & Microsystems
ReVIVaL: A Variation-Tolerant Architecture Using Voltage Interpolation and Variable Latency

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Achieving Out-of-Order Performance with Almost In-Order Complexity

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Fetch-Criticality Reduction through Control Independence

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Power-efficient clustering via incomplete bypassing

Proceedings of the 13th international symposium on Low power electronics and design
Investigating the effects of fine-grain three-dimensional integration on microarchitecture design

ACM Journal on Emerging Technologies in Computing Systems (JETC)
Reducing register pressure in SMT processors through L2-miss-driven early register release

ACM Transactions on Architecture and Code Optimization (TACO)
Selective writeback: reducing register file pressure and energy consumption

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
A criticality-driven microarchitectural three dimensional (3D) floorplanner

Proceedings of the 2009 Asia and South Pacific Design Automation Conference
Shapeshifter: Dynamically changing pipeline width and speed to address process variations

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Exploring the limits of early register release: Exploiting compiler analysis

ACM Transactions on Architecture and Code Optimization (TACO)
MicroFix: exploiting path-grained timing adaptability for improving power-performance efficiency

Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design
Energy-efficient register caching with compiler assistance

ACM Transactions on Architecture and Code Optimization (TACO)
Architectural assessment of design techniques to improve speed and robustness in embedded microprocessors

Proceedings of the 46th Annual Design Automation Conference
Trifecta: a nonspeculative scheme to exploit common, data-dependent subcritical paths

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Criticality-driven superscalar design space exploration

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Exploiting narrow-width values for thermal-aware register file designs

Proceedings of the Conference on Design, Automation and Test in Europe
MicroFix: Using timing interpolation and delay sensors for power reduction

ACM Transactions on Design Automation of Electronic Systems (TODAES)
On the exploitation of narrow-width values for improving register file reliability

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
CRIB: consolidated rename, issue, and bypass

Proceedings of the 38th annual international symposium on Computer architecture
Energy-efficient mechanisms for managing thread context in throughput processors

Proceedings of the 38th annual international symposium on Computer architecture
Do trace cache, value prediction and prefetching improve SMT throughput?

ARCS'06 Proceedings of the 19th international conference on Architecture of Computing Systems
Identifying and predicting timing-critical instructions to boost timing speculation

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Idempotent processor architecture

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Exploiting narrow values for energy efficiency in the register files of superscalar microprocessors

PATMOS'06 Proceedings of the 16th international conference on Integrated Circuit and System Design: power and Timing Modeling, Optimization and Simulation
A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

ACM Transactions on Computer Systems (TOCS)
RELOCATE: register file local access pattern redistribution mechanism for power and thermal management in out-of-order embedded processor

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers

Quantified Score

Hi-index	0.02

Abstract

This paper explores the concept of micro-architectural loops and discusses their impact on processor pipelines. In particular, we establish the relationship between loose loops and pipeline length and configuration, and show their impact on performance. We then evaluate the load resolution loop in detail and propose the distributed register algorithm (DRA) as a way of reducing this loop. The DRA decreases the performance loss due to load mis-speculations by reducing the issue-to-execute latency in the pipeline. A new loose loop is introduced into the pipeline by the DRA, but the frequency of mis-speculations is very low. The reduction in latency from issue to execute, along with a low mis-speculation rate in the DRA, results in up to a 4% to 15% improvement in performance, as measured with a detailed architectural simulator.