The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays

Authors:
M. S. Hrishikesh;Doug Burger;Norman P. Jouppi;Stephen W. Keckler;Keith I. Farkas;Premkishore Shivakumar
Affiliations:
The University of Texas, Austin;The University of Texas, Austin;Compaq Computer Corporation;The University of Texas, Austin;Compaq Computer Corporation;The University of Texas, Austin
Venue:
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Year:
2002

Abstract

Microprocessor clock frequency has improved by nearly 40% annually over the past decade. This improvement has been provided, in equal measure, by smaller technologies and deeper pipelines. From our study of the SPEC 2000 benchmarks, we find that for a high-performance architecture implemented in 100 nm technology, the optimal clock period is approximately 8 fan-out-of-four (FO4) inverter delays for integer benchmarks, comprising 6 FO4 of useful work and about 2 FO4 of overhead. The optimal clock period for floating-point benchmarks is 6 FO4. We find these optimal points to be insensitive to latch and clock skew overheads. Our study indicates that further pipelining can at best improve the performance of integer programs by a factor of 2 over current designs. At these high clock frequencies, it will be difficult to design the instruction issue window to operate in a single cycle. Consequently, we propose and evaluate a high-frequency design called a segmented instruction window.
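The clock-period arithmetic in the abstract can be illustrated with a minimal sketch. The function names are hypothetical, and the FO4 delay of 36 ps is an assumed rule-of-thumb value for a 100 nm process (roughly 360 ps per micron of drawn gate length); the abstract does not state it, so the resulting frequency is only indicative.

```python
def clock_period_fo4(useful_fo4: float, overhead_fo4: float) -> float:
    """Clock period in FO4 delays: useful logic per stage plus latch/skew overhead."""
    return useful_fo4 + overhead_fo4


def clock_frequency_ghz(period_fo4: float, fo4_delay_ps: float) -> float:
    """Convert a clock period in FO4 delays to a frequency in GHz."""
    period_ps = period_fo4 * fo4_delay_ps
    return 1000.0 / period_ps  # 1000 ps per ns, so 1000 / ps gives GHz


# ASSUMPTION: 36 ps per FO4 at 100 nm; not a figure taken from the abstract.
ASSUMED_FO4_DELAY_PS = 36.0

# Integer benchmarks, per the abstract: 6 FO4 of useful work + about 2 FO4 of overhead.
integer_period = clock_period_fo4(useful_fo4=6, overhead_fo4=2)
overhead_share = 2 / integer_period

print(integer_period)                                         # 8
print(overhead_share)                                         # 0.25
print(clock_frequency_ghz(integer_period, ASSUMED_FO4_DELAY_PS))
```

Under these assumptions, a quarter of each integer-pipeline cycle goes to overhead rather than useful logic, which suggests why shortening the stage much further yields diminishing returns.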