Increasing processor performance by implementing deeper pipelines

Authors: Eric Sprangle; Doug Carmean
Affiliations: Intel Corporation; Intel Corporation
Venue: ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Year: 2002


Abstract

One architectural method for increasing processor performance involves increasing the frequency by implementing deeper pipelines. This paper will explore the relationship between performance and pipeline depth using a Pentium® 4-like architecture as a baseline and will show that deeper pipelines can continue to increase performance. This paper will show that branch misprediction latency is the single largest contributor to performance degradation as pipelines are stretched, and therefore branch prediction and fast branch recovery will continue to increase in importance. We will also show that higher-performance cores, implemented with longer pipelines for example, will put more pressure on the memory system and therefore require larger on-chip caches. Finally, we will show that in the same process technology, designing deeper pipelines can increase the processor frequency by 100%, which, when combined with larger on-chip caches, can yield performance improvements of 35% to 90% over a Pentium® 4-like processor.
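The trade-off described in the abstract can be illustrated with a toy first-order model. This is a minimal sketch and not the authors' simulation methodology: it assumes cycle time is a fixed logic delay divided evenly across stages plus a constant per-stage latch overhead, and that each mispredicted branch costs a number of cycles equal to the pipeline depth. All function names and parameter values (2000 ps of logic, 50 ps of overhead, 0.005 mispredictions per instruction, a 20-stage baseline) are hypothetical and chosen only for illustration.

```python
# Toy model of performance vs. pipeline depth. Every constant below is an
# illustrative assumption, not a figure taken from the paper.

def cycle_time_ps(depth, logic_ps=2000.0, overhead_ps=50.0):
    """Clock period: logic split evenly over `depth` stages plus latch overhead."""
    return logic_ps / depth + overhead_ps


def cycles_per_instruction(depth, base_cpi=1.0, mispredicts_per_instr=0.005):
    """CPI with a misprediction penalty assumed equal to the pipeline depth."""
    return base_cpi + mispredicts_per_instr * depth


def time_per_instruction_ps(depth, logic_ps=2000.0, overhead_ps=50.0,
                            base_cpi=1.0, mispredicts_per_instr=0.005):
    """Average time per instruction = clock period * CPI."""
    return (cycle_time_ps(depth, logic_ps, overhead_ps)
            * cycles_per_instruction(depth, base_cpi, mispredicts_per_instr))


def frequency_ratio(depth, baseline_depth=20, **clock_params):
    """Clock frequency at `depth` relative to the baseline depth."""
    return (cycle_time_ps(baseline_depth, **clock_params)
            / cycle_time_ps(depth, **clock_params))


def relative_performance(depth, baseline_depth=20, **params):
    """Speedup at `depth` relative to the baseline depth."""
    return (time_per_instruction_ps(baseline_depth, **params)
            / time_per_instruction_ps(depth, **params))


if __name__ == "__main__":
    for depth in (20, 30, 40, 60, 80):
        print(depth,
              round(frequency_ratio(depth), 3),
              round(relative_performance(depth), 3))
```

Under these assumptions the speedup from a deeper pipeline is always smaller than the frequency gain, and the gap closes only when `mispredicts_per_instr` is set to zero, which is consistent with the abstract's emphasis on branch prediction and fast branch recovery. The model omits memory-system effects, so it does not capture the abstract's point about larger on-chip caches.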