A mechanistic performance model for superscalar out-of-order processors

Authors:
Stijn Eyerman;Lieven Eeckhout;Tejas Karkhanis;James E. Smith
Affiliations:
Ghent University, Ghent, Belgium;Ghent University, Ghent, Belgium;Advanced Micro Devices, Sunnyvale, CA;University of Wisconsin -- Madison, Madison, WI
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
2009

Citing 38
Cited 17

Optimal pipelining in supercomputers

ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Characterization of branch and data dependencies on programs for evaluating pipeline performance

IEEE Transactions on Computers
Optimal pipelining

Journal of Parallel and Distributed Computing
Limits of instruction-level parallelism

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Theoretical modeling of superscalar processor performance

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
The SimpleScalar tool set, version 2.0

ACM SIGARCH Computer Architecture News
Understanding some simple processor-performance limits

IBM Journal of Research and Development - Special issue: performance analysis and its impact on design
Analytic evaluation of shared-memory systems with ILP processors

Proceedings of the 25th annual international symposium on Computer architecture
Clock rate versus IPC: the end of the road for conventional microarchitectures

Proceedings of the 27th annual international symposium on Computer architecture
An exploration of instruction fetch requirement in out-of-order superscalar processors

International Journal of Parallel Programming - parallel architectures and compilation techniques, part II
The optimum pipeline depth for a microprocessor

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Increasing processor performance by implementing deeper pipelines

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Automatically characterizing large scale program behavior

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Instruction Window Size Trade-Offs and Characterization of Program Parallelism

IEEE Transactions on Computers
Hybrid Analytical-Statistical Modeling for Efficiently Exploring Architecture and Workload Design Spaces

Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
Optimizing pipelines for power and performance

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
A Framework for Statistical Modeling of Superscalar Processor Performance

HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Exploring Instruction-Fetch Bandwidth Requirement in Wide-Issue Superscalar Processors

PACT '99 Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
An Instruction Throughput Model of Superscalar Processors

RSP '03 Proceedings of the 14th IEEE International Workshop on Rapid System Prototyping (RSP'03)
Miss Rate Prediction across All Program Inputs

Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques
Optimum Power/Performance Pipeline Depth

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Microarchitecture Optimizations for Exploiting Memory-Level Parallelism

Proceedings of the 31st annual international symposium on Computer architecture
A First-Order Superscalar Processor Model

Proceedings of the 31st annual international symposium on Computer architecture
Interaction cost and shotgun profiling

ACM Transactions on Architecture and Code Optimization (TACO)
Continual flow pipelines

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Toward kilo-instruction processors

ACM Transactions on Architecture and Code Optimization (TACO)
Fast data-locality profiling of native execution

SIGMETRICS '05 Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
An analytical model for cache replacement policy performance

SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
A performance counter architecture for computing accurate CPI components

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Accurate and efficient regression modeling for microarchitectural performance and power prediction

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Efficiently exploring architectural design spaces via predictive modeling

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
A Predictive Performance Model for Superscalar Processors

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Automated design of application specific superscalar processors: an analytical approach

Proceedings of the 34th annual international symposium on Computer architecture
A Top-Down Approach to Architecting CPI Component Performance Counters

IEEE Micro
The Inhibition of Potential Parallelism by Conditional Jumps

IEEE Transactions on Computers
An Instruction Throughput Model of Superscalar Processors

IEEE Transactions on Computers

Interval-based models for run-time DVFS orchestration in superscalar processors

Proceedings of the 7th ACM international conference on Computing frontiers
Applied inference: Case studies in microarchitectural design

ACM Transactions on Architecture and Code Optimization (TACO)
Fast modeling of shared caches in multicore systems

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Efficient interaction between OS and architecture in heterogeneous platforms

ACM SIGOPS Operating Systems Review
Fine-grained DVFS using on-chip regulators

ACM Transactions on Architecture and Code Optimization (TACO)
A statistical performance model of the opteron processor

ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Modeling program resource demand using inherent program characteristics

Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Modeling program resource demand using inherent program characteristics

ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulation

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
An efficient CPI stack counter architecture for superscalar processors

Proceedings of the great lakes symposium on VLSI
Scheduling heterogeneous multi-cores through Performance Impact Estimation (PIE)

Proceedings of the 39th Annual International Symposium on Computer Architecture
A first-order mechanistic model for architectural vulnerability factor

Proceedings of the 39th Annual International Symposium on Computer Architecture
Per-thread cycle accounting in multicore processors

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Efficient Reuse Distance Analysis of Multicore Scaling for Loop-Based Parallel Programs

ACM Transactions on Computer Systems (TOCS)
Accurately modeling superscalar processor performance with reduced trace

Journal of Parallel and Distributed Computing
Predicting Performance Impact of DVFS for Realistic Memory Systems

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Power-performance modeling on asymmetric multi-cores

Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

A mechanistic model for out-of-order superscalar processors is developed and then applied to the study of microarchitecture resource scaling. The model divides execution time into intervals separated by disruptive miss events such as branch mispredictions and cache misses. Each type of miss event results in characterizable performance behavior for the execution time interval. By considering an interval's type and length (measured in instructions), execution time can be predicted for the interval. Overall execution time is then determined by aggregating the execution time over all intervals. The mechanistic model provides several advantages over prior modeling approaches, and, when estimating performance, it differs from detailed simulation of a 4-wide out-of-order processor by an average of 7%. The mechanistic model is applied to the general problem of resource scaling in out-of-order superscalar processors. First, we use the model to determine size relationships among microarchitecture structures in a balanced processor design. Second, we use the mechanistic model to study scaling of both pipeline depth and width in balanced processor designs. We corroborate previous results in this area and provide new results. For example, we show that at optimal design points, the pipeline depth times the square root of the processor width is nearly constant. Finally, we consider the behavior of unbalanced, overprovisioned processor designs based on insight gained from the mechanistic model. We show that in certain situations an overprovisioned processor may lead to improved overall performance. Designs where a processor's dispatch width is wider than its issue width are of particular interest.