A performance counter architecture for computing accurate CPI components

Authors: Stijn Eyerman (Ghent University); Lieven Eeckhout (Ghent University); Tejas Karkhanis (University of Wisconsin-Madison); James E. Smith (University of Wisconsin-Madison)
Venue: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Year: 2006

Abstract

A common way of representing processor performance is to use Cycles per Instruction (CPI) 'stacks', which break performance into a baseline CPI plus a number of individual miss event CPI components. CPI stacks can be very helpful in gaining insight into the behavior of an application on a given microprocessor; consequently, they are widely used by software application developers and computer architects. However, computing CPI stacks on superscalar out-of-order processors is challenging because of various overlaps among execution and miss events (cache misses, TLB misses, and branch mispredictions).

This paper shows that meaningful and accurate CPI stacks can be computed for superscalar out-of-order processors. Using interval analysis, a novel method for analyzing out-of-order processor performance, we gain insight into the performance impact of the various miss events. Based on this understanding, we propose a novel way of architecting hardware performance counters for building accurate CPI stacks. The additional hardware for implementing these counters is limited and comparable to that of existing hardware performance counter architectures, while the resulting CPI stacks are significantly more accurate than those of previous approaches.
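
To make the additive structure of a CPI stack concrete, the following is a minimal sketch and not the counter architecture proposed in the paper. It assumes that cycles have already been attributed to a base component and to each miss event without double counting, which is exactly the hard part on an out-of-order processor. The function name `cpi_stack`, the event names, and all counter values are hypothetical.

```python
def cpi_stack(instructions, base_cycles, miss_cycles):
    """Build a CPI stack from already-attributed cycle counts.

    instructions: number of committed instructions (must be positive).
    base_cycles:  cycles attributed to the baseline (no miss events).
    miss_cycles:  dict mapping a miss event name to the cycles
                  attributed to that event.
    Returns (stack, total), where stack maps component name to its CPI
    contribution and total is the sum of all components.
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    stack = {"base": base_cycles / instructions}
    for event, cycles in miss_cycles.items():
        stack[event] = cycles / instructions
    total = sum(stack.values())
    return stack, total


# Hypothetical counter readings, chosen only for illustration.
counters = {
    "l2_dcache_miss": 400_000,
    "branch_mispredict": 250_000,
    "icache_miss": 100_000,
    "dtlb_miss": 50_000,
}
stack, total = cpi_stack(1_000_000, 500_000, counters)

for name, cpi in stack.items():
    print(f"{name:18s} {cpi:.3f}")
print(f"{'total':18s} {total:.3f}")
```

Because each component is simply attributed cycles divided by committed instructions, the components sum to the overall CPI only if every cycle is charged to exactly one component. Overlapping miss events violate that condition under naive counting, which is the accuracy problem the abstract describes.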