An efficient CPI stack counter architecture for superscalar processors

Authors:
Osman Allam;Stijn Eyerman;Lieven Eeckhout
Affiliations:
Ghent University, Ghent, Belgium;Ghent University, Ghent, Belgium;Ghent University, Ghent, Belgium
Venue:
Proceedings of the Great Lakes Symposium on VLSI
Year:
2012

Abstract

Cycles-Per-Instruction (CPI) stacks provide intuitive and insightful performance information to software developers. Performance bottlenecks are easily identified from CPI stacks, which in turn hint at software changes for improving performance. Computing CPI stacks on contemporary superscalar processors is non-trivial, however, because of various overlap effects. Prior work proposed a CPI counter architecture for computing CPI stacks on out-of-order processors. The accuracy of the resulting CPI stacks was evaluated previously; however, the hardware overhead analysis was not based on a detailed hardware implementation. In this paper, we implement the previously proposed CPI counter architecture in hardware and find that the design can be further optimized. We propose a novel hardware- and power-efficient CPI counter architecture that reduces chip area by 44% and power consumption by 47% compared with the best possible prior design, while maintaining nearly the same level of performance and accuracy.
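
As a rough illustration of what a CPI stack is, the sketch below divides per-component cycle counts by the number of retired instructions and picks the largest non-base component as the likely bottleneck. The component names and counter values are hypothetical and are not taken from the paper. The sketch also assumes each cycle has already been attributed to exactly one component, which is the hard part that a counter architecture has to solve in the presence of overlap effects.

```python
def cpi_stack(cycle_counters, instructions):
    """Convert per-component cycle counts into CPI components.

    cycle_counters: dict mapping a component name to the cycles attributed to it
                    (hypothetical names; assumes no cycle is counted twice).
    instructions:   number of retired instructions in the measured interval.
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return {name: cycles / instructions for name, cycles in cycle_counters.items()}


# Hypothetical counter readings for one measurement interval.
counters = {
    "base": 600_000,               # cycles with no miss event blocking progress
    "l1_icache_miss": 50_000,
    "l2_dcache_miss": 250_000,
    "branch_mispredict": 100_000,
}

stack = cpi_stack(counters, instructions=1_000_000)

# The components add up to the total CPI of the interval.
total_cpi = sum(stack.values())

# The largest non-base component points at the dominant bottleneck.
bottleneck = max((name for name in stack if name != "base"), key=stack.get)

for name, value in stack.items():
    print(f"{name:20s} {value:.3f}")
print(f"{'total':20s} {total_cpi:.3f}")
print("dominant bottleneck:", bottleneck)
```

With these made-up numbers, the L2 data cache miss component is the largest non-base contributor, which is the kind of hint a developer would read off a CPI stack.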