Theoretical modeling of superscalar processor performance
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Continuous profiling: where have all the cycles gone?
ACM Transactions on Computer Systems (TOCS)
ProfileMe: hardware support for instruction-level profiling on out-of-order processors
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Performance characterization of a Quad Pentium Pro SMP using OLTP workloads
Proceedings of the 25th annual international symposium on Computer architecture
Performance of database workloads on shared-memory systems with out-of-order processors
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Performance analysis using the MIPS R10000 performance counters
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
The optimum pipeline depth for a microprocessor
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Pentium 4 Performance-Monitoring Features
IEEE Micro
DBMSs on a Modern Processor: Where Does Time Go?
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
A Framework for Statistical Modeling of Superscalar Processor Performance
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Exploring Instruction-Fetch Bandwidth Requirement in Wide-Issue Superscalar Processors
PACT '99 Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
An Instruction Throughput Model of Superscalar Processors
RSP '03 Proceedings of the 14th IEEE International Workshop on Rapid System Prototyping (RSP'03)
A First-Order Superscalar Processor Model
Proceedings of the 31st annual international symposium on Computer architecture
Interaction cost and shotgun profiling
ACM Transactions on Architecture and Code Optimization (TACO)
Automated design of application specific superscalar processors: an analytical approach
Proceedings of the 34th annual international symposium on Computer architecture
Dynamic voltage frequency scaling for multi-tasking systems using online learning
ISLPED '07 Proceedings of the 2007 international symposium on Low power electronics and design
Predictive design space exploration using genetically programmed response surfaces
Proceedings of the 45th annual Design Automation Conference
A dollar from 15 cents: cross-platform management for internet services
ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Exploring and predicting the architecture/optimising compiler co-design space
CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
Multi-optimization power management for chip multiprocessors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Effective performance measurement and analysis of multithreaded applications
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Per-thread cycle accounting in SMT processors
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
A mechanistic performance model for superscalar out-of-order processors
ACM Transactions on Computer Systems (TOCS)
Dynamic performance tuning for speculative threads
Proceedings of the 36th annual international symposium on Computer architecture
System-level power management using online learning
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Studying compiler optimizations on superscalar processors through interval analysis
HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
Efficient interaction between OS and architecture in heterogeneous platforms
ACM SIGOPS Operating Systems Review
Modeling program resource demand using inherent program characteristics
Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
HeteroScouts: hardware assist for OS scheduling in heterogeneous CMPs
Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Predictive coordination of multiple on-chip resources for chip multiprocessors
Proceedings of the international conference on Supercomputing
Modeling program resource demand using inherent program characteristics
ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
HeteroScouts: hardware assist for OS scheduling in heterogeneous CMPs
ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs
ACM Transactions on Architecture and Code Optimization (TACO)
Clearing the clouds: a study of emerging scale-out workloads on modern hardware
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
ACM Transactions on Embedded Computing Systems (TECS)
An efficient CPI stack counter architecture for superscalar processors
Proceedings of the great lakes symposium on VLSI
Probabilistic shared cache management (PriSM)
Proceedings of the 39th Annual International Symposium on Computer Architecture
Dynamically dispatching speculative threads to improve sequential execution
ACM Transactions on Architecture and Code Optimization (TACO)
Quantifying the Mismatch between Emerging Scale-Out Applications and Modern Processors
ACM Transactions on Computer Systems (TOCS)
Fair CPU time accounting in CMP+SMT processors
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Microarchitectural design space exploration made fast
Microprocessors & Microsystems
From A to E: analyzing TPC's OLTP benchmarks: the obsolete, the ubiquitous, the unexplored
Proceedings of the 16th International Conference on Extending Database Technology
Inferred Models for Dynamic and Sparse Hardware-Software Spaces
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
OLTP in wonderland: where do cache misses come from in major OLTP components?
Proceedings of the Ninth International Workshop on Data Management on New Hardware
Framework for a productive performance optimization
Parallel Computing
Ubik: efficient cache sharing with strict qos for latency-critical workloads
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
A common way of representing processor performance is to use Cycles per Instruction (CPI) `stacks' which break performance into a baseline CPI plus a number of individual miss event CPI components. CPI stacks can be very helpful in gaining insight into the behavior of an application on a given microprocessor; consequently, they are widely used by software application developers and computer architects. However, computing CPI stacks on superscalar out-of-order processors is challenging because of various overlaps among execution and miss events (cache misses, TLB misses, and branch mispredictions).This paper shows that meaningful and accurate CPI stacks can be computed for superscalar out-of-order processors. Using interval analysis, a novel method for analyzing out-of-order processor performance, we gain understanding into the performance impact of the various miss events. Based on this understanding, we propose a novel way of architecting hardware performance counters for building accurate CPI stacks. The additional hardware for implementing these counters is limited and comparable to existing hardware performance counter architectures while being significantly more accurate than previous approaches.