Interaction cost and shotgun profiling

Authors:
Brian A. Fields;Rastislav Bodik;Mark D. Hill;Chris J. Newburn
Affiliations:
University of California, Berkeley, CA;University of California, Berkeley, CA;University of Wisconsin, Madison, WI;Intel Corporation
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2004

Citing 34
Cited 12

Hierarchical performance modeling with MACS: a case study of the convex C-240

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
The impact of architectural trends on operating system performance

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Exceeding the dataflow limit via value prediction

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Dynamic instruction reuse

Proceedings of the 24th annual international symposium on Computer architecture
Continuous profiling: where have all the cycles gone?

ACM Transactions on Computer Systems (TOCS)
ProfileMe: hardware support for instruction-level profiling on out-of-order processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Improving trace cache effectiveness with branch promotion and trace packing

Proceedings of the 25th annual international symposium on Computer architecture
Load latency tolerance in dynamically scheduled processors

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Selective value prediction

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
A scalable approach to thread-level speculation

Proceedings of the 27th annual international symposium on Computer architecture
Performance analysis using the MIPS R10000 performance counters

Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Execution-based prediction using speculative slices

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Speculative precomputation: long-range prefetching of delinquent loads

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Focusing processor policies via critical-path prediction

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Locality vs. criticality

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Power and energy reduction via pipeline balancing

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
The optimum pipeline depth for a microprocessor

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Increasing processor performance by implementing deeper pipelines

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Slack: maximizing performance under technological constraints

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Performance characterization of a hardware mechanism for dynamic optimization

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Reducing power with dynamic critical path information

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Speculative lock elision: enabling highly concurrent multithreaded execution

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Computer architecture: a quantitative approach

Computer architecture: a quantitative approach
Joint local and global hardware adaptations for energy

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Pentium 4 Performance-Monitoring Features

IEEE Micro
Quantifying Instruction Criticality

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
The Impact of Instruction-Level Parallelism on Multiprocessor Performance and Simulation Methodology

HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
A Statistically Rigorous Approach for Improving Simulation Methodology

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Dynamic Prediction of Critical Path Instructions

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Non-vital Loads

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Loose Loops Sink Chips

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Energy-Efficient Processor Design Using Multiple Clock Domains with Dynamic Voltage and Frequency Scaling

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
A First-Order Superscalar Processor Model

Proceedings of the 31st annual international symposium on Computer architecture

Interaction Cost: For When Event Counts Just Don't Add Up

IEEE Micro
A performance counter architecture for computing accurate CPI components

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Efficiently exploring architectural design spaces via predictive modeling

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
An analysis of the effects of miss clustering on the cost of a cache miss

Proceedings of the 4th international conference on Computing frontiers
A Top-Down Approach to Architecting CPI Component Performance Counters

IEEE Micro
Pipeline spectroscopy

Proceedings of the 2007 workshop on Experimental computer science
Pipeline spectroscopy

ecs'07 Experimental computer science on Experimental computer science
Efficient architectural design space exploration via predictive modeling

ACM Transactions on Architecture and Code Optimization (TACO)
Per-thread cycle accounting in SMT processors

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
A mechanistic performance model for superscalar out-of-order processors

ACM Transactions on Computer Systems (TOCS)
Hardware performance monitoring for the rest of us: a position and survey

NPC'11 Proceedings of the 8th IFIP international conference on Network and parallel computing
Computer memory: why we should care what is under the hood

MEMICS'11 Proceedings of the 7th international conference on Mathematical and Engineering Methods in Computer Science

Quantified Score

Hi-index	0.00

Visualization

Abstract

We observe that the challenges software optimizers and microarchitects face every day boil down to a single problem: bottleneck analysis. A bottleneck is any event or resource that contributes to execution time, such as a critical cache miss or window stall. Tasks such as tuning processors for energy efficiency and finding the right loads to prefetch all require measuring the performance costs of bottlenecks.In the past, simple event counts were enough to find the important bottlenecks. Today, the parallelism of modern processors makes such analysis much more difficult, rendering traditional performance counters less useful. If two microarchitectural events (such as a fetch stall and a cache miss) occur in the same cycle, which event should we blame for the cycle? What cost should we assign to each event? In this paper, we introduce a new model for understanding event costs to facilitate processor design and optimization.First, we observe that all instructions, hardware structures, and events in a machine can interact in only one of two ways (in parallel or serially). We quantify these interactions by defining interaction cost, which can be zero (independent, no interaction), positive (parallel), or negative (serial).Second, we illustrate the value of using interaction costs in processor design and optimization. In a processor with a long pipeline, we show how to mitigate the negative performance effect of long latency "critical" loops, such as the level-one cache access and issue-wakeup, by optimizing seemingly unrelated resources that interact with them.Finally, we propose shotgun profiling, a class of hardware profiling infrastructures that are parallelism-aware, in contrast to traditional event counters. Our recommended design requires only modest extensions to current hardware counters, while enabling the construction of full-featured dependence graphs of the microexecution. With these dependence graphs, many types of analyses can be performed, including identifying critical instructions, finding slack, as well as computing costs and interaction costs.