Hierarchical performance modeling with MACS: a case study of the convex C-240
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
The impact of architectural trends on operating system performance
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Exceeding the dataflow limit via value prediction
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Proceedings of the 24th annual international symposium on Computer architecture
Continuous profiling: where have all the cycles gone?
ACM Transactions on Computer Systems (TOCS)
ProfileMe: hardware support for instruction-level profiling on out-of-order processors
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Improving trace cache effectiveness with branch promotion and trace packing
Proceedings of the 25th annual international symposium on Computer architecture
Load latency tolerance in dynamically scheduled processors
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
A scalable approach to thread-level speculation
Proceedings of the 27th annual international symposium on Computer architecture
Performance analysis using the MIPS R10000 performance counters
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Execution-based prediction using speculative slices
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Speculative precomputation: long-range prefetching of delinquent loads
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Focusing processor policies via critical-path prediction
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Power and energy reduction via pipeline balancing
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
The optimum pipeline depth for a microprocessor
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Increasing processor performance by implementing deeper pipelines
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Slack: maximizing performance under technological constraints
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Performance characterization of a hardware mechanism for dynamic optimization
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Reducing power with dynamic critical path information
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Speculative lock elision: enabling highly concurrent multithreaded execution
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Computer architecture: a quantitative approach
Computer architecture: a quantitative approach
Joint local and global hardware adaptations for energy
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Pentium 4 Performance-Monitoring Features
IEEE Micro
Quantifying Instruction Criticality
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
The Impact of Instruction-Level Parallelism on Multiprocessor Performance and Simulation Methodology
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
A Statistically Rigorous Approach for Improving Simulation Methodology
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Dynamic Prediction of Critical Path Instructions
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
A First-Order Superscalar Processor Model
Proceedings of the 31st annual international symposium on Computer architecture
A performance counter architecture for computing accurate CPI components
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Efficiently exploring architectural design spaces via predictive modeling
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
An analysis of the effects of miss clustering on the cost of a cache miss
Proceedings of the 4th international conference on Computing frontiers
Proceedings of the 2007 workshop on Experimental computer science
ecs'07 Experimental computer science on Experimental computer science
Efficient architectural design space exploration via predictive modeling
ACM Transactions on Architecture and Code Optimization (TACO)
Per-thread cycle accounting in SMT processors
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
A mechanistic performance model for superscalar out-of-order processors
ACM Transactions on Computer Systems (TOCS)
Hardware performance monitoring for the rest of us: a position and survey
NPC'11 Proceedings of the 8th IFIP international conference on Network and parallel computing
Computer memory: why we should care what is under the hood
MEMICS'11 Proceedings of the 7th international conference on Mathematical and Engineering Methods in Computer Science
Hi-index | 0.00 |
We observe that the challenges software optimizers and microarchitects face every day boil down to a single problem: bottleneck analysis. A bottleneck is any event or resource that contributes to execution time, such as a critical cache miss or window stall. Tasks such as tuning processors for energy efficiency and finding the right loads to prefetch all require measuring the performance costs of bottlenecks.In the past, simple event counts were enough to find the important bottlenecks. Today, the parallelism of modern processors makes such analysis much more difficult, rendering traditional performance counters less useful. If two microarchitectural events (such as a fetch stall and a cache miss) occur in the same cycle, which event should we blame for the cycle? What cost should we assign to each event? In this paper, we introduce a new model for understanding event costs to facilitate processor design and optimization.First, we observe that all instructions, hardware structures, and events in a machine can interact in only one of two ways (in parallel or serially). We quantify these interactions by defining interaction cost, which can be zero (independent, no interaction), positive (parallel), or negative (serial).Second, we illustrate the value of using interaction costs in processor design and optimization. In a processor with a long pipeline, we show how to mitigate the negative performance effect of long latency "critical" loops, such as the level-one cache access and issue-wakeup, by optimizing seemingly unrelated resources that interact with them.Finally, we propose shotgun profiling, a class of hardware profiling infrastructures that are parallelism-aware, in contrast to traditional event counters. Our recommended design requires only modest extensions to current hardware counters, while enabling the construction of full-featured dependence graphs of the microexecution. With these dependence graphs, many types of analyses can be performed, including identifying critical instructions, finding slack, as well as computing costs and interaction costs.