Software developers can gain insight into software-hardware interactions by decomposing processor performance into individual cycles-per-instruction (CPI) components that differentiate cycles consumed in active computation from those spent handling various miss events. Constructing accurate CPI components for out-of-order superscalar processors is complicated, however, because computation and miss-event handling overlap. The authors' counter architecture, which uses an analytical superscalar performance model, handles these overlap effects more accurately than existing methods.
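
To make the idea of a CPI decomposition concrete, the following is a minimal sketch of a naive CPI stack, not the authors' counter architecture. It assumes each miss event costs a fixed penalty and that penalties never overlap with computation or with each other, which is exactly the simplification that breaks down on out-of-order processors. The function name, event names, counts, and penalty values are hypothetical and chosen only for illustration.

```python
def naive_cpi_stack(cycles, instructions, miss_counts, penalties):
    """Split total CPI into per-event components plus a base component.

    Assumption (illustrative only): every miss event stalls the processor
    for a fixed number of cycles and no stall overlaps with useful work.
    """
    # Per-event component: (number of events * assumed penalty) per instruction.
    stack = {
        name: miss_counts[name] * penalties[name] / instructions
        for name in miss_counts
    }
    # Whatever is left of the total CPI is attributed to base computation.
    stack["base"] = cycles / instructions - sum(stack.values())
    return stack


# Hypothetical counter readings and penalties.
example = naive_cpi_stack(
    cycles=2_000_000,
    instructions=1_000_000,
    miss_counts={"l2_miss": 2_000, "branch_mispredict": 10_000},
    penalties={"l2_miss": 200, "branch_mispredict": 20},
)
for component, cpi in example.items():
    print(f"{component}: {cpi:.2f}")
```

Because the penalties are simply multiplied out, any cycles in which a miss is hidden behind useful work get charged to the miss component anyway, so the base component is understated whenever overlap occurs.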