The transition to multithreaded, multi-core designs places a greater responsibility on programmers and software for improving performance; thread-level parallelism (TLP) will be increasingly relied upon in addition to instruction-level parallelism (ILP) and increased clock frequency. Deciding where to try to parallelize code is difficult, especially for large, complex applications or those where the original developers have moved on. Outer loops are relatively easy targets for parallelization, but traditional profilers focus primarily on functions and hot inner loops. To aid in programmers' parallelization efforts, we introduce the concept of loop-centric profiling to provide a hierarchical view of how much time is spent in a loop and the loops nested within it.

This paper introduces two techniques for loop profiling. First, we describe an instrumentation-based approach that gathers highly detailed and accurate information about loop behavior. Second, we present a sampling approach that achieves similar results with negligible overhead. The paper concludes with a case study evaluating the tool on several SPEC 2000 benchmarks.
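
As an illustration of what a hierarchical, loop-centric view might look like, the following Python sketch aggregates samples into inclusive and self counts per loop nest. It is not the paper's implementation: the abstract does not say how loops are identified or how samples are attributed, so the input format (each sample carries the names of its enclosing loops, outermost first) and the names `build_loop_profile` and `format_profile` are assumptions made for this example.

```python
from collections import defaultdict


def build_loop_profile(samples):
    """Aggregate samples into inclusive and self counts per loop nest.

    Assumed (hypothetical) input: each sample is a tuple of loop names,
    outermost first, for the loops enclosing the sampled instruction.
    An empty tuple means the sample fell outside any loop.
    """
    inclusive = defaultdict(int)   # samples in the loop or any loop nested in it
    self_count = defaultdict(int)  # samples whose innermost loop is this one
    for loop_stack in samples:
        # Credit every enclosing loop, from outermost down to innermost.
        for depth in range(1, len(loop_stack) + 1):
            inclusive[loop_stack[:depth]] += 1
        if loop_stack:
            self_count[loop_stack] += 1
    return inclusive, self_count


def format_profile(inclusive, self_count, total):
    """Render the loop nests as an indented, outermost-first listing."""
    lines = []
    # Sorting tuple paths places each loop directly before its nested loops.
    for path in sorted(inclusive):
        indent = "  " * (len(path) - 1)
        pct = 100.0 * inclusive[path] / total
        lines.append(
            f"{indent}{path[-1]}: {pct:.1f}% inclusive, "
            f"{self_count.get(path, 0)} self samples"
        )
    return lines


# Made-up samples, for illustration only.
samples = [
    ("outer",),
    ("outer", "middle"),
    ("outer", "middle", "inner"),
    ("outer", "middle", "inner"),
    (),  # a sample taken outside any loop
]
inclusive, self_count = build_loop_profile(samples)
report = format_profile(inclusive, self_count, total=len(samples))
for line in report:
    print(line)
```

In a listing like this, an outer loop with a high inclusive share but few self samples spends most of its time in the loops nested within it, which is the kind of information a function-level or inner-loop-focused profile does not surface directly.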