Understanding why the performance of a multithreaded program does not improve linearly with the number of cores in a shared-memory node populated with one or more multicore processors is a problem of growing practical importance. This paper makes three contributions to performance analysis of multithreaded programs. First, we describe how to measure and attribute parallel idleness, namely, where threads are stalled and unable to work. This technique applies broadly to programming models ranging from explicit threading (e.g., Pthreads) to higher-level models such as Cilk and OpenMP. Second, we describe how to measure and attribute parallel overhead, namely, when a thread is performing miscellaneous work other than executing the user's computation. By employing a combination of compiler support and post-mortem analysis, we incur no measurement cost beyond normal profiling to obtain this information. Together, the idleness and overhead metrics pinpoint areas of an application where concurrency should be increased (to reduce idleness), decreased (to reduce overhead), or where the present parallelization is hopeless (where idleness and overhead are both high). Third, we describe how to measure and attribute arbitrary performance metrics for high-level multithreaded programming models, such as Cilk. This requires bridging the gap between the expression of logical concurrency in programs and its realization at run time as it is adaptively partitioned and scheduled onto a pool of threads. We have prototyped these ideas in the context of Rice University's HPCToolkit performance tools. We describe our approach, implementation, and experiences applying this approach to measure and attribute work, idleness, and overhead in executions of Cilk programs.
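The decision rule stated above (increase concurrency where idleness is high, decrease it where overhead is high, and reconsider the parallelization where both are high) can be illustrated with a minimal sketch. This is not HPCToolkit code or its API: the `RegionMetrics` record, the `diagnose` function, and the 20% threshold are all hypothetical, chosen only to make the rule concrete. It assumes that work, idleness, and overhead have already been attributed to a program region as sample counts.

```python
from dataclasses import dataclass


@dataclass
class RegionMetrics:
    """Sample counts attributed to one program region (hypothetical layout)."""
    work: float      # samples spent executing the user's computation
    idleness: float  # samples charged for threads stalled and unable to work
    overhead: float  # samples spent on miscellaneous non-user work


def diagnose(m: RegionMetrics, threshold: float = 0.2) -> str:
    """Apply the idleness/overhead rule; the threshold is an assumption."""
    total = m.work + m.idleness + m.overhead
    if total == 0:
        return "no samples"
    idle_high = m.idleness / total > threshold
    overhead_high = m.overhead / total > threshold
    if idle_high and overhead_high:
        return "rethink parallelization"   # both high: present scheme is hopeless
    if idle_high:
        return "increase concurrency"      # reduce idleness
    if overhead_high:
        return "decrease concurrency"      # reduce overhead
    return "no action"


if __name__ == "__main__":
    print(diagnose(RegionMetrics(work=50, idleness=40, overhead=10)))
```

In practice the cutoff for "high" would depend on the application and the machine; a fixed fraction is used here only to keep the sketch short.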