Effective performance measurement and analysis of multithreaded applications

Authors:
Nathan R. Tallent;John M. Mellor-Crummey
Affiliations:
Rice University, Houston, TX, USA;Rice University, Houston, TX, USA
Venue:
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2009

Abstract

Understanding why the performance of a multithreaded program does not improve linearly with the number of cores in a shared-memory node populated with one or more multicore processors is a problem of growing practical importance. This paper makes three contributions to performance analysis of multithreaded programs. First, we describe how to measure and attribute parallel idleness, namely, where threads are stalled and unable to work. This technique applies broadly to programming models ranging from explicit threading (e.g., Pthreads) to higher-level models such as Cilk and OpenMP. Second, we describe how to measure and attribute parallel overhead, that is, when a thread is performing miscellaneous work other than executing the user's computation. By employing a combination of compiler support and post-mortem analysis, we incur no measurement cost beyond normal profiling to glean this information. Using idleness and overhead metrics enables one to pinpoint areas of an application where concurrency should be increased (to reduce idleness), decreased (to reduce overhead), or where the present parallelization is hopeless (where idleness and overhead are both high). Third, we describe how to measure and attribute arbitrary performance metrics for high-level multithreaded programming models, such as Cilk. This requires bridging the gap between the expression of logical concurrency in programs and its realization at run-time as it is adaptively partitioned and scheduled onto a pool of threads. We have prototyped these ideas in the context of Rice University's HPCToolkit performance tools. We describe our approach, implementation, and experiences applying this approach to measure and attribute work, idleness, and overhead in executions of Cilk programs.
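
The way idleness and overhead together guide tuning can be illustrated with a minimal sketch. It is not the paper's implementation or HPCToolkit's API: the `ContextMetrics` type, the `diagnose` function, and the 25% cutoff for "high" are all hypothetical, chosen only to make the three cases in the abstract concrete.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the abstract does not define what counts as "high".
HIGH = 0.25


@dataclass
class ContextMetrics:
    """Per-context totals in any consistent unit (for example, samples)."""
    name: str
    work: float
    idleness: float
    overhead: float


def diagnose(m: ContextMetrics, high: float = HIGH) -> str:
    """Map idleness/overhead fractions to the tuning hints in the abstract."""
    total = m.work + m.idleness + m.overhead
    if total == 0:
        return "no data"
    idle_high = m.idleness / total >= high
    over_high = m.overhead / total >= high
    if idle_high and over_high:
        # Both high: the present parallelization is unlikely to pay off.
        return "rethink parallelization"
    if idle_high:
        # Threads are stalled: expose more concurrency.
        return "increase concurrency"
    if over_high:
        # Too much non-user work: use coarser-grained concurrency.
        return "decrease concurrency"
    return "ok"


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    for ctx in [
        ContextMetrics("loop_a", work=50, idleness=40, overhead=10),
        ContextMetrics("loop_b", work=50, idleness=5, overhead=45),
        ContextMetrics("loop_c", work=20, idleness=40, overhead=40),
    ]:
        print(ctx.name, diagnose(ctx))
```

In practice the cutoff would depend on the application and the machine; the point of the sketch is only that the two metrics are read jointly, per program context, rather than in isolation.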