Analyzing lock contention in multithreaded applications

Authors:
Nathan R. Tallent;John M. Mellor-Crummey;Allan Porterfield
Affiliations:
Rice University, Houston, TX, USA;Rice University, Houston, TX, USA;Renaissance Computing Institute, Chapel Hill, NC, USA
Venue:
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Year:
2010

Abstract

Many programs exploit shared-memory parallelism using multithreading. Threaded codes typically use locks to coordinate access to shared data. In many cases, contention for locks reduces parallel efficiency and hurts scalability. Being able to quantify and attribute lock contention is important for understanding where a multithreaded program needs improvement. This paper proposes and evaluates three strategies for gaining insight into performance losses due to lock contention. First, we consider using a straightforward strategy based on call stack profiling to attribute idle time and show that it fails to yield insight into lock contention. Second, we consider an approach that builds on a strategy previously used for analyzing idleness in work-stealing computations; we show that this strategy does not yield insight into lock contention. Finally, we propose a new technique for measurement and analysis of lock contention that uses data associated with locks to blame lock holders for the idleness of spinning threads. Our approach incurs ≤ 5% overhead on a quantum chemistry application that makes extensive use of locking (65M distinct locks, a maximum of 340K live locks, and an average of 30K lock acquisitions per second per thread) and attributes lock contention to its full static and dynamic calling contexts. Our strategy, implemented in HPCToolkit, is fully distributed and should scale well to systems with large core counts.
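
The third strategy, blaming lock holders for the idleness of waiting threads, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not HPCToolkit's implementation: the class `BlameLock`, its methods, the `demo` function, and the calling-context tuples are all hypothetical names introduced here, and profiler samples are injected by hand rather than by a timer. The idea shown is that each lock carries a counter; a sample taken in a waiting thread adds to that counter, and the holder collects the accumulated count at release time and charges it to its own calling context.

```python
import threading
from collections import defaultdict


class BlameLock:
    """Hypothetical lock wrapper carrying a count of waiting-thread samples."""

    def __init__(self):
        self._lock = threading.Lock()           # the application-level lock
        self._counter_guard = threading.Lock()  # stands in for an atomic add/swap
        self._waiting_samples = 0               # idleness not yet blamed on a holder

    def try_acquire(self):
        """Non-blocking acquire; returns True when the lock was obtained."""
        return self._lock.acquire(blocking=False)

    def record_waiting_sample(self, n=1):
        """What a sample handler would do when the sampled thread is waiting."""
        with self._counter_guard:
            self._waiting_samples += n

    def release(self, holder_context, blame):
        """Charge pending waiting samples to the holder's context, then unlock."""
        with self._counter_guard:
            pending, self._waiting_samples = self._waiting_samples, 0
        if pending:
            blame[holder_context] += pending
        self._lock.release()


def demo():
    """Deterministic walk-through with hand-injected samples (no real timer)."""
    blame = defaultdict(int)
    lock = BlameLock()

    # A thread acquires the lock inside a hypothetical update_table routine.
    if not lock.try_acquire():
        raise RuntimeError("lock unexpectedly held")
    # While it holds the lock, two waiting threads are sampled 3 and 2 times.
    lock.record_waiting_sample(3)
    lock.record_waiting_sample(2)
    lock.release(("main", "worker", "update_table"), blame)

    # An uncontended critical section accumulates no blame.
    if not lock.try_acquire():
        raise RuntimeError("lock unexpectedly held")
    lock.release(("main", "worker", "read_table"), blame)

    return dict(blame)


if __name__ == "__main__":
    for context, samples in demo().items():
        print(" > ".join(context), samples)
```

In this sketch the blame lands on the context that held the lock while others waited, rather than on the waiting threads' own call stacks. Because the counter lives with the lock and is updated only by threads that touch that lock, no central coordinator is needed, which is consistent with the abstract's description of the approach as fully distributed.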