Dynamic cache contention detection in multi-threaded applications

Authors:
Qin Zhao;David Koh;Syed Raza;Derek Bruening;Weng-Fai Wong;Saman Amarasinghe
Affiliations:
Massachusetts Institute of Technology, Cambridge, MA, USA;Massachusetts Institute of Technology, Cambridge, MA, USA;Massachusetts Institute of Technology, Cambridge, MA, USA;Google Inc, Mountain View, CA, USA;National University of Singapore, Singapore, Singapore;Massachusetts Institute of Technology, Cambridge, MA, USA
Venue:
Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Year:
2011

Citing 31
Cited 10

Minimum Distance: A Method for Partitioning Recurrences for Multiprocessors

IEEE Transactions on Computers
Implementation and performance of Munin

SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
The detection and elimination of useless misses in multiprocessors

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Reducing false sharing on shared memory multiprocessors through compile time data transformations

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Eraser: a dynamic data race detector for multithreaded programs

ACM Transactions on Computer Systems (TOCS)
Hoard: a scalable memory allocator for multithreaded applications

ACM SIGPLAN Notices
Reduction of Cache Coherence Overhead by Compiler Data Layout and Loop Transformation

Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Runtime Checking of Multithreaded Applications with Visual Threads

Proceedings of the 7th International SPIN Workshop on SPIN Model Checking and Software Verification
Dynamically Controlling False Sharing in Distributed Shared Memory

HPDC '96 Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computing
An Architecture-Independent Analysis of False Sharing

An Architecture-Independent Analysis of False Sharing
Efficient, transparent, and comprehensive runtime code manipulation

Efficient, transparent, and comprehensive runtime code manipulation
Automatic logging of operating system effects to guide application-level architecture simulation

SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
TaintTrace: Efficient Flow Tracing with Dynamic Binary Rewriting

ISCC '06 Proceedings of the 11th IEEE Symposium on Computers and Communications
LIFT: A Low-Overhead Practical Information Flow Tracking System for Detecting Security Attacks

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Using Valgrind to detect undefined value errors with bit-precision

ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
Valgrind: a framework for heavyweight dynamic binary instrumentation

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Ubiquitous memory introspection

Proceedings of the International Symposium on Code Generation and Optimization
Operating system scheduling for chip multithreaded processors

Operating system scheduling for chip multithreaded processors
Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
False sharing and its effect on shared memory performance

Sedms'93 USENIX Systems on USENIX Experiences with Distributed and Multiprocessor Systems - Volume 4
Evaluating MapReduce for Multi-core and Multiprocessor Systems

HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Thread scheduling for multi-core platforms

HOTOS'07 Proceedings of the 11th USENIX workshop on Hot topics in operating systems
On the Design and Implementation of a Cache-Aware Multicore Real-Time Scheduler

ECRTS '09 Proceedings of the 2009 21st Euromicro Conference on Real-Time Systems
Run-time type checking for binary programs

CC'03 Proceedings of the 12th international conference on Compiler construction
Umbra: efficient and scalable memory shadowing

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
How to do a million watchpoints: efficient debugging using dynamic instrumentation

CC'08/ETAPS'08 Proceedings of the Joint European Conferences on Theory and Practice of Software 17th international conference on Compiler construction
Assessing cache false sharing effects by dynamic binary instrumentation

Proceedings of the Workshop on Binary Instrumentation and Applications
Efficient memory shadowing for 64-bit architectures

Proceedings of the 2010 international symposium on Memory management
CacheIn: a toolset for comprehensive cache inspection

ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part II
Latencies of conflicting writes on contemporary multicore architectures

PaCT'07 Proceedings of the 9th international conference on Parallel Computing Technologies

SHERIFF: precise detection and automatic mitigation of false sharing

Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
Transparent dynamic instrumentation

VEE '12 Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments
Compiling for niceness: mitigating contention for QoS in warehouse scale computers

Proceedings of the Tenth International Symposium on Code Generation and Optimization
SPIRE: improving dynamic binary translation through SPC-indexed indirect branch redirecting

Proceedings of the 9th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Whose cache line is it anyway?: operating system support for live detection and repair of false sharing

Proceedings of the 8th ACM European Conference on Computer Systems
Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers

Proceedings of the 40th Annual International Symposium on Computer Architecture
Detection of false sharing using machine learning

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
Fast dynamic binary translation for the kernel

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
PREDATOR: predictive false sharing detection

Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming

Quantified Score

Hi-index	0.00

Visualization

Abstract

In today's multi-core systems, cache contention due to true and false sharing can cause unexpected and significant performance degradation. A detailed understanding of a given multi-threaded application's behavior is required to precisely identify such performance bottlenecks. Traditionally, however, such diagnostic information can only be obtained after lengthy simulation of the memory hierarchy. In this paper, we present a novel approach that efficiently analyzes interactions between threads to determine thread correlation and detect true and false sharing. It is based on the following key insight: although the slowdown caused by cache contention depends on factors including the thread-to-core binding and parameters of the memory hierarchy, the amount of data sharing is primarily a function of the cache line size and application behavior. Using memory shadowing and dynamic instrumentation, we implemented a tool that obtains detailed sharing information between threads without simulating the full complexity of the memory hierarchy. The runtime overhead of our approach --- a 5x slowdown on average relative to native execution --- is significantly less than that of detailed cache simulation. The information collected allows programmers to identify the degree of cache contention in an application, the correlation among its threads, and the sources of significant false sharing. Using our approach, we were able to improve the performance of some applications up to a factor of 12x. For other contention-intensive applications, we were able to shed light on the obstacles that prevent their performance from scaling to many cores.