Whose cache line is it anyway?: operating system support for live detection and repair of false sharing

Authors:
Mihir Nanavati;Mark Spear;Nathan Taylor;Shriram Rajagopalan;Dutch T. Meyer;William Aiello;Andrew Warfield
Affiliations:
University of British Columbia;University of British Columbia;University of British Columbia;University of British Columbia;University of British Columbia;University of British Columbia;University of British Columbia
Venue:
Proceedings of the 8th ACM European Conference on Computer Systems
Year:
2013


Abstract

As hardware parallelism continues to increase, CPU caches can no longer be treated as a transparent, hardware-level performance optimization. The impact of the cache on performance, particularly under false sharing, depends entirely on the software that is executing. To effectively support parallel workloads on cache-coherent hardware, the operating system must begin to treat the CPU cache like other shared hardware resources and manage it appropriately. We demonstrate a prototype of such support by describing Plastic, a software-based system that detects, diagnoses, and transparently repairs false sharing as it occurs in running applications. Plastic solves two challenging problems. First, it is capable of rapid, low-overhead detection and diagnosis of false sharing in unmodified, running applications. Second, it resolves identified instances of false sharing by providing a sub-page granularity memory remapping facility within the system. Our implementation is capable of identifying and repairing pathological false sharing in under one second of execution and achieves speedups of 3-6x on known examples of false sharing in parallel benchmarks.
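
To make the term concrete: false sharing arises when logically independent variables, written by different cores, happen to occupy the same cache line, so the coherence protocol bounces the line between cores even though no data is actually shared. The sketch below is only an illustration of that layout condition, not Plastic's mechanism (the abstract describes runtime sub-page remapping, whereas this shows conventional static padding). The 64-byte line size, the struct names, and the helper functions are all assumptions made for the example.

```python
import ctypes

# Assumed cache-line size in bytes (typical on x86; not stated in the abstract).
CACHE_LINE = 64


class PackedCounters(ctypes.Structure):
    # Hypothetical layout: two per-thread counters placed back to back,
    # so both land on the same cache line.
    _fields_ = [("a", ctypes.c_uint64), ("b", ctypes.c_uint64)]


class PaddedCounters(ctypes.Structure):
    # Same counters, with explicit padding so that "b" starts on the next line.
    _fields_ = [
        ("a", ctypes.c_uint64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
        ("b", ctypes.c_uint64),
    ]


def line_index(offset, line_size=CACHE_LINE):
    """Index of the cache line containing a byte offset."""
    return offset // line_size


def falsely_shareable(struct_type, field1, field2):
    """True if two fields fall on the same cache line.

    Assumes the structure itself starts on a cache-line boundary.
    """
    off1 = getattr(struct_type, field1).offset
    off2 = getattr(struct_type, field2).offset
    return line_index(off1) == line_index(off2)
```

If one thread updates `a` while another updates `b`, the packed layout is a candidate for false sharing and the padded one is not. Static padding of this kind requires source changes and knowledge of the line size, which is the sort of manual fix a live, transparent repair system aims to make unnecessary.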