Exploiting deferred destruction: an analysis of read-copy-update techniques in operating system kernels

Authors:
Paul E. McKenney;Jonathan Walpole
Affiliations:
-;-
Venue:
Exploiting deferred destruction: an analysis of read-copy-update techniques in operating system kernels
Year:
2004

Citing 0
Cited 25

Reordering constraints for pthread-style locks

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Performance of memory reclamation for lockless synchronization

Journal of Parallel and Distributed Computing
Why the grass may not be greener on the other side: a comparison of locking vs. transactional memory

Proceedings of the 4th workshop on Programming languages and operating systems
Introducing technology into the Linux kernel: a case study

ACM SIGOPS Operating Systems Review - Research and developments in the Linux kernel
Garbage collection in the next C++ standard

Proceedings of the 2009 international symposium on Memory management
Operating System Transactions

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
The read-copy-update mechanism for supporting real-time applications on shared-memory multiprocessor systems with Linux

IBM Systems Journal
NOrec: streamlining STM by abolishing ownership records

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Why the grass may not be greener on the other side: a comparison of locking vs. transactional memory

ACM SIGOPS Operating Systems Review
Scalable concurrent hash tables via relativistic programming

ACM SIGOPS Operating Systems Review
Transactional mutex locks

Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Synchronization for fast and reentrant operating system kernel tracing

Software—Practice & Experience - Focus on Selected PhD Literature Reviews in the Practical Aspects of Software Technology
Making lockless synchronization fast: performance implications of memory reclamation

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
A relativistic enhancement to software transactional memory

HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
Resizable, scalable, concurrent hash tables via relativistic programming

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
Scalable address spaces using RCU balanced trees

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Can seqlocks get along with programming language memory models?

Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Position paper: nondeterminism is unavoidable, but data races are pure evil

Proceedings of the 2012 ACM workshop on Relaxing synchronization for multicore and manycore scalability
A case for relativistic programming

Proceedings of the 2012 ACM workshop on Relaxing synchronization for multicore and manycore scalability
Lockless multi-core high-throughput buffering scheme for kernel tracing

ACM SIGOPS Operating Systems Review
Verifying concurrent memory reclamation algorithms with grace

ESOP'13 Proceedings of the 22nd European conference on Programming Languages and Systems
Nonblocking algorithms and scalable multicore programming

Communications of the ACM
Nonblocking Algorithms and Scalable Multicore Programming

Queue - Concurrency
MSWIM demo abstract: direct code execution: increase simulation realism using unmodified real implementations

Proceedings of the 11th ACM international symposium on Mobility management and wireless access
Towards a scalable microkernel personality for multicore processors

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing

Abstract

The Moore's-Law-driven performance of simple instructions has improved by orders of magnitude over the past two decades, but shared-memory multiprocessor (SMMP) synchronization operations have not kept pace. SMMP software uses synchronization operations heavily, thus suffering degraded performance and scalability. As a result, many traditional SMMP algorithms are now obsolete. This dissertation presents read-copy update (RCU), a reader-writer synchronization mechanism in which read-side critical sections incur virtually zero synchronization overhead, thereby achieving near-ideal performance for read-mostly workloads. Write-side critical sections incur substantial synchronization overhead, deferring destruction and maintaining multiple versions of data structures in order to accommodate the synchronization-free read-side critical sections. In addition, writers use some mechanism, such as locking, to ensure orderly updates. Readers provide a signal enabling writers to determine when it is safe to complete destructive operations, but this signal may be deferred, permitting a single signal operation to serve multiple read-side RCU critical sections. These read-side signals are observed by a specialized garbage collector, which carries out destructive operations once it is safe to do so. Garbage collectors are typically implemented in a manner similar to a barrier computation. Production-quality garbage collectors batch destructive operations, amortizing signal-observation overhead over many updates. Although RCU is not itself new, its use has been quite specialized. This dissertation rectifies this situation by showing how RCU can be implemented efficiently in operating system kernels, by demonstrating its system-level performance and complexity benefits, and by providing a set of design patterns that make RCU more generally applicable. 
This dissertation compares RCU to traditional synchronization mechanisms, including locking and non-blocking synchronization, using both analytic and empirical methods. The empirical methods include both informal micro-benchmarks and formal system-level benchmarks. These benchmarks show performance benefits ranging from tens of percent to an order of magnitude and little or no increase in code complexity. Finally, this dissertation demonstrates that RCU has practical value by (1) outlining its use in several production systems, two of which have seen extensive datacenter use, one of which this author designed and implemented, and (2) documenting its widespread use in the Linux 2.6 kernel.
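
The mechanism summarized in the abstract (lock-free reads, updaters that publish a new version and defer destruction of the old one, deferred reader signals, and a barrier-like collector that batches destruction) can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is single-threaded and deterministic, it is not the dissertation's implementation or the Linux kernel API, and every name in it (`ToyRCU`, `read`, `update`, `quiescent`, `collect`) is hypothetical and chosen for illustration.

```python
class ToyRCU:
    """Single-threaded toy model of deferred destruction (not a real RCU)."""

    def __init__(self, readers, initial):
        self.current = initial                # the currently published version
        self.gen = 0                          # incremented on every update
        self.seen = {r: 0 for r in readers}   # last generation each reader signalled
        self.deferred = []                    # (generation, old_version) awaiting destruction
        self.destroyed = []                   # versions already reclaimed

    def read(self):
        # Read side: a plain load, with no lock and no write to shared state.
        return self.current

    def update(self, new_version):
        # Write side: publish the new version, then defer destruction of the old one.
        # A real updater would also hold a lock (or similar) to order updates.
        old = self.current
        self.current = new_version
        self.gen += 1
        self.deferred.append((self.gen, old))

    def quiescent(self, reader):
        # Reader signal: "I hold no references obtained before this point."
        # One signal covers every read-side critical section that preceded it.
        self.seen[reader] = self.gen

    def collect(self):
        # Collector: a barrier-like check over all reader signals,
        # followed by batched destruction of everything that is now safe.
        safe = min(self.seen.values(), default=self.gen)
        ready = [v for g, v in self.deferred if g <= safe]
        self.deferred = [(g, v) for g, v in self.deferred if g > safe]
        self.destroyed.extend(ready)
        return ready


rcu = ToyRCU(readers=["r0", "r1"], initial={"route": "A"})
snapshot = rcu.read()                 # some reader may still hold the old version
rcu.update({"route": "B"})            # old version is retired, not destroyed
rcu.quiescent("r1")
pending = rcu.collect()               # r0 has not signalled, so nothing is reclaimed
rcu.quiescent("r0")
reclaimed = rcu.collect()             # every reader has signalled; old version is reclaimed
```

In this model an old version is reclaimed only after every registered reader has signalled at or after the update that retired it, which mirrors the abstract's point that readers pay almost nothing while updaters and the collector absorb the cost. Several retired versions can be reclaimed by a single `collect` call, which is the batching the abstract attributes to production-quality collectors.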