Revisiting the combining synchronization technique

Authors:
Panagiota Fatourou;Nikolaos D. Kallimanis
Affiliations:
Department of Computer Science University of Crete & FORTH ICS, Heraklion, Greece;Department of Computer Science University of Ioannina, Ioannina, Greece
Venue:
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Year:
2012

Abstract

Fine-grain thread synchronization has been shown, in several cases, to be outperformed by efficient implementations of the combining technique, in which a single thread, called the combiner, holds a coarse-grain lock and serves, in addition to its own synchronization request, the active requests announced by other threads while they wait by performing some form of spinning. Efficient implementations of this technique significantly reduce the cost of synchronization, so in many cases they exhibit much better performance than the most efficient finely synchronized algorithms. In this paper, we revisit the combining technique with the goal of discovering where its real performance power resides and whether or how ensuring some desired properties (e.g., fairness in serving requests) would impact performance. We do so by presenting two new implementations of this technique; the first (CC-Synch) addresses systems that support coherent caches, whereas the second (DSM-Synch) works better in cacheless NUMA machines. In comparison to previous such implementations, the new implementations (1) provide bounds on the number of remote memory references (RMRs) that they perform, (2) support a stronger notion of fairness, and (3) use simpler and less basic primitives than previous approaches. In all our experiments, the new implementations outperform by far all previous state-of-the-art combining-based and fine-grain synchronization algorithms. Our experimental analysis sheds light on the questions we aimed to answer.
Several modern multi-core systems organize the cores into clusters and provide fast communication within the same cluster and much slower communication across clusters. We present a hierarchical version of CC-Synch, called H-Synch, which exploits the hierarchical communication nature of such systems to achieve better performance. Experiments show that H-Synch significantly outperforms previous state-of-the-art hierarchical approaches.
We provide new implementations of common shared data structures (like stacks and queues) based on CC-Synch, DSM-Synch and H-Synch. Our experiments show that these implementations outperform by far all previous (fine-grain or combining-based) implementations of shared stacks and queues.
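
The general combining idea described in the abstract (threads announce requests, and whichever thread holds the coarse-grain lock serves all pending ones while the others spin) can be illustrated with a minimal sketch. This is not CC-Synch, DSM-Synch or H-Synch, whose details are not given here; it is a generic announce-array scheme, and the names `CombiningCounter`, `fetch_and_add` and `run_demo` are hypothetical. It also assumes CPython, where assignments to list slots become visible to other threads in program order.

```python
import threading


class CombiningCounter:
    """Generic combining sketch (an illustration, not the paper's algorithms)."""

    def __init__(self, n_threads):
        self.lock = threading.Lock()        # the coarse-grain lock
        self.value = 0                      # state of the shared object
        self.requests = [None] * n_threads  # announce array: one slot per thread
        self.results = [None] * n_threads   # responses written by the combiner

    def fetch_and_add(self, tid, amount):
        self.requests[tid] = amount         # announce the request
        # Spin until some combiner (possibly this thread) has served it.
        while self.requests[tid] is not None:
            if self.lock.acquire(blocking=False):
                try:
                    self._combine()         # this thread is now the combiner
                finally:
                    self.lock.release()
        return self.results[tid]

    def _combine(self):
        # Serve every announced request, including the combiner's own.
        for i, amount in enumerate(self.requests):
            if amount is not None:
                self.results[i] = self.value  # response: value before the add
                self.value += amount
                self.requests[i] = None       # mark as served (after the response)


def run_demo(n_threads, ops_per_thread):
    """Run n_threads workers; return the final value and all sorted responses."""
    counter = CombiningCounter(n_threads)
    seen = []
    seen_lock = threading.Lock()

    def worker(tid):
        local = [counter.fetch_and_add(tid, 1) for _ in range(ops_per_thread)]
        with seen_lock:
            seen.extend(local)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value, sorted(seen)
```

Because all requests are applied sequentially by the lock holder, each call in this sketch returns a distinct pre-increment value. The sketch makes no attempt to bound remote memory references or to guarantee fairness, which are the properties the abstract attributes to the new implementations.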