Distributing Hot-Spot Addressing in Large-Scale Multiprocessors
IEEE Transactions on Computers
Efficient synchronization primitives for large-scale cache-coherent multiprocessors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Linearizability: a correctness condition for concurrent objects
ACM Transactions on Programming Languages and Systems (TOPLAS)
ACM Transactions on Programming Languages and Systems (TOPLAS)
Algorithms for scalable synchronization on shared-memory multiprocessors
ACM Transactions on Computer Systems (TOCS)
SIGMETRICS '91 Proceedings of the 1991 ACM SIGMETRICS conference on Measurement and modeling of computer systems
ACM Transactions on Computer Systems (TOCS)
Simple, fast, and practical non-blocking and blocking concurrent queue algorithms
PODC '96 Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing
Combining funnels: a dynamic approach to software combining
Journal of Parallel and Distributed Computing
Hoard: a scalable memory allocator for multithreaded applications
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Queue Locks on Cache Coherent Multiprocessors
Proceedings of the 8th International Symposium on Parallel Processing
Hierarchical Backoff Locks for Nonuniform Communication Architectures
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
A scalable lock-free stack algorithm
Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures
Validity of the single processor approach to achieving large scale computing capabilities
AFIPS '67 (Spring) Proceedings of the April 18-20, 1967, spring joint computer conference
Flat combining and the synchronization-parallelism tradeoff
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
A highly-efficient wait-free universal construction
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Fast concurrent queues for x86 processors
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Reducing contention through priority updates
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Brief announcement: an asymmetric flat-combining based queue algorithm
Proceedings of the 2013 ACM symposium on Principles of distributed computing
Reducing contention through priority updates
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Can lock-free and combining techniques co-exist?: a novel approach on concurrent queue
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Lightweight contention management for efficient compare-and-swap operations
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Leveraging hardware message passing for efficient thread synchronization
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Hi-index | 0.00 |
Fine-grain thread synchronization has been proved, in several cases, to be outperformed by efficient implementations of the combining technique where a single thread, called the combiner, holding a coarse-grain lock, serves, in addition to its own synchronization request, active requests announced by other threads while they are waiting by performing some form of spinning. Efficient implementations of this technique significantly reduce the cost of synchronization, so in many cases they exhibit much better performance than the most efficient finely synchronized algorithms. In this paper, we revisit the combining technique with the goal to discover where its real performance power resides and whether or how ensuring some desired properties (e.g., fairness in serving requests) would impact performance. We do so by presenting two new implementations of this technique; the first (CC-Synch) addresses systems that support coherent caches, whereas the second (DSM-Synch) works better in cacheless NUMA machines. In comparison to previous such implementations, the new implementations (1) provide bounds on the number of remote memory references (RMRs) that they perform, (2) support a stronger notion of fairness, and (3) use simpler and less basic primitives than previous approaches. In all our experiments, the new implementations outperform by far all previous state-of-the-art combining-based and fine-grain synchronization algorithms. Our experimental analysis sheds light to the questions we aimed to answer. Several modern multi-core systems organize the cores into clusters and provide fast communication within the same cluster and much slower communication across clusters. We present an hierarchical version of CC-Synch, called H-Synch which exploits the hierarchical communication nature of such systems to achieve better performance. Experiments show that H-Synch significantly outper forms previous state-of-the-art hierarchical approaches. We provide new implementations of common shared data structures (like stacks and queues) based on CC-Synch, DSM-Synch and H-Synch. Our experiments show that these implementations outperform by far all previous (fine-grain or combined-based) implementations of shared stacks and queues.