Integrating non-blocking synchronisation in parallel applications: performance advantages and methodologies

Authors:
Philippas Tsigas; Yi Zhang
Affiliations:
Chalmers University of Technology, Sweden; Chalmers University of Technology, Sweden
Venue:
WOSP '02 Proceedings of the 3rd international workshop on Software and performance
Year:
2002

Abstract

In this paper we investigate how the performance and speedup of applications are affected by using non-blocking rather than blocking synchronisation in parallel systems. The results show that, for many applications, non-blocking synchronisation leads to significant speedups for a fairly large number of processors, while it never slows the applications down. As part of this investigation, the paper also provides a set of efficient and simple translations that show how typical blocking operations found in parallel applications, such as simple locks, queues and lock trees, can be translated into non-blocking equivalents that use hardware primitives common in modern multiprocessor systems. With these translations, the paper demonstrates that it is easy for the application designer or programmer to replace commonly found blocking operations with non-blocking equivalents. For the empirical results, a set of representative applications running on a large-scale ccNUMA machine was used.
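To make the idea of translating a simple lock into a non-blocking equivalent concrete, the following is a minimal sketch and not the translation given in the paper. It assumes C11 atomics as a stand-in for the hardware compare-and-swap primitive, and the names `locked_counter`, `locked_add`, `lockfree_counter` and `lockfree_add` are hypothetical, chosen only for illustration. The first half guards an update with a spin lock; the second half performs the same update with a compare-and-swap retry loop, so a delayed or preempted thread cannot stall the others.

```c
#include <stdatomic.h>

/* Blocking version (illustrative): a test-and-set spin lock guards a plain counter. */
typedef struct {
    atomic_flag lock;   /* must be initialised with ATOMIC_FLAG_INIT */
    long value;
} locked_counter;

void locked_add(locked_counter *c, long delta) {
    /* Spin until the lock is acquired; other threads wait for the holder. */
    while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire)) {
        /* busy-wait */
    }
    c->value += delta;  /* critical section */
    atomic_flag_clear_explicit(&c->lock, memory_order_release);
}

/* Non-blocking version (illustrative): the same update as a CAS retry loop. */
typedef struct {
    _Atomic long value;
} lockfree_counter;

long lockfree_add(lockfree_counter *c, long delta) {
    long old = atomic_load_explicit(&c->value, memory_order_relaxed);
    /* On failure, 'old' is refreshed with the current value, so the desired
     * value old + delta is recomputed on every retry. A failed attempt means
     * some other thread's update succeeded in the meantime. */
    while (!atomic_compare_exchange_weak_explicit(&c->value, &old, old + delta,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed)) {
        /* retry with the refreshed 'old' */
    }
    return old + delta; /* value written by this call */
}
```

A caller would replace each `locked_add(&c, d)` with `lockfree_add(&c, d)`. This pattern only covers updates that fit in a single machine word; queues and lock trees, which the abstract also mentions, need more elaborate constructions that are not shown here.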