Evaluating the performance of non-blocking synchronization on shared-memory multiprocessors

Authors:
Philippas Tsigas;Yi Zhang
Affiliations:
Department of Computing Science, Chalmers University of Technology, SE-412 96 Göteborg, Sweden;Department of Computing Science, Chalmers University of Technology, SE-412 96 Göteborg, Sweden
Venue:
Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Year:
2001

Citing 1
Cited 12

The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture

Integrating non-blocking synchronisation in parallel applications: performance advantages and methodologies

WOSP '02 Proceedings of the 3rd international workshop on Software and performance
Fast and lock-free concurrent priority queues for multi-thread systems

Journal of Parallel and Distributed Computing
Using wait-free synchronization in the design of distributed applications

Future Generation Computer Systems
Multiword atomic read/write registers on multiprocessor systems

Journal of Experimental Algorithmics (JEA)
Non-blocking programming on multi-core graphics processors: (extended abstract)

ACM SIGARCH Computer Architecture News
LFTHREADS: a lock-free thread library

ACM SIGARCH Computer Architecture News
Supporting lock-free composition of concurrent data objects

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
LFTHREADS: a lock-free thread library

OPODIS'07 Proceedings of the 11th international conference on Principles of distributed systems
Supporting lock-free composition of concurrent data objects

Proceedings of the 7th ACM international conference on Computing frontiers
Progress guarantees when composing lock-free objects

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Allocating memory in a lock-free manner

ESA'05 Proceedings of the 13th annual European conference on Algorithms

Abstract

Parallel programs running on shared-memory multiprocessors coordinate via shared data objects and structures. To ensure the consistency of these shared data structures, programs typically rely on some form of software synchronisation. Unfortunately, typical software synchronisation mechanisms usually result in poor performance because they produce large amounts of memory and interconnection network contention and, more significantly, because they produce convoy effects that worsen significantly in multiprogramming environments: if a process holding a lock is preempted, other processes on different processors waiting for the lock cannot proceed. Researchers have introduced non-blocking synchronisation to address these problems. Non-blocking implementations allow multiple tasks to access a shared object at the same time without enforcing mutual exclusion. However, the performance implications of non-blocking synchronisation are not well understood on modern systems or on real applications.

In this paper we study the impact of non-blocking synchronisation on parallel applications running on a modern, 64-processor, cache-coherent, shared-memory multiprocessor system: the SGI Origin 2000. Cache-coherent non-uniform memory access (ccNUMA) shared-memory multiprocessor systems have attracted considerable research and commercial interest in recent years. In addition to the performance results on a modern system, we also investigate the key synchronisation schemes that are used in multiprocessor applications and their efficient transformation to non-blocking ones.

Evaluating the impact of synchronisation performance on applications is important for several reasons. First, micro-benchmarks cannot capture every aspect of primitive performance, and it is hard to predict the impact of a primitive on application performance. For example, a lock or barrier that generates a lot of additional network traffic might have little impact on applications. Second, even in applications that spend significant time in synchronisation operations, the synchronisation time might be dominated by wait time due to load imbalance and lock serialisation in the application, which better synchronisation implementations may not help reduce. Third, micro-benchmarks rarely capture (generate) scenarios that occur in real applications.

We evaluated the benefits of non-blocking synchronisation in a range of applications running on a modern realization of a shared-memory multiprocessor, a 64-processor SGI Origin 2000. In this evaluation, i) we used a large set of applications with different communication characteristics, making sure to include applications that do not spend a lot of time in synchronisation, and ii) we modified all the lock-based synchronisation points of these applications when possible. The goal of our work was to provide an in-depth understanding of how non-blocking synchronisation can improve the performance of modern parallel applications. More specifically, the main issues addressed in this paper include: i) The architectural implications of ccNUMA for the design of non-blocking synchronisation. ii) The identification of the basic locking operations that parallel programmers use in their applications. iii) The efficient non-blocking implementation of these synchronisation operations. iv) The experimental comparison of the lock-based and lock-free versions of the respective applications on a cache-coherent non-uniform memory access shared-memory multiprocessor system. v) The identification of the structural differences between applications that benefit more from non-blocking synchronisation and those that benefit less.

We chose to examine these issues on a 64-processor SGI Origin 2000 multiprocessor system. This machine is attractive for the study because it provides an aggressive communication architecture and support for both in-cache and at-memory synchronisation primitives. It should be clear, however, that the conclusions and methods presented in this paper are generally applicable to other realizations of cache-coherent non-uniform memory access machines. Our results can benefit parallel programmers in two ways: first, by helping them understand the benefits of non-blocking synchronisation, and second, by helping them transform typical lock-based synchronisation operations that are probably used in their programs into non-blocking ones, using the general translations that we provide in this paper.
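
To make the contrast between a lock-based operation and its non-blocking counterpart concrete, the following is a minimal sketch in C11 of a shared counter update written both ways. It is an illustration only and not one of the translations from the paper, which the abstract does not show. All names (locked_counter, lockfree_counter, locked_add, lockfree_add, run_demo) are hypothetical, and the sketch assumes a compiler with <stdatomic.h> and POSIX threads rather than the SGI Origin 2000 primitives.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Lock-based version: if the thread holding the mutex is preempted,
 * every other thread calling locked_add() waits until it runs again. */
typedef struct {
    pthread_mutex_t lock;
    long value;
} locked_counter;

static void locked_add(locked_counter *c, long delta) {
    pthread_mutex_lock(&c->lock);
    c->value += delta;
    pthread_mutex_unlock(&c->lock);
}

/* Lock-free version: read the value, compute the update, and publish it
 * with compare-and-swap. No thread ever holds exclusive ownership. */
typedef struct {
    atomic_long value;
} lockfree_counter;

static void lockfree_add(lockfree_counter *c, long delta) {
    long seen = atomic_load(&c->value);
    /* On failure (including spurious failure of the weak variant) the
     * call stores the current value in `seen`, and the loop retries
     * with a freshly computed `seen + delta`. */
    while (!atomic_compare_exchange_weak(&c->value, &seen, seen + delta)) {
    }
}

/* Hypothetical test harness: several threads update both counters. */
enum { MAX_THREADS = 16 };

typedef struct {
    locked_counter *lk;
    lockfree_counter *lf;
    long iterations;
} worker_arg;

static void *worker(void *p) {
    worker_arg *a = p;
    for (long i = 0; i < a->iterations; i++) {
        locked_add(a->lk, 1);
        lockfree_add(a->lf, 1);
    }
    return NULL;
}

/* Returns 1 if both counters end at nthreads * iterations, else 0. */
static int run_demo(int nthreads, long iterations) {
    assert(nthreads > 0 && nthreads <= MAX_THREADS);

    locked_counter lk;
    lk.value = 0;
    pthread_mutex_init(&lk.lock, NULL);

    lockfree_counter lf;
    atomic_init(&lf.value, 0);

    worker_arg arg = { &lk, &lf, iterations };
    pthread_t tids[MAX_THREADS];
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tids[i], NULL, worker, &arg);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
    }
    pthread_mutex_destroy(&lk.lock);

    long expected = (long)nthreads * iterations;
    return lk.value == expected && atomic_load(&lf.value) == expected;
}
```

Calling run_demo(4, 10000) from a main function exercises both versions under contention. Both produce the same final count; the difference the abstract describes lies in what happens when a thread is delayed in the middle of an update. In the lock-based version the delay blocks all waiters, whereas in the lock-free version other threads continue and the delayed thread's compare-and-swap simply fails and retries.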