Efficient synchronization: let them eat QOLB

Authors:
Alain Kägi;Doug Burger;James R. Goodman
Affiliations:
Computer Sciences Department, University of Wisconsin-Madison, 1210 West Dayton Street, Madison, Wisconsin;Computer Sciences Department, University of Wisconsin-Madison, 1210 West Dayton Street, Madison, Wisconsin;Computer Sciences Department, University of Wisconsin-Madison, 1210 West Dayton Street, Madison, Wisconsin
Venue:
Proceedings of the 24th annual international symposium on Computer architecture
Year:
1997

Citing 31
Cited 38

The Wisconsin multicube: a new large-scale cache-coherent multiprocessor

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Efficient synchronization primitives for large-scale cache-coherent multiprocessors

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
The performance implications of thread management alternatives for shared-memory multiprocessors

SIGMETRICS '89 Proceedings of the 1989 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Synchronization Algorithms for Shared-Memory Multiprocessors

Computer
Algorithms for scalable synchronization on shared-memory multiprocessors

ACM Transactions on Computer Systems (TOCS)
Synchronization without contention

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The Stanford Dash Multiprocessor

Computer
SPLASH: Stanford parallel applications for shared-memory

ACM SIGARCH Computer Architecture News
Performance of the SCI ring

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
The network architecture of the Connection Machine CM-5 (extended abstract)

SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
The Wisconsin Wind Tunnel: virtual prototyping of parallel computers

SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
The Stanford FLASH multiprocessor

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Reactive synchronization algorithms for multiprocessors

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Fine-grain access control for distributed shared memory

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
EEL: machine-independent executable editing

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Techniques for reducing overheads of shared-memory multiprocessing

ICS '95 Proceedings of the 9th international conference on Supercomputing
Decoupled hardware support for distributed shared memory

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Coherent network interfaces for fine-grain communication

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
STiNG: a CC-NUMA computer system for the commercial marketplace

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Shasta: a low overhead, software-only approach for supporting fine-grain shared memory

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Weak ordering—a new definition

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Memory consistency and event ordering in scalable shared-memory multiprocessors

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
A fast algorithm for finding dominators in a flowgraph

ACM Transactions on Programming Languages and Systems (TOPLAS)
Architecture of the VPP500 parallel supercomputer

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Myrinet: A Gigabit-per-Second Local Area Network

IEEE Micro
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Accuracy vs. performance in parallel simulation of interconnection networks

IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
Dynamic decentralized cache schemes for mimd parallel processors

ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture
Efficient Software Synchronization on Large Cache Coherent Multiprocessors

Efficient Software Synchronization on Large Cache Coherent Multiprocessors

Evaluating synchronization on shared address space multiprocessors: methodology and performance

SIGMETRICS '99 Proceedings of the 1999 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A quantitative architectural evaluation of synchronization algorithms and disciplines on ccNUMA systems: the case of the SGI Origin2000

ICS '99 Proceedings of the 13th international conference on Supercomputing
A high-level abstraction of shared accesses

ACM Transactions on Computer Systems (TOCS)
Exploiting Network Locality for CC-NUMA Multiprocessors

The Journal of Supercomputing
A system-on-a-chip lock cache with task preemption support

CASES '01 Proceedings of the 2001 international conference on Compilers, architecture, and synthesis for embedded systems
Speculative lock elision: enabling highly concurrent multithreaded execution

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Integrating non-blocking synchronisation in parallel applications: performance advantages and methodologies

WOSP '02 Proceedings of the 3rd international workshop on Software and performance
An Application-Driven Study of Multicast Communication for Write Invalidation

The Journal of Supercomputing
Transactional lock-free execution of lock-based programs

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
The Architectural and Operating System Implications on the Performance of Synchronization on ccNUMA Multiprocessors

International Journal of Parallel Programming
Efficient synchronization for nonuniform communication architectures

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Improving server software support for simultaneous multithreaded processors

Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Inferential queueing and speculative push for reducing critical communication latencies

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Hierarchical Backoff Locks for Nonuniform Communication Architectures

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Analysis of Shared Memory Misses and Reference Patterns

ICCD '00 Proceedings of the 2000 IEEE International Conference on Computer Design: VLSI in Computers & Processors
The Thread-Based Protocol Engines for CC-NUMA Multiprocessors

ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
The Impact of Negative Acknowledgments in Shared Memory Scientific Applications

IEEE Transactions on Parallel and Distributed Systems
Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance

IEEE Transactions on Knowledge and Data Engineering
A methodology for detailed performance modeling of reduction computations on SMP machines

Performance Evaluation - Performance modelling and evaluation of high-performance parallel and distributed systems
Inferential queueing and speculative push

International Journal of Parallel Programming - Special issue I: The 17th annual international conference on supercomputing (ICS'03)
Power/performance hardware optimization for synchronization intensive applications in MPSoCs

Proceedings of the conference on Design, automation and test in Europe: Proceedings
An efficient synchronization technique for multiprocessor systems on-chip

MEDEA '05 Proceedings of the 2005 workshop on MEmory performance: DEaling with Applications , systems and architecture
Circulating shared-registers for multiprocessor systems

Journal of Systems Architecture: the EUROMICRO Journal
Quantitative performance analysis of the SPEC OMPM2001 benchmarks

Scientific Programming - OpenMP
Efficient self-tuning spin-locks using competitive analysis

Journal of Systems and Software
Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures

Proceedings of the 34th annual international symposium on Computer architecture
Light-weight synchronization for inter-processor communication acceleration on embedded MPSoCs

CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
Synchronization coherence: A transparent hardware mechanism for cache coherence and fine-grained synchronization

Journal of Parallel and Distributed Computing
Extending concurrency of transactional memory programs by using value prediction

Proceedings of the 6th ACM conference on Computing frontiers
Limited early value communication to improve performance of transactional memory

Proceedings of the 23rd international conference on Supercomputing
A Methodology to Characterize Critical Section Bottlenecks in DSM Multiprocessors

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Flexible architectural support for fine-grain scheduling

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Smartlocks: lock acquisition scheduling for self-aware synchronization

Proceedings of the 7th international conference on Autonomic computing
Architectural Support for Fair Reader-Writer Locking

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Efficient synchronization for embedded on-chip multiprocessors

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Speeding-up synchronizations in DSM multiprocessors

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Support for fine-grained synchronization in shared-memory multiprocessors

PaCT'07 Proceedings of the 9th international conference on Parallel Computing Technologies
DeNovoND: efficient hardware support for disciplined non-determinism

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Efficient synchronization primitives are essential for achieving high performance in fine-grain, shared-memory parallel programs. One function of synchronization primitives is to enable exclusive access to shared data and critical sections of code. This paper makes three contributions. (1) We enumerate the five sources of overhead that locking synchronization primitives can incur. (2) We describe four mechanisms (local spinning, queue-based locking, collocation, and synchronized prefetch) that reduce these synchronization overheads. (3) With detailed simulations, we show the extent to which these four mechanisms can improve the performance of shared-memory programs. We evaluate the space of these mechanisms using seventeen synchronization constructs, which are formed from six base typed of locks (TEST&SET, TEST&TEST&SET, MCS, LH, M, and QOLB). We show that large performance gains (speedups of more than 1.5 for three of five benchmarks) can be achieved if at least three optimizing mechanisms are used simultaneously. We find that QOLB, which incorporates all four mechanisms, outperforms all other primitives (including reactive synchronization) in all cases. Finally, we demonstrate the superior performance of a low-cost implementation of QOLB, which runs on an unmodified cluster of commodity workstations.