Inferential queueing and speculative push for reducing critical communication latencies

Authors:
Ravi Rajwar;Alain Kägi;James R. Goodman
Affiliations:
Intel Corporation, Hillsboro, OR;Intel Corporation, Hillsboro, OR;University of Wisconsin-Madison, Madison, WI
Venue:
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Year:
2003

Citing 31
Cited 3

Multiprocessor cache synchronization: issues, innovations, evolution

ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Efficient synchronization primitives for large-scale cache-coherent multiprocessors

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Synchronization Algorithms for Shared-Memory Multiprocessors

Computer
Algorithms for scalable synchronization on shared-memory multiprocessors

ACM Transactions on Computer Systems (TOCS)
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors

Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
The Stanford Dash Multiprocessor

Computer
SPLASH: Stanford parallel applications for shared-memory

ACM SIGARCH Computer Architecture News
Cooperative shared memory: software and hardware for scalable multiprocessors

ACM Transactions on Computer Systems (TOCS)
Adaptive cache coherency for detecting migratory shared data

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
An adaptive cache coherence protocol optimized for migratory sharing

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Reactive synchronization algorithms for multiprocessors

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
A compiler algorithm that reduces read latency in ownership-based cache coherence protocols

PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
STiNG: a CC-NUMA computer system for the commercial marketplace

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Efficient synchronization: let them eat QOLB

Proceedings of the 24th annual international symposium on Computer architecture
The SGI Origin: a ccNUMA highly scalable server

Proceedings of the 24th annual international symposium on Computer architecture
Memory system characterization of commercial workloads

Proceedings of the 25th annual international symposium on Computer architecture
Performance characterization of a Quad Pentium Pro SMP using OLTP workloads

Proceedings of the 25th annual international symposium on Computer architecture
Performance of database workloads on shared-memory systems with out-of-order processors

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Memory consistency and event ordering in scalable shared-memory multiprocessors

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Selective, accurate, and timely self-invalidation using last-touch prediction

Proceedings of the 27th annual international symposium on Computer architecture
Architecture and design of AlphaServer GS320

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
IEEE Standard for Scalable Coherent Interface, Science: IEEE Std. 1596-1992

IEEE Standard for Scalable Coherent Interface, Science: IEEE Std. 1596-1992
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
The evolution of the HP/Convex Exemplar

COMPCON '97 Proceedings of the 42nd IEEE International Computer Conference
Lockup-free instruction fetch/prefetch cache organization

ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Dynamic decentralized cache schemes for mimd parallel processors

ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture
An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors

HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Improving CC-NUMA Performance Using Instruction-Based Prediction

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
(R) The Impact of Speeding up Critical Sections with Data Prefetching and Forwarding

ICPP '96 Proceedings of the Proceedings of the 1996 International Conference on Parallel Processing - Volume 3

Inferential queueing and speculative push

International Journal of Parallel Programming - Special issue I: The 17th annual international conference on supercomputing (ICS'03)
Support for High-Frequency Streaming in CMPs

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Speeding-up synchronizations in DSM multiprocessors

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Communication latencies within critical sections constitute a major bottleneck in some classes of emerging parallel workloads. In this paper, we argue for the use of Inferentially Queued Locks (IQLs) [31], not just for efficient synchronization but also for reducing communication latencies, and we propose a novel mechanism, Speculative Push (SP), aimed at reducing these communication latencies. With IQLs, the processor infers the existence, and limits, of a critical section from the use of synchronization instructions and joins a queue of lock requestors. The SP mechanism extracts information about program structure by observing IQLs. SP allows the cache controller, responding to a request for a cache line that likely includes a lock variable, to predict the data sets the requestor will modify within the associated critical section. The controller then pushes these lines from its own cache to the target cache, as well as writing them to memory. Overlapping the protected data transfer with that of the lock can substantially reduce the communication latencies within critical sections. By pushing data in exclusive state, the mechanism can collapse a read-modify-write sequences within a critical section into a single local cache access. The write-back to memory allows the receiving cache to ignore the push. Neither mechanism requires any programmer or compiler support nor any instruction set changes. Our experiments demonstrate that IQLs and SP can improve performance of applications employing frequent synchronization.