Multiprocessor cache synchronization: issues, innovations, evolution
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Efficient synchronization primitives for large-scale cache-coherent multiprocessors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Algorithms for scalable synchronization on shared-memory multiprocessors
ACM Transactions on Computer Systems (TOCS)
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors
Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
The Stanford Dash Multiprocessor
Computer
SPLASH: Stanford parallel applications for shared-memory
ACM SIGARCH Computer Architecture News
Cooperative shared memory: software and hardware for scalable multiprocessors
ACM Transactions on Computer Systems (TOCS)
Adaptive cache coherency for detecting migratory shared data
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
An adaptive cache coherence protocol optimized for migratory sharing
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Reactive synchronization algorithms for multiprocessors
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
A compiler algorithm that reduces read latency in ownership-based cache coherence protocols
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
STiNG: a CC-NUMA computer system for the commercial marketplace
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Efficient synchronization: let them eat QOLB
Proceedings of the 24th annual international symposium on Computer architecture
The SGI Origin: a ccNUMA highly scalable server
Proceedings of the 24th annual international symposium on Computer architecture
Memory system characterization of commercial workloads
Proceedings of the 25th annual international symposium on Computer architecture
Performance characterization of a Quad Pentium Pro SMP using OLTP workloads
Proceedings of the 25th annual international symposium on Computer architecture
Performance of database workloads on shared-memory systems with out-of-order processors
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Memory consistency and event ordering in scalable shared-memory multiprocessors
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Selective, accurate, and timely self-invalidation using last-touch prediction
Proceedings of the 27th annual international symposium on Computer architecture
Architecture and design of AlphaServer GS320
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
IEEE Standard for Scalable Coherent Interface, Science: IEEE Std. 1596-1992
IEEE Standard for Scalable Coherent Interface, Science: IEEE Std. 1596-1992
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
The evolution of the HP/Convex Exemplar
COMPCON '97 Proceedings of the 42nd IEEE International Computer Conference
Lockup-free instruction fetch/prefetch cache organization
ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Dynamic decentralized cache schemes for mimd parallel processors
ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture
An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Improving CC-NUMA Performance Using Instruction-Based Prediction
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
(R) The Impact of Speeding up Critical Sections with Data Prefetching and Forwarding
ICPP '96 Proceedings of the Proceedings of the 1996 International Conference on Parallel Processing - Volume 3
Inferential queueing and speculative push
International Journal of Parallel Programming - Special issue I: The 17th annual international conference on supercomputing (ICS'03)
Support for High-Frequency Streaming in CMPs
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Speeding-up synchronizations in DSM multiprocessors
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Hi-index | 0.00 |
Communication latencies within critical sections constitute a major bottleneck in some classes of emerging parallel workloads. In this paper, we argue for the use of Inferentially Queued Locks (IQLs) [31], not just for efficient synchronization but also for reducing communication latencies, and we propose a novel mechanism, Speculative Push (SP), aimed at reducing these communication latencies. With IQLs, the processor infers the existence, and limits, of a critical section from the use of synchronization instructions and joins a queue of lock requestors. The SP mechanism extracts information about program structure by observing IQLs. SP allows the cache controller, responding to a request for a cache line that likely includes a lock variable, to predict the data sets the requestor will modify within the associated critical section. The controller then pushes these lines from its own cache to the target cache, as well as writing them to memory. Overlapping the protected data transfer with that of the lock can substantially reduce the communication latencies within critical sections. By pushing data in exclusive state, the mechanism can collapse a read-modify-write sequences within a critical section into a single local cache access. The write-back to memory allows the receiving cache to ignore the push. Neither mechanism requires any programmer or compiler support nor any instruction set changes. Our experiments demonstrate that IQLs and SP can improve performance of applications employing frequent synchronization.