Distributing Hot-Spot Addressing in Large-Scale Multiprocessors
IEEE Transactions on Computers
An evaluation of directory schemes for cache coherence
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Analysis of cache invalidation patterns in multiprocessors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Using feedback to control tree saturation in multistage interconnection networks
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Ethernet: distributed packet switching for local computer networks
Communications of the ACM
PAX Computer; High-Speed Parallel Processing and Scientific Computing
PAX Computer; High-Speed Parallel Processing and Scientific Computing
The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors
IEEE Transactions on Computers
Snoopy cache test-and-test-and-set without execessive bus contention
ACM SIGARCH Computer Architecture News
Blocking: exploiting spatial locality for trace compaction
SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Counting networks and multi-processor coordination
STOC '91 Proceedings of the twenty-third annual ACM symposium on Theory of computing
Algorithms for scalable synchronization on shared-memory multiprocessors
ACM Transactions on Computer Systems (TOCS)
Synchronization without contention
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Scalable reader-writer synchronization for shared-memory multiprocessors
PPOPP '91 Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming
Low contention load balancing on large-scale multiprocessors
SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
Waiting algorithms for synchronization in large-scale multiprocessors
ACM Transactions on Computer Systems (TOCS)
Fast, scalable synchronization with minimal hardware support
PODC '93 Proceedings of the twelfth annual ACM symposium on Principles of distributed computing
Hot spot analysis in large scale shared memory multiprocessors
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Diffracting trees (preliminary version)
SPAA '94 Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures
Journal of the ACM (JACM)
A combinatorial treatment of balancing networks
PODC '94 Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing
ACM Transactions on Computer Systems (TOCS)
Efficient techniques for fast nested barrier synchronization
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
A combinatorial treatment of balancing networks
Journal of the ACM (JACM)
ACM Transactions on Computer Systems (TOCS)
A steady state analysis of diffracting trees (extended abstract)
Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Combining funnels: a new twist on an old tale…
PODC '98 Proceedings of the seventeenth annual ACM symposium on Principles of distributed computing
Characterizing the Performance of Algorithms for Lock-Free Objects
IEEE Transactions on Computers
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
A scalable lock-free stack algorithm
Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures
Linearizable counting networks
Distributed Computing
A linear-time algorithm for optimal barrier placement
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Using elimination to implement scalable and lock-free FIFO queues
Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures
Adversarial contention resolution for simple channels
Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Self-tuning reactive diffracting trees
Journal of Parallel and Distributed Computing
Efficient self-tuning spin-locks using competitive analysis
Journal of Systems and Software
A scalable lock-free stack algorithm
Journal of Parallel and Distributed Computing
Decoupling contention management from scheduling
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Data structures in the multicore age
Communications of the ACM
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Lock cohorting: a general technique for designing NUMA locks
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Obstruction-Free algorithms can be practically wait-free
DISC'05 Proceedings of the 19th international conference on Distributed Computing
OPODIS'11 Proceedings of the 15th international conference on Principles of Distributed Systems
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Hi-index | 0.02 |
Shared-memory multiprocessors commonly use shared variables for synchronization. Our simulations of real parallel applications show that large-scale cache-coherent multiprocessors suffer significant amounts of invalidation traffic due to synchronization. Large multiprocessors that do not cache synchronization variables are often more severely impacted. If this synchronization traffic is not reduced or managed adequately, synchronization references can cause severe congestion in the network. We propose a class of adaptive back-off methods that do not use any extra hardware and can significantly reduce the memory traffic to synchronization variables. These methods use synchronization state to reduce polling of synchronization variables. Our simulations show that when the number of processors participating in a barrier synchronization is small compared to the time of arrival of the processors, reductions of 20 percent to over 95 percent in synchronization traffic can be achieved at no extra cost. In other situations adaptive backoff techniques result in a tradeoff between reduced network accesses and increased processor idle time.