Algorithms for scalable synchronization on shared-memory multiprocessors
ACM Transactions on Computer Systems (TOCS)
LogP: towards a realistic model of parallel computation
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems
IEEE Transactions on Parallel and Distributed Systems
Optimization of MPI collective communication on BlueGene/L systems
Proceedings of the 19th annual international conference on Supercomputing
Performance without pain = productivity: data layout and collective communication in UPC
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Optimization of sparse matrix-vector multiplication on emerging multicore platforms
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Efficient RDMA-based multi-port collectives on multi-rail QsNetII clusters
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Optimizing bandwidth limited problems using one-sided communication and overlap
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Minimizing communication in sparse matrix solvers
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Composing parallel software efficiently with lithe
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Optimizing UPC programs for multi-core systems
Scientific Programming - Exploring Languages for Expressing Medium to Massive On-Chip Parallelism
Juggle: proactive load balancing on multicore computers
Proceedings of the 20th international symposium on High performance distributed computing
Minimizing synchronizations in sparse iterative solvers for distributed supercomputers
Computers & Mathematics with Applications
Hi-index | 0.00 |
As the gap in performance between the processors and the memory systems continue to grow, the communication component of an application will dictate the overall application performance and scalability. Therefore it is useful to abstract common communication operations across cores as collective communication operations and tune them through a runtime library that can employ sophisticated automatic tuning techniques. Our focus of this paper is on collective communication in Partitioned Global Address Space languages which are a natural extension of the shared memory hardware of modern multicore systems. In particular we highlight how automatic tuning can lead to significant performance improvements and show how loosening the synchronization semantics of a collective can lead to a more efficient use of the memory system. We demonstrate that loosely synchronized collectives can realize consistent 3× speedups over their strictly synchronized counterparts on the highly threaded Sun Niagara2 for message sizes ranging from 8 bytes to 64kB. We thus argue that the synchronization requirements for a collective must be exposed in the interface so the collective and the synchronization can be optimized together.