Optimizing collective communication on multicores

Authors:
Rajesh Nishtala;Katherine A. Yelick
Affiliations:
University of California, Berkeley;University of California, Berkeley
Venue:
HotPar'09 Proceedings of the First USENIX conference on Hot topics in parallelism
Year:
2009

Citing 10
Cited 7

Algorithms for scalable synchronization on shared-memory multiprocessors

ACM Transactions on Computer Systems (TOCS)
LogP: towards a realistic model of parallel computation

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems

IEEE Transactions on Parallel and Distributed Systems
Optimization of MPI collective communication on BlueGene/L systems

Proceedings of the 19th annual international conference on Supercomputing
Performance without pain = productivity: data layout and collective communication in UPC

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
MPI Collectives on Modern Multicore Clusters: Performance Optimizations and Communication Characteristics

CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Efficient RDMA-based multi-port collectives on multi-rail QsNetII clusters

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Optimizing bandwidth limited problems using one-sided communication and overlap

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Minimizing communication in sparse matrix solvers

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Load balancing on speed

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Composing parallel software efficiently with lithe

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Optimizing UPC programs for multi-core systems

Scientific Programming - Exploring Languages for Expressing Medium to Massive On-Chip Parallelism
Juggle: proactive load balancing on multicore computers

Proceedings of the 20th international symposium on High performance distributed computing
Juggle: addressing extrinsic load imbalances in SPMD applications on multicore computers

Cluster Computing
Minimizing synchronizations in sparse iterative solvers for distributed supercomputers

Computers & Mathematics with Applications

Quantified Score

Hi-index	0.00

Visualization

Abstract

As the gap in performance between the processors and the memory systems continue to grow, the communication component of an application will dictate the overall application performance and scalability. Therefore it is useful to abstract common communication operations across cores as collective communication operations and tune them through a runtime library that can employ sophisticated automatic tuning techniques. Our focus of this paper is on collective communication in Partitioned Global Address Space languages which are a natural extension of the shared memory hardware of modern multicore systems. In particular we highlight how automatic tuning can lead to significant performance improvements and show how loosening the synchronization semantics of a collective can lead to a more efficient use of the memory system. We demonstrate that loosely synchronized collectives can realize consistent 3× speedups over their strictly synchronized counterparts on the highly threaded Sun Niagara2 for message sizes ranging from 8 bytes to 64kB. We thus argue that the synchronization requirements for a collective must be exposed in the interface so the collective and the synchronization can be optimized together.