Automatically tuned collective communications

Authors:
Sathish S. Vadhiyar;Graham E. Fagg;Jack Dongarra
Affiliations:
Department of Computer & Information Science, National Chiao Tung University, Hsinchu, Taiwan 300, R. O. C.;Department of Electrical and Computer Engineering, University of Wisconsin at Madison, Madison, WI;Computer Science and Engineering, Seoul National University, Seoul, 151-742, Korea
Venue:
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Year:
2000


Abstract

The performance of MPI's collective communications is critical in most MPI-based applications. A general algorithm for a given collective communication operation may not perform well on all systems because of differences in architectures, network parameters and the storage capacity of the underlying MPI implementation. In this paper we discuss an approach in which the collective communications are tuned for a given system by conducting a series of experiments on that system. We also discuss a dynamic topology method that uses the tuned static topology shape but re-orders the logical addresses to compensate for run-time variations. A series of experiments was conducted comparing our tuned collective communication operations to various native vendor MPI implementations. The use of the tuned collective communications resulted in a performance improvement of about 30 percent to 650 percent over the native MPI implementations.
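
To make the idea of experiment-driven tuning concrete, the following is a minimal sketch and not the authors' implementation. It assumes the tuner is given a set of candidate (algorithm, segment size) pairs and a measurement function, and records the fastest candidate for each message size in a decision table. The names `tune` and `synthetic_measure`, the three algorithm labels, and the latency/bandwidth cost model are all hypothetical; on a real system the measurement function would time actual MPI runs instead of evaluating a formula.

```python
from typing import Callable, Dict, Iterable, Tuple

# A candidate is (algorithm name, segment size in bytes). Both are
# illustrative labels chosen for this sketch, not taken from the paper.
Candidate = Tuple[str, int]


def tune(
    message_sizes: Iterable[int],
    candidates: Iterable[Candidate],
    measure: Callable[[Candidate, int], float],
    repeats: int = 3,
) -> Dict[int, Candidate]:
    """Build a decision table: for each message size, keep the candidate
    with the lowest observed time (best of `repeats` measurements)."""
    candidates = list(candidates)
    table: Dict[int, Candidate] = {}
    for size in message_sizes:
        best, best_time = None, float("inf")
        for cand in candidates:
            t = min(measure(cand, size) for _ in range(repeats))
            if t < best_time:  # strict: the first candidate wins ties
                best, best_time = cand, t
        table[size] = best
    return table


def synthetic_measure(
    cand: Candidate,
    size: int,
    procs: int = 8,
    latency: float = 50e-6,
    bandwidth: float = 100e6,
) -> float:
    """Stand-in for a timed experiment (assumption): a simple
    latency + bytes/bandwidth cost per message for a broadcast."""
    algo, seg = cand
    segments = max(1, -(-size // seg))  # ceiling division
    per_msg = latency + min(size, seg) / bandwidth
    if algo == "sequential":
        # Root sends every segment to each other process in turn.
        return (procs - 1) * segments * per_msg
    if algo == "binomial":
        # ceil(log2(procs)) rounds; every segment is forwarded per round.
        rounds = (procs - 1).bit_length()
        return rounds * segments * per_msg
    if algo == "chain":
        # Pipelined chain: fill the pipeline, then one segment per step.
        return (procs - 2 + segments) * per_msg
    raise ValueError(f"unknown algorithm: {algo}")


if __name__ == "__main__":
    algos = ["sequential", "binomial", "chain"]
    segs = [1024, 8192, 65536]
    cands = [(a, s) for a in algos for s in segs]
    for size, choice in tune([64, 1 << 20], cands, synthetic_measure).items():
        print(size, choice)
```

At run time, a tuned collective would look up the current message size (or the nearest measured size) in such a table and dispatch to the recorded algorithm and segment size. The table is only as good as the experiments behind it, which is why it has to be rebuilt per system.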