For collective communication routines to achieve high performance across platforms, they must adapt to the system architecture and use different algorithms in different situations. Current Message Passing Interface (MPI) implementations, such as MPICH and LAM/MPI, do not fully adapt to the system architecture and therefore fail to achieve high performance on many platforms. In this paper, we present a system that produces efficient MPI collective communication routines. By automatically generating topology-specific routines and using an empirical approach to select the best implementations, our system adapts to a given platform and constructs routines customized for it. Experimental results show that the tuned routines consistently achieve high performance on clusters with different network topologies.
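The empirical selection step described above can be illustrated with a small, hypothetical sketch: several candidate "algorithms" for an allgather-style collective are benchmarked on a representative workload, and the fastest one is chosen. This is not the paper's actual system; the function names, the pure-Python stand-ins for ring and recursive-doubling data movement, and the timing harness are all illustrative assumptions, with no real MPI involved.

```python
import time

# Hypothetical candidate implementations of an allgather-style operation,
# modeled as plain Python functions over per-process blocks. They mimic
# the data-movement order of the ring and recursive-doubling algorithms,
# not their actual network behavior.

def allgather_ring(bufs):
    # Ring-style: each process's block is appended in rank order.
    result = []
    for block in bufs:
        result.extend(block)
    return result

def allgather_recursive_doubling(bufs):
    # Recursive doubling: pairwise merges that double the gathered data
    # each round. Assumes the number of blocks is a power of two.
    chunks = [list(b) for b in bufs]
    step = 1
    while step < len(chunks):
        for i in range(0, len(chunks) - step, step * 2):
            chunks[i] = chunks[i] + chunks[i + step]
        step *= 2
    return chunks[0]

def empirical_select(candidates, workload, trials=3):
    """Time each candidate on the workload and return the fastest one's name.

    A real tuning system would benchmark per message size and topology and
    cache the winner; this sketch just picks one winner for one workload.
    """
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        start = time.perf_counter()
        for _ in range(trials):
            fn(workload)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name
```

In a full tuner, `empirical_select` would be run once per (operation, message size, communicator) combination at installation or startup time, and the chosen implementation would be bound into the collective routine for subsequent calls.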