The performance of MPI collective communications is critical in most MPI-based applications. A general algorithm for a given collective communication operation may not perform well on all systems because of differences in architecture, network parameters, and the storage capacity of the underlying MPI implementation. In this paper we discuss an approach in which the collective communications are tuned for a given system by conducting a series of experiments on that system. We also discuss a dynamic topology method that uses the tuned static topology shape but re-orders the logical addresses to compensate for run-time variations. A series of experiments was conducted comparing our tuned collective communication operations with various native vendor MPI implementations. The tuned collective communications improved performance by about 30 percent to 650 percent over the native MPI implementations.
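The experiment-driven tuning described above can be illustrated with a minimal sketch. It is not the paper's implementation: the three broadcast variants, the latency/bandwidth cost formulas, the parameter values, and all function names are assumptions made for illustration. A real tuner would time actual MPI calls on the target system rather than evaluate a model.

```python
import math
from typing import Callable, Dict, Iterable

# Hypothetical system parameters (assumptions, not measured values).
ALPHA = 50e-6    # per-message latency in seconds
BETA = 1e-8      # transfer time per byte in seconds
PROCS = 16       # number of processes in the broadcast
SEGMENT = 8192   # segment size in bytes for the pipelined variant


def linear_bcast(size: int) -> float:
    """Root sends the whole message to each of the other processes in turn."""
    return (PROCS - 1) * (ALPHA + BETA * size)


def binomial_bcast(size: int) -> float:
    """Binomial tree: the set of informed processes doubles every round."""
    return math.ceil(math.log2(PROCS)) * (ALPHA + BETA * size)


def pipelined_chain_bcast(size: int) -> float:
    """Chain of processes forwarding fixed-size segments in a pipeline."""
    segments = math.ceil(size / SEGMENT)
    segment_bytes = min(SEGMENT, size)
    return (PROCS - 2 + segments) * (ALPHA + BETA * segment_bytes)


def tune(candidates: Dict[str, Callable[[int], float]],
         message_sizes: Iterable[int]) -> Dict[int, str]:
    """Run every candidate at every message size and keep the fastest.

    Each candidate returns a time in seconds for one broadcast of `size`
    bytes. Here that time comes from the cost model above; in a real
    experiment it would be a wall-clock measurement.
    """
    table = {}
    for size in message_sizes:
        table[size] = min(candidates, key=lambda name: candidates[name](size))
    return table


CANDIDATES = {
    "linear": linear_bcast,
    "binomial": binomial_bcast,
    "pipelined_chain": pipelined_chain_bcast,
}

if __name__ == "__main__":
    for size, best in tune(CANDIDATES, [1024, 65536, 4 * 1024 * 1024]).items():
        print(f"{size:>8} bytes -> {best}")
```

The resulting table maps each message size to the variant that was fastest in the experiments, so the choice of algorithm becomes a lookup at call time. Under these assumed parameters, different variants win at different message sizes, which is the reason a single general algorithm may not suit every system.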