Because communication costs are non-uniform in modern parallel computers, mapping virtual parallel processes to physical processors (or cores) in an optimized way is an important problem for achieving scalable performance. Existing work uses profile-guided approaches to automatically derive mapping schemes that minimize the cost of point-to-point communications. However, these approaches cannot handle collective communications and may therefore produce sub-optimal mappings for applications that use them. In this paper, we propose OPP (Optimized Process Placement), an approach that handles collective communications by transforming each collective into the series of point-to-point operations that the communication library actually performs to implement it. Existing mapping approaches can then be used to find placement schemes that are optimized for both point-to-point and collective communications. We evaluated our approach with micro-benchmarks covering all MPI collective communications, the NAS Parallel Benchmark suite, and three other applications. Experimental results show that the process placements generated by our approach achieve significant speedups.
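
To make the decomposition idea concrete, the sketch below (plain C, not taken from the paper) replays offline the point-to-point messages induced by a binomial-tree MPI_Bcast, the small-message broadcast algorithm used by MPICH-style libraries; the process count, root rank, and message size are illustrative assumptions, as is the traffic-matrix representation.

/* Sketch: point-to-point traffic induced by a binomial-tree broadcast.
 * In rank space rotated so the root is 0, each non-root process receives
 * exactly one message, from the process whose relative rank differs in
 * the lowest set bit. */
#include <stdio.h>

#define NPROCS 8          /* assumed process count */

int main(void) {
    static long traffic[NPROCS][NPROCS] = {{0}};  /* bytes sent i -> j */
    int root = 0;                                 /* assumed root rank */
    long msg_bytes = 1024;                        /* assumed payload size */

    for (int rel = 1; rel < NPROCS; rel++) {
        int lowbit = rel & -rel;                  /* lowest set bit of rel */
        int src = (rel - lowbit + root) % NPROCS; /* parent in the tree */
        int dst = (rel + root) % NPROCS;
        traffic[src][dst] += msg_bytes;
    }

    /* Print the resulting communication matrix. */
    for (int i = 0; i < NPROCS; i++) {
        for (int j = 0; j < NPROCS; j++)
            printf("%6ld ", traffic[i][j]);
        printf("\n");
    }
    return 0;
}

The same pattern extends to other collectives: replaying the library's algorithm offline yields per-pair byte counts, which can be merged with profiled point-to-point traffic before running an existing graph-mapping step onto the physical topology.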