Metacomputing infrastructures couple multiple clusters (or MPPs) via wide-area networks. A major problem in programming parallel applications for such platforms is their hierarchical network structure: the latency and bandwidth of WANs are often orders of magnitude worse than those of local networks. Our goal is to optimize MPI's collective operations for such platforms.

In this paper, we focus on optimized utilization of the (scarce) wide-area bandwidth. We use two techniques: selecting suitable communication graph shapes, and splitting messages into multiple segments that are sent in parallel over different WAN links. To determine the best graph shape and segment size, we introduce a performance model called parameterized LogP (P-LogP), a hierarchical extension of the LogP model that covers messages of arbitrary length. With P-LogP, the optimal segment size and the best broadcast tree shape can be determined at runtime. (For conciseness, we restrict our discussion to the broadcast operation.) An experimental performance evaluation shows that the new broadcast performs significantly better for large messages, and that the theoretical model closely matches the measured completion times.
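The idea of choosing a tree shape and segment size from a cost model can be illustrated with a small sketch. The abstract does not give the P-LogP equations, so the code below is not the paper's model: it assumes a hypothetical linear gap function `gap(m)`, a single wide-area latency, and two textbook-style completion-time estimates for a chain and a flat tree, with every segment treated as full-sized. All function names and parameter values are made up for illustration.

```python
import math


def gap(m, per_msg=1e-3, per_byte=1e-6):
    """Hypothetical gap: minimum time (s) between consecutive sends of m bytes.

    The linear form and both constants are assumptions for this sketch.
    """
    return per_msg + per_byte * m


def chain_time(msg_size, seg_size, hops, latency):
    """Estimated time for a pipelined broadcast along a chain of `hops` links.

    Simplified model: the first segment crosses every hop, and the
    remaining segments follow one gap apart.
    """
    k = math.ceil(msg_size / seg_size)  # number of segments
    return (k - 1) * gap(seg_size) + hops * (gap(seg_size) + latency)


def flat_time(msg_size, seg_size, children, latency):
    """Estimated time for a flat tree in which the root sends every segment
    to every child in turn (simplified: the root's sends are serialized)."""
    k = math.ceil(msg_size / seg_size)
    return k * children * gap(seg_size) + latency


def best_plan(msg_size, n_receivers, latency, seg_sizes):
    """Return (estimated_time, shape, segment_size) with the smallest estimate."""
    plans = []
    for s in seg_sizes:
        plans.append((chain_time(msg_size, s, n_receivers, latency), "chain", s))
        plans.append((flat_time(msg_size, s, n_receivers, latency), "flat", s))
    return min(plans)


if __name__ == "__main__":
    one_mib = 1 << 20
    candidates = [1024, 8192, 65536, one_mib]  # one_mib means "no segmentation"
    print(best_plan(one_mib, n_receivers=4, latency=0.01, seg_sizes=candidates))
```

Under these assumed costs, small segments pay the per-message term many times while large segments lose pipelining, so the estimate has a minimum at an intermediate segment size. Evaluating such a model over a handful of candidates is cheap enough to do at runtime, which is the property the abstract attributes to P-LogP.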