Bandwidth-Efficient Collective Communication for Clustered Wide Area Systems

Authors:
Affiliations:
Venue: IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Year: 2000

Citing 0
Cited 23

Message passing without send-receive
Future Generation Computer Systems - Parallel computing technologies (PaCT-2001)

Exploiting Hierarchy in Heterogeneous Environments
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium

Fast Measurement of LogP Parameters for Message Passing Platforms
IPDPS '00 Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing

Send-Recv Considered Harmful? Myths and Truths about Parallel Programming
PaCT '01 Proceedings of the 6th International Conference on Parallel Computing Technologies

Improved MPI All-to-all Communication on a Giganet SMP Cluster
Proceedings of the 9th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface

SAT: a programming methodology with skeletons and collective operations
Patterns and skeletons for parallel and distributed computing

Send-receive considered harmful: Myths and realities of message passing
ACM Transactions on Programming Languages and Systems (TOPLAS)

Improving the execution time of global communication operations
Proceedings of the 1st conference on Computing frontiers

Broadcasting on networks of workstations
Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures

Performance Modeling and Tuning Strategies of Mixed Mode Collective Communications
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing

Collective Operations for Wide-Area Message Passing Systems Using Adaptive Spanning Trees
GRID '05 Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing

The design and implementation of MPI collective operations for clusters in long-and-fast networks
Cluster Computing

Efficient high performance collective communication for the cell blade
Proceedings of the 23rd international conference on Supercomputing

Modeling advanced collective communication algorithms on cell-based systems
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

An efficient collective communication method for grid scale networks
ICCS'03 Proceedings of the 2003 international conference on Computational science

Application-oriented adaptive MPI_Bcast for grids
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

An analytical model for multilevel performance prediction of Multi-FPGA systems
ACM Transactions on Reconfigurable Technology and Systems (TRETS)

Performance modeling for multilevel communication in SHMEM+
Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model

Dynamically adaptive binomial trees for broadcasting in heterogeneous networks of workstations
VECPAR'04 Proceedings of the 6th international conference on High Performance Computing for Computational Science

Topology-Based hypercube structures for global communication in heterogeneous networks
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing

Efficient and reliable network tomography in heterogeneous networks using BitTorrent broadcasts and clustering algorithms
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

MPI vs. bittorrent: switching between large-message broadcast algorithms in the presence of bottleneck links
Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops

Efficient and reliable network tomography in heterogeneous networks using BitTorrent broadcasts and clustering algorithms
Scientific Programming - Selected Papers from Super Computing 2012


Abstract

Metacomputing infrastructures couple multiple clusters (or MPPs) via wide-area networks. A major problem in programming parallel applications for such platforms is their hierarchical network structure: the latency and bandwidth of WANs are often orders of magnitude worse than those of local networks. Our goal is to optimize MPI's collective operations for such platforms.

In this paper, we focus on optimized utilization of the (scarce) wide-area bandwidth. We use two techniques: selecting suitable communication graph shapes, and splitting messages into multiple segments that are sent in parallel over different WAN links. To determine the best graph shape and segment size, we introduce a performance model called parameterized LogP (P-LogP), a hierarchical extension of the LogP model that covers messages of arbitrary length. With P-LogP, the optimal segment size and the best broadcast tree shape can be determined at runtime. (For conciseness, we restrict our discussion to the broadcast operation.) An experimental performance evaluation shows that the new broadcast has significantly improved performance (for large messages) and that there is a close match between the theoretical model and the measured completion times.
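
The idea of choosing a segment size from a cost model can be illustrated with a small sketch. The Python code below is not the P-LogP model itself: it assumes a deliberately simplified, hypothetical cost function in which a broadcast is pipelined along a chain of identical links, each with a fixed latency and a gap g(s) that grows linearly with the segment size s. All function names, parameters, and numbers are illustrative assumptions rather than values from the paper.

```python
import math


def gap(seg_bytes, overhead_s, bytes_per_s):
    """Assumed gap g(s): minimum interval between two consecutive
    sends of seg_bytes on one link (fixed cost plus transfer time)."""
    return overhead_s + seg_bytes / bytes_per_s


def pipelined_chain_time(msg_bytes, seg_bytes, hops, latency_s,
                         overhead_s, bytes_per_s):
    """Modelled completion time of a segmented broadcast along a chain.

    The first segment crosses every hop; each remaining segment
    arrives one gap after its predecessor (toy pipeline model).
    """
    segments = math.ceil(msg_bytes / seg_bytes)
    g = gap(seg_bytes, overhead_s, bytes_per_s)
    return hops * (latency_s + g) + (segments - 1) * g


def best_segment_size(msg_bytes, candidates, **link):
    """Return the candidate segment size with the smallest modelled time."""
    return min(candidates,
               key=lambda s: pipelined_chain_time(msg_bytes, s, **link))


# Hypothetical WAN-like parameters: 10 ms latency, 1 ms per-message
# overhead, 1 MB/s bandwidth, three hops.
link = dict(hops=3, latency_s=0.010, overhead_s=0.001, bytes_per_s=1_000_000)
message = 1 << 20                                  # 1 MiB message
candidates = [1 << k for k in range(10, 21)]       # 1 KiB .. 1 MiB

best = best_segment_size(message, candidates, **link)
print("segment size:", best)
print("segmented  :", pipelined_chain_time(message, best, **link))
print("unsegmented:", pipelined_chain_time(message, message, **link))
```

Under these assumed parameters the model favours a segment size well below the full message, because very small segments pay the per-message overhead too often while very large ones lose the benefit of pipelining. A runtime selection as described in the abstract would presumably evaluate measured model parameters in a similar way, but the actual cost formulas are those of the paper, not this sketch.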