Scalable communication protocols for dynamic sparse data exchange

Authors:
Torsten Hoefler;Christian Siebert;Andrew Lumsdaine
Affiliations:
Indiana University, Bloomington, IN, USA;NEC Laboratories Europe, Sankt Augustin, Germany;Indiana University, Bloomington, IN, USA
Venue:
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Year:
2010

Abstract

Many large-scale parallel programs follow a bulk synchronous parallel (BSP) structure with distinct computation and communication phases. Although the communication phase in such programs may involve all (or a large number) of the participating processes, the actual communication operations are usually sparse in nature. As a result, communication phases are typically expressed explicitly using point-to-point communication operations or collective operations. We define the dynamic sparse data-exchange (DSDE) problem and derive bounds in the well-known LogGP model. While current approaches work well with static applications, they run into limitations as modern applications grow in scale, and as the problems that are being solved become increasingly irregular and dynamic. To enable the compact and efficient expression of the communication phase, we develop suitable sparse communication protocols for irregular applications at large scale. We discuss different irregular applications and show the sparsity in the communication for real-world input data. We discuss the time and memory complexity of commonly used protocols for the DSDE problem and develop NBX, a novel fast algorithm with constant memory overhead for solving it. Algorithm NBX improves the runtime of a sparse data-exchange among 8,192 processors on BlueGene/P by a factor of 5.6. In an application study, we show improvements of up to a factor of 28.9 for a parallel breadth-first search on 8,192 BlueGene/P processors.
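To make the DSDE problem concrete, the following is a minimal single-process sketch in Python. It is not NBX and is not taken from the paper. It assumes a hypothetical 8-rank communication pattern in which each rank knows only its own destinations. It then simulates one counting-based approach, assumed here for illustration: every rank contributes a per-destination flag, and the flags are summed so that each rank learns how many messages to expect. The names `send_lists`, `incoming_counts`, and `sources_of` are invented for this example.

```python
from typing import Dict, List


def incoming_counts(send_lists: Dict[int, List[int]], num_procs: int) -> List[int]:
    """Return, for every rank, how many distinct ranks send to it.

    This mimics a global sum over length-P flag vectors, so the simulated
    per-rank memory cost grows with the number of ranks P.
    """
    counts = [0] * num_procs
    for dests in send_lists.values():
        for dst in set(dests):  # one flag per (source, destination) pair
            counts[dst] += 1
    return counts


def sources_of(send_lists: Dict[int, List[int]], num_procs: int) -> List[List[int]]:
    """Return the sender list of each rank (unknown to the receiver in DSDE)."""
    sources: List[List[int]] = [[] for _ in range(num_procs)]
    for src in sorted(send_lists):
        for dst in sorted(set(send_lists[src])):
            sources[dst].append(src)
    return sources


# Hypothetical sparse pattern: only 4 of 8 ranks send anything.
NUM_PROCS = 8
send_lists = {0: [3], 1: [3, 5], 4: [0], 7: [5]}

counts = incoming_counts(send_lists, NUM_PROCS)
sources = sources_of(send_lists, NUM_PROCS)

for rank in range(NUM_PROCS):
    print(f"rank {rank}: expects {counts[rank]} message(s) from {sources[rank]}")
```

In this toy pattern only 5 of the 56 possible sender-receiver pairs carry data, yet the counting step still touches a vector with one entry per rank. That linear cost is the kind of overhead a protocol with constant memory overhead is meant to avoid.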