Many scientific applications operate in a bulk-synchronous mode of iterative communication and computation steps. Even though the communication steps happen at the same logical time, important patterns such as stencil computations cannot be expressed as collective communications in MPI. We demonstrate how neighborhood collective operations allow users to specify arbitrary collective communication relations at run-time and enable optimizations similar to those for traditional collective calls. We present a number of optimization opportunities and algorithms for different communication scenarios. We also show how users can assert constraints that expose additional optimization opportunities in a portable way. We demonstrate the utility of all described optimizations in a highly optimized implementation of neighborhood collective operations. Our communication and protocol optimizations yield a performance improvement of up to a factor of two for small stencil communications. We found that, for some patterns, our optimization heuristics automatically generate communication schedules comparable to hand-tuned collectives. With these optimizations in place, we are able to accelerate arbitrary collective communication patterns, such as regular and irregular stencils, using optimization methods for collective communications. We expect that our methods will influence the design of future MPI libraries and provide significant performance benefits on large-scale systems.