Many scientific applications operate in a bulk-synchronous mode of iterative communication and computation steps. Even though the communication steps happen at the same logical time, important patterns such as stencil computations cannot be expressed as collective communications in MPI. We demonstrate how neighborhood collective operations allow users to specify arbitrary collective communication relations at run-time and enable optimizations similar to those for traditional collective calls. We present a number of optimization opportunities and algorithms for different communication scenarios. We also show how users can assert constraints that expose additional optimization opportunities in a portable way. We demonstrate the utility of all described optimizations in a highly optimized implementation of neighborhood collective operations. Our communication and protocol optimizations yield a performance improvement of up to a factor of two for small stencil communications. We found that, for some patterns, our optimization heuristics automatically generate communication schedules comparable to hand-tuned collectives. With these optimizations in place, we are able to accelerate arbitrary collective communication patterns, such as regular and irregular stencils, using optimization methods for collective communications. We expect that our methods will influence the design of future MPI libraries and provide significant performance benefits on large-scale systems.