Runtime detection and optimization of collective communication patterns

Authors:
Torsten Hoefler;Timo Schneider
Affiliations:
ETH Zurich, Zurich, Switzerland;University of Illinois at Urbana-Champaign, Urbana, IL, USA
Venue:
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Year:
2012

Abstract

As parallelism steadily grows, remote-data access will soon dominate the execution time of large-scale applications. Many large-scale communication patterns expose significant structure that can be used to schedule communications accordingly. In this work, we identify concurrent communication patterns and transform them into semantically equivalent but faster communications. We present a directed acyclic graph formulation for communication schedules and concisely define their synchronization and data movement semantics. Our dataflow solver computes an internal representation (IR) that is amenable to pattern detection. We demonstrate a detection algorithm for our IR that is guaranteed to detect communication kernels on subsets of the graph and replaces each matched subgraph with a hardware-accelerated or hand-tuned kernel. These techniques are implemented in an open-source detection and transformation framework for optimizing communication patterns. Experiments show that our techniques can improve the performance of representative example codes by several orders of magnitude on two different systems. However, we also show that some collective detection problems on process subsets are NP-hard. The developed analysis techniques are an important first step towards automatic large-scale communication transformations, and they open several avenues for additional transformation heuristics and analyses. We expect that such communication analyses and transformations will become as natural as pattern detection, just-in-time compiler optimizations, and autotuning are today for serial codes.
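
To make the idea of detecting a collective in a set of point-to-point transfers concrete, the following is a minimal illustrative sketch. It is not the paper's algorithm or IR. It assumes a hypothetical schedule given as (source, destination, buffer) tuples and recognizes only the simplest case: one rank sending the same buffer to every other rank, which can be replaced by a single broadcast. The function name detect_broadcasts and the tuple encoding are assumptions made for this example.

```python
from collections import defaultdict


def detect_broadcasts(sends, ranks):
    """Find linear broadcasts in a list of point-to-point sends.

    sends: list of (src, dst, buf) tuples (hypothetical encoding).
    ranks: all ranks taking part in the schedule.
    Returns (collectives, remaining), where collectives holds one
    ("bcast", root, buf) entry per detected broadcast and remaining
    holds the sends that were not absorbed into a collective.
    """
    all_ranks = set(ranks)

    # Group destinations by (source, buffer): a broadcast is a source
    # that delivers the same buffer to every other rank.
    targets = defaultdict(set)
    for src, dst, buf in sends:
        targets[(src, buf)].add(dst)

    collectives = []
    covered = set()
    for (src, buf), dsts in targets.items():
        if dsts == all_ranks - {src}:
            collectives.append(("bcast", src, buf))
            covered.add((src, buf))

    # Keep every send that is not part of a detected broadcast.
    remaining = [s for s in sends if (s[0], s[2]) not in covered]
    return collectives, remaining


# Rank 0 sends buffer "a" to all other ranks; rank 1 sends "b" to rank 2 only.
sends = [(0, 1, "a"), (0, 2, "a"), (0, 3, "a"), (1, 2, "b")]
collectives, remaining = detect_broadcasts(sends, ranks=[0, 1, 2, 3])
print(collectives)
print(remaining)
```

This sketch matches only over the full set of ranks. Detecting collectives on arbitrary process subsets is the harder problem that the abstract notes can be NP-hard, and the sketch does not attempt it.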