As parallelism steadily grows, remote-data access will soon dominate the execution time of large-scale applications. Many large-scale communication patterns expose significant structure that can be exploited when scheduling communications. In this work, we identify concurrent communication patterns and transform them into semantically equivalent but faster communications. We present a directed acyclic graph formulation for communication schedules and concisely define their synchronization and data movement semantics. Our dataflow solver computes an internal representation (IR) that is amenable to pattern detection. We demonstrate a detection algorithm for our IR that is guaranteed to detect communication kernels on subsets of the graph and that replaces the matched subgraph with hardware-accelerated or hand-tuned kernels. These techniques are implemented in an open-source detection and transformation framework for optimizing communication patterns. Experiments show that our techniques can improve the performance of representative example codes by several orders of magnitude on two different systems. However, we also show that some collective detection problems on process subsets are NP-hard. The developed analysis techniques are a first important step towards automatic large-scale communication transformations, and they open several avenues for additional transformation heuristics and analyses. We expect that such communication analyses and transformations will become as natural as pattern detection, just-in-time compiler optimizations, and autotuning are today for serial codes.
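To make the idea of detecting a communication kernel and replacing it more concrete, the following is a minimal sketch and not the framework's actual IR or algorithm. It assumes a schedule is simply a list of point-to-point `(src, dst)` sends within a process group, and it recognizes only one pattern, a linear broadcast in which a single root sends to every other rank. The names `detect_linear_broadcast` and `rewrite`, and the `("bcast", ...)` tuple that stands in for a tuned kernel, are hypothetical.

```python
def detect_linear_broadcast(sends, group):
    """Return the root rank if `sends` is exactly one root sending once to
    every other rank in `group`; otherwise return None.

    Assumption: `sends` is a list of (src, dst) rank pairs. This flat edge
    list is an illustrative stand-in for a real schedule graph.
    """
    group = set(group)
    edges = set(sends)
    if len(edges) != len(sends):
        return None  # duplicate messages: not a plain broadcast
    sources = {src for src, _ in edges}
    if len(sources) != 1:
        return None  # zero or several senders
    root = next(iter(sources))
    if root not in group:
        return None
    expected = {(root, dst) for dst in group if dst != root}
    return root if edges == expected else None


def rewrite(sends, group):
    """Replace a detected linear broadcast with a single collective marker.

    The ("bcast", root, ranks) tuple is a hypothetical placeholder for a
    hardware-accelerated or hand-tuned kernel; unmatched schedules are
    returned unchanged as individual sends.
    """
    root = detect_linear_broadcast(sends, group)
    if root is None:
        return [("send", src, dst) for src, dst in sends]
    return [("bcast", root, tuple(sorted(group)))]
```

For example, `rewrite([(0, 1), (0, 2)], [0, 1, 2])` yields a single `bcast` entry rooted at rank 0, whereas a schedule that covers only part of the group is left as plain sends. This sketch checks one fixed group only; searching over all process subsets is the harder problem the abstract refers to.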