Optimized implementations of blocking and nonblocking collective operations are essential for scalable high-performance applications. Offloading such collective operations into the communication layer can improve both their performance and their asynchronous progression. However, such offloading schemes must remain flexible in order to support user-defined (sparse neighbor) collective communications. In this work, we describe an operating system kernel-based architecture that implements an interpreter for the flexible Group Operation Assembly Language (GOAL) framework to offload collective communications. We describe an optimized scheme for storing the schedules that define the collective operations and present an extension for profiling the performance of the kernel layer. Our microbenchmarks demonstrate the effectiveness of the approach and show performance improvements over traditional progression in user space. We also discuss complications with the design and with offloading strategies in general.
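To make the idea of a schedule interpreter concrete, the following is a minimal sketch and not the kernel implementation described in this work. It assumes that a schedule is a list of send, receive, and local computation operations, each naming the operations it depends on, and that an interpreter may run an operation once all of its dependencies have finished. The names `Op` and `run_schedule` are hypothetical, and no real communication is performed; the sketch only computes the order in which operations become executable.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """One schedule entry (hypothetical layout, for illustration only)."""
    kind: str                                  # "send", "recv", or "calc"
    peer: int = -1                             # partner rank; unused for "calc"
    deps: list = field(default_factory=list)   # indices of ops that must finish first


def run_schedule(ops):
    """Return op indices in the order a dependency-driven interpreter could run them."""
    done = set()
    order = []
    while len(done) < len(ops):
        # An op is ready when it has not run yet and all its dependencies have.
        ready = [i for i, op in enumerate(ops)
                 if i not in done and all(d in done for d in op.deps)]
        if not ready:
            raise ValueError("schedule has a dependency cycle")
        for i in ready:
            # A real interpreter would post the send/recv or run the
            # computation here and mark it done only on completion.
            order.append(i)
            done.add(i)
    return order


# One rank's part of a tree broadcast (assumed example): receive from the
# parent, then forward to two children once the data has arrived.
schedule = [
    Op("send", peer=3, deps=[1]),
    Op("recv", peer=0),
    Op("send", peer=5, deps=[1]),
]
print(run_schedule(schedule))
```

In this sketch both sends become ready in the same round, directly after the receive completes, which is the kind of progression an offloaded interpreter could drive without involving the application.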