Kernel-based offload of collective operations: implementation, evaluation and lessons learned

Authors:
Timo Schneider; Sven Eckelmann; Torsten Hoefler; Wolfgang Rehm
Affiliations:
TU Chemnitz, Germany; TU Chemnitz, Germany; University of Illinois at Urbana-Champaign, IL; TU Chemnitz, Germany
Venue:
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Year:
2011

Abstract

Optimized implementations of blocking and nonblocking collective operations are essential for scalable high-performance applications. Offloading such collective operations into the communication layer can improve both their performance and their asynchronous progression. However, such offloading schemes must remain flexible in order to support user-defined (sparse neighbor) collective communications. In this work, we describe an operating system kernel-based architecture that implements an interpreter for the flexible Group Operation Assembly Language (GOAL) framework to offload collective communications. We describe an optimized scheme for storing the schedules that define the collective operations and present an extension for profiling the performance of the kernel layer. Our microbenchmarks demonstrate the effectiveness of the approach and show performance improvements over traditional progression in user space. We also discuss complications with the design and with offloading strategies in general.
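
To make the idea of a schedule interpreter more concrete, the following is a minimal user-space sketch in Python. It is not the kernel implementation described in the paper, and it does not use GOAL's actual syntax or storage format. It assumes only that a schedule is a set of send, receive and local-computation operations linked by dependencies; the `Op` class, its fields, and `run_schedule` are hypothetical names introduced for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Op:
    """One schedule entry. All field names are hypothetical."""
    name: str
    kind: str                     # "send", "recv", or "calc"
    peer: Optional[int] = None    # partner rank for send/recv
    deps: List[str] = field(default_factory=list)  # ops that must finish first


def run_schedule(ops):
    """Return op names in an order that respects every dependency."""
    pending = {op.name: len(op.deps) for op in ops}   # unmet deps per op
    dependents = {op.name: [] for op in ops}          # reverse edges
    for op in ops:
        for dep in op.deps:
            dependents[dep].append(op.name)

    ready = deque(name for name, count in pending.items() if count == 0)
    order = []
    while ready:
        name = ready.popleft()
        # A real interpreter would post the send/recv or run the
        # computation here and continue once it has completed.
        order.append(name)
        for nxt in dependents[name]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(ops):
        raise ValueError("schedule contains a dependency cycle")
    return order


# Hypothetical schedule for one inner rank of a tree broadcast:
# receive from the parent, then forward to two children.
schedule = [
    Op("recv_parent", "recv", peer=0),
    Op("send_child_a", "send", peer=3, deps=["recv_parent"]),
    Op("send_child_b", "send", peer=4, deps=["recv_parent"]),
]
order = run_schedule(schedule)
print(order)  # ['recv_parent', 'send_child_a', 'send_child_b']
```

Because the schedule is plain data, the same interpreter loop can execute any collective, including user-defined sparse ones, which is the kind of flexibility the abstract asks of an offloading scheme.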