Non-blocking communications are widely used in parallel applications to hide communication overheads by overlapping computation with communication. While most existing implementations provide non-blocking versions of point-to-point communications, there is no portable and efficient implementation of non-blocking collectives, partly because the application's execution context must be interrupted to progress dependent communications. This paper presents a portable and efficient user-level implementation technique for non-blocking communications. It allows users to design non-blocking collectives by declaring their operations and dependencies through the provided APIs, without being concerned with the complicated management of their progression. Although user-level implementations can be less efficient than kernel-level ones due to the cost of OS context switches, we avoid this problem by employing the Marcel user-level lightweight thread library to invoke communication operations. Specifically, each communication operation is mapped to one Marcel thread and scheduled for execution once the operation's dependencies are satisfied by the corresponding events. All executable operations and the main user thread run concurrently without any explicit invocation. Performance evaluations with micro-benchmarks demonstrate the effectiveness of the proposed technique: compared to an existing OS-thread-based method, it reduces CPU load to less than 10% while achieving similar communication latencies. We also discuss and compare the descriptive power of internal expressions for non-blocking communications.
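The core idea described above — mapping each communication operation to its own lightweight thread that runs only once its dependencies are satisfied — can be sketched as follows. This is a minimal illustrative model, not the paper's actual API: it uses standard Python threads in place of Marcel threads, and all names (`Operation`, `action`, `deps`) are hypothetical.

```python
import threading

class Operation:
    """One communication operation, mapped to its own thread.

    The thread blocks until every dependency has signalled completion,
    mimicking the described scheme in which an operation executes only
    after its dependencies are satisfied by completion events.
    """
    def __init__(self, name, action, deps=()):
        self.name = name
        self.action = action            # callable standing in for a send/recv
        self.deps = deps                # operations that must finish first
        self.done = threading.Event()   # signalled when this op completes
        self.thread = threading.Thread(target=self._run)

    def _run(self):
        for dep in self.deps:           # wait for all dependency events
            dep.done.wait()
        self.action()
        self.done.set()                 # unblock any dependent operations

    def start(self):
        self.thread.start()

# Example: a two-step "collective" where send depends on recv.
log = []
recv = Operation("recv", lambda: log.append("recv"))
send = Operation("send", lambda: log.append("send"), deps=(recv,))

# Start in reverse order: the dependency, not start order, enforces ordering.
send.start()
recv.start()
send.done.wait()
print(log)  # -> ['recv', 'send']
```

With real OS threads this dependency machinery would incur kernel context-switch costs on every event; the point of using a user-level thread library such as Marcel is that these blocked "operations" can be scheduled and woken entirely in user space.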