Data packing before and after communication can account for as much as 90% of the communication time on modern computers. Despite MPI's well-defined datatype interface for non-contiguous data access, many codes use manual pack loops for performance reasons. Programmers write access-pattern-specific pack loops (e.g., with manual unrolling) for which compilers emit optimized code. In contrast, the MPI implementations in use today interpret datatypes at pack time, which incurs high overheads. In this work we explore the effectiveness of runtime compilation techniques that generate efficient, optimized pack code for MPI datatypes at commit time. Thus, none of the overhead of datatype interpretation is incurred at pack time, and pack setup is as fast as calling a function pointer. We have implemented a library called libpack that can be used to compile MPI datatypes and to (un)pack data with them. The library optimizes the datatype representation and uses the LLVM framework to produce vectorized machine code for each datatype at commit time. We show several examples of how MPI datatype pack functions benefit from runtime compilation and analyze the performance of the compiled pack functions for data access patterns found in many applications. The pack/unpack functions generated by our packing library are seven times faster than those of prevalent MPI implementations for 73% of the datatypes used in a scientific application, and in many cases they outperform manual pack loops.