OMPI: optimizing MPI programs using partial evaluation
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
MPI-The Complete Reference, Volume 1: The MPI Core
MPI: A Message-Passing Interface Standard
Automatic generation and tuning of MPI collective communication routines
Proceedings of the 19th annual international conference on Supercomputing
An MPI prototype for compiled communication on Ethernet switched clusters
Journal of Parallel and Distributed Computing - Special issue: Design and performance of networks for super-, cluster-, and grid-computing: Part I
An expert assistant for computer aided parallelization
PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Automatic Run-time Parallelization and Transformation of I/O
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Run-time analysis and instrumentation for communication overlap potential
EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
HPC users frequently develop and run their MPI programs without optimizing communication, leading to poor performance on clusters connected with commodity Gigabit Ethernet. Unfortunately, optimizing communication patterns often reduces the clarity and maintainability of the code, and users prefer to focus on the application problem rather than on the tool used to solve it. In this paper, we present a new method for automatically optimizing the communication of any MPI application. A library injected into the application intercepts all MPI calls. The memory associated with each MPI request is protected using hardware-supported memory protection, so the request can proceed in the background as an asynchronous operation while the application continues as if the request had already completed. When the application accesses the protected data, a page fault occurs, and the injected library waits for the background transfer to finish before allowing the application to proceed. On our test cases, performance close to that of manual optimization is observed on Gigabit Ethernet clusters.