Grouping MPI Processes for Partial Checkpoint and Co-migration

Authors:
Rajendra Singh;Peter Graham
Affiliations:
Dept. of Computer Science, University of Manitoba, Winnipeg, Canada R3T 2N2;Dept. of Computer Science, University of Manitoba, Winnipeg, Canada R3T 2N2
Venue:
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Year:
2009

Abstract

When shared resources are used for parallel computing, performance guarantees cannot be made. When the load on one node increases, a process running on that node will experience slowdown, which can quickly affect overall application performance. Slow-running processes can be checkpointed and migrated to more lightly loaded nodes to sustain application performance. To do this, however, it must be possible to: 1) identify affected processes and 2) checkpoint and migrate them independently of other processes, which will continue to run. A problem occurs when a slow-running process communicates frequently with other processes. In such cases, migrating the single process is insufficient: the communicating processes will quickly block waiting to communicate with the migrating process, preventing them from making progress. Also, if a process is migrated "far" from those it communicates with frequently, performance will be adversely affected. To address this problem, we present an approach to identify and group processes which we expect to be frequent communicators in the near future. Then, when one or more processes are performing poorly, the entire group is checkpointed and co-migrated. This helps to improve overall application performance in shared resource environments.
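
The grouping and co-migration idea can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not say how future frequent communicators are predicted, so the sketch assumes a simple stand-in, counting messages per pair of ranks in a recent trace and treating pairs at or above a threshold as frequent. The function names, the `(src, dst)` trace format, and the `min_count` parameter are all hypothetical.

```python
from collections import Counter


def group_frequent_communicators(messages, num_procs, min_count):
    """Group ranks linked by pairs that exchanged at least min_count messages.

    messages is a hypothetical trace of (src_rank, dst_rank) tuples.
    The count threshold is an assumed stand-in for whatever prediction
    of near-future communication a real system would use.
    """
    pair_counts = Counter()
    for src, dst in messages:
        if src != dst:
            # Count each pair regardless of message direction.
            pair_counts[frozenset((src, dst))] += 1

    # Union-find over ranks: frequent pairs end up in the same group.
    parent = list(range(num_procs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair, count in pair_counts.items():
        if count >= min_count:
            a, b = pair
            parent[find(a)] = find(b)

    groups = {}
    for rank in range(num_procs):
        groups.setdefault(find(rank), set()).add(rank)
    return list(groups.values())


def co_migration_set(groups, slow_ranks):
    """Return every rank that should be checkpointed and migrated together."""
    slow = set(slow_ranks)
    selected = set()
    for group in groups:
        if group & slow:  # any slow member pulls in its whole group
            selected |= group
    return selected


if __name__ == "__main__":
    trace = [(0, 1)] * 5 + [(1, 2)] * 4 + [(2, 3)] * 1 + [(3, 4)] * 3
    groups = group_frequent_communicators(trace, num_procs=6, min_count=3)
    print(sorted(co_migration_set(groups, slow_ranks=[1])))
```

In this toy trace, ranks 0, 1, and 2 form one group through the frequent pairs 0-1 and 1-2, so a slowdown on rank 1 selects all three for co-migration, while rank 5, which sent no messages, would be migrated alone.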