Grouping MPI Processes for Partial Checkpoint and Co-migration

  • Authors:
  • Rajendra Singh; Peter Graham

  • Affiliations:
  • Dept. of Computer Science, University of Manitoba, Winnipeg, Canada R3T 2N2; Dept. of Computer Science, University of Manitoba, Winnipeg, Canada R3T 2N2

  • Venue:
  • Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
  • Year:
  • 2009

Abstract

When shared resources are used for parallel computing, performance guarantees cannot be made. When the load on one node increases, a process running on that node will slow down, and this can quickly degrade overall application performance. Slow-running processes can be checkpointed and migrated to more lightly loaded nodes to sustain application performance. To do this, however, it must be possible to 1) identify affected processes and 2) checkpoint and migrate them independently of the other processes, which continue to run. A problem occurs when a slow-running process communicates frequently with other processes. In such cases, migrating the single process is insufficient: the communicating processes will quickly block waiting to communicate with the migrating process, which prevents them from making progress. Also, if a process is migrated "far" from those it communicates with frequently, performance will be adversely affected. To address this problem, we present an approach to identify and group processes that we expect to be frequent communicators in the near future. Then, when one or more processes in a group are performing poorly, the entire group is checkpointed and co-migrated. This helps to improve overall application performance in shared-resource environments.
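The grouping idea can be illustrated with a minimal sketch. It is not the authors' method: the abstract speaks of predicting near-future communication, whereas this sketch simply assumes that observed pairwise message counts are available and treats any pair at or above a hypothetical `threshold` as frequent communicators. The function names `group_frequent_communicators` and `ranks_to_migrate`, the `msg_counts` layout, and the sample numbers are all illustrative assumptions.

```python
from collections import defaultdict


def group_frequent_communicators(msg_counts, threshold):
    """Group ranks linked by pairwise message counts >= threshold.

    msg_counts: dict mapping (rank_a, rank_b) -> observed message count
                (assumed input; a real system would need to collect this)
    threshold:  hypothetical cutoff for "frequent" communication
    Returns a sorted list of sorted rank lists, one per group.
    """
    parent = {}

    def find(r):
        # Union-find lookup with path halving.
        parent.setdefault(r, r)
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller rank as the group representative.
            parent[max(ra, rb)] = min(ra, rb)

    for (a, b), count in msg_counts.items():
        find(a)  # register both ranks even if they stay ungrouped
        find(b)
        if count >= threshold:
            union(a, b)

    groups = defaultdict(set)
    for r in list(parent):
        groups[find(r)].add(r)
    return sorted(sorted(g) for g in groups.values())


def ranks_to_migrate(groups, slow_ranks):
    """Return every rank in any group containing a slow rank."""
    slow = set(slow_ranks)
    return sorted(r for g in groups if slow & set(g) for r in g)


# Illustrative message counts, not measured data.
msg_counts = {(0, 1): 120, (1, 2): 95, (2, 3): 3, (3, 4): 80, (5, 0): 1}
groups = group_frequent_communicators(msg_counts, threshold=50)
```

With these sample counts, ranks 0, 1, and 2 end up in one group, so a slowdown detected on rank 1 would select all three for checkpointing and co-migration, while ranks 3 and 4 keep running where they are. Union-find is used here only because it merges transitive links cheaply; a weighted clustering method could serve equally well.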