Using checkpointing to recover from poor multi-site parallel job scheduling decisions

Authors:
William M. Jones
Affiliations:
United States Naval Academy, Annapolis, MD
Venue:
Proceedings of the 5th international workshop on Middleware for grid computing: held at the ACM/IFIP/USENIX 8th International Middleware Conference
Year:
2007

Abstract

Recent research in multi-site parallel job scheduling leverages user-provided estimates of job communication characteristics to effectively partition the job across multiple clusters. Previous research addressed the impact of inaccuracies in these estimates on overall system performance and found that multi-site scheduling techniques benefit from these estimates, even in the presence of considerable inaccuracy. While these results are encouraging, there are many instances where these errors result in poor scheduling decisions that cause network over-subscription. This situation can lead to significantly degraded application runtime performance and turnaround time. In this paper, we explore the use of job checkpointing to selectively stop offending jobs in order to alleviate network congestion and subsequently restart them when (and where) sufficient network resources are available. We then characterize the conditions and the extent to which checkpointing improves overall performance. We demonstrate that checkpointing is beneficial even when the overhead of doing so is costly.
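
The recovery idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state how offending jobs are chosen or when they are restarted, so everything below is an assumption made for illustration: the `Job` fields, the function names `select_jobs_to_checkpoint` and `can_restart`, the bandwidth units, and the policy of stopping the jobs that most exceed their user-provided estimate first. It is not the paper's algorithm.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    estimated_bw: float  # user-provided bandwidth estimate (hypothetical units)
    actual_bw: float     # bandwidth the job really demands on the shared link


def select_jobs_to_checkpoint(jobs, link_capacity):
    """Pick jobs to checkpoint and stop until demand fits the link.

    Assumed policy (not taken from the paper): stop the jobs whose
    actual demand most exceeds their estimate first.
    """
    demand = sum(job.actual_bw for job in jobs)
    stopped = []
    by_excess = sorted(
        jobs, key=lambda job: job.actual_bw - job.estimated_bw, reverse=True
    )
    for job in by_excess:
        if demand <= link_capacity:
            break  # link is no longer over-subscribed
        stopped.append(job)
        demand -= job.actual_bw
    return stopped


def can_restart(job, free_bw):
    """A stopped job may restart where enough bandwidth is free."""
    return job.actual_bw <= free_bw


if __name__ == "__main__":
    running = [
        Job("A", estimated_bw=40, actual_bw=40),
        Job("B", estimated_bw=20, actual_bw=50),
        Job("C", estimated_bw=25, actual_bw=30),
    ]
    for job in select_jobs_to_checkpoint(running, link_capacity=100):
        print("checkpoint and stop:", job.name)
```

In this example the three jobs demand 120 units on a link of capacity 100, so the job with the largest underestimate is stopped and later restarted once `can_restart` reports enough free bandwidth. A real scheduler would also need to weigh the checkpoint overhead, which the abstract identifies as a key factor.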