PDCS '07 Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems
Recent research in multi-site parallel job scheduling leverages user-provided estimates of job communication characteristics to partition jobs effectively across multiple clusters. Previous research addressed the impact of inaccuracies in these estimates on overall system performance and found that multi-site scheduling techniques benefit from the estimates even when they are considerably inaccurate. While these results are encouraging, in many instances estimation errors lead to poor scheduling decisions that over-subscribe the network. This situation can significantly degrade application runtime performance and turnaround time. In this paper, we explore the use of job checkpointing to selectively stop offending jobs, alleviating network congestion, and to restart them when (and where) sufficient network resources are available. We then characterize the conditions under which, and the extent to which, checkpointing improves overall performance. We demonstrate that checkpointing is beneficial even when its overhead is high.
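The stop-and-restart idea can be illustrated with a minimal sketch. It is not the paper's algorithm: the `Job` type, the bandwidth units, the greedy "heaviest consumer first" rule, and the function names are all assumptions made for illustration, and checkpoint overhead is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    bandwidth: float  # observed inter-cluster bandwidth use (hypothetical units)


def select_jobs_to_checkpoint(jobs, link_capacity):
    """Choose jobs to stop until the link is no longer over-subscribed.

    Assumed heuristic: stop the heaviest bandwidth consumers first.
    """
    load = sum(job.bandwidth for job in jobs)
    stopped = []
    for job in sorted(jobs, key=lambda j: j.bandwidth, reverse=True):
        if load <= link_capacity:
            break
        stopped.append(job)
        load -= job.bandwidth
    return stopped


def select_jobs_to_restart(stopped, running, link_capacity):
    """Return checkpointed jobs that fit into the spare link capacity."""
    spare = link_capacity - sum(job.bandwidth for job in running)
    restart = []
    for job in stopped:  # earliest-stopped first
        if job.bandwidth <= spare:
            restart.append(job)
            spare -= job.bandwidth
    return restart


if __name__ == "__main__":
    a, b, c = Job("a", 6.0), Job("b", 3.0), Job("c", 2.0)
    stopped = select_jobs_to_checkpoint([a, b, c], link_capacity=8.0)
    print([job.name for job in stopped])
    # Once the other jobs finish, the stopped job fits on the link again.
    print([job.name for job in select_jobs_to_restart(stopped, [], 8.0)])
```

A fuller model would also weigh the cost of writing and restoring each checkpoint against the expected gain from relieving congestion, which is the trade-off the paper characterizes.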