PV-EASY: a strict fairness guaranteed and prediction enabled scheduler in parallel job scheduling
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Job failures in high performance computing systems: A large-scale empirical study
Computers & Mathematics with Applications
Improvements to the structural simulation toolkit
Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques
Hi-index | 0.01 |
In this paper, we examine the concept of giving every job a trial run before committing it to run until completion. Trial runs allow immediate job failures to be detected shortly after job submission and benefit short jobs by letting them run and finish early. This occurs without inflicting a significant penalty on longer jobs, whose average and maximum waiting time are actually improved in some cases. The strategy does not require preemption and instead uses the ability to kill and restart a job from the beginning, which it does at most once for each job. While others have proposed similar strategies, our algorithm is distinguished by its determination to give each job a fixed-length trial run as soon as possible. Our study is also more focused, including a detailed description of the algorithm and an examination of the effect of varying the length of a trial run.