Performance implications of failures in large-scale cluster scheduling

  • Authors:
  • Yanyong Zhang; Mark S. Squillante; Anand Sivasubramaniam; Ramendra K. Sahoo

  • Affiliations:
  • Department of Electrical and Computer Engineering, Rutgers University, Piscataway, NJ; Mathematical Sciences Department, IBM T.J. Watson Research Center, Yorktown Heights, NY; Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA; Exploratory Server Systems Department, IBM T.J. Watson Research Center, Yorktown Heights, NY

  • Venue:
  • JSSPP'04: Proceedings of the 10th International Conference on Job Scheduling Strategies for Parallel Processing
  • Year:
  • 2004

Abstract

As large-scale parallel systems continue to evolve, many of them employing hundreds of computing engines to take on mission-critical roles, it is crucial to design those systems to anticipate and accommodate failures. Failures are a commonplace feature of such large-scale systems, and one can no longer treat them as an exception. Despite the current and increasing importance of failures in these systems, our understanding of their performance impact on parallel computing environments is extremely limited. In this paper we develop a general failure modeling framework based on recent results from large-scale clusters, and then exploit this framework to conduct a detailed analysis of the impact of failures on system performance for a wide range of scheduling policies. Our results demonstrate that such failures can have a significant impact on the mean job response time and mean job slowdown under existing scheduling policies that ignore failures. We therefore investigate different scheduling mechanisms and policies to address these performance issues. Our results show that periodic checkpointing of jobs seems to do little to ease this problem. On the other hand, we demonstrate that information about the spatial and temporal correlation of failure occurrences can be very useful in designing a scheduling (job allocation) strategy to enhance system performance, with spatial correlation providing the greater benefit.