Performance implications of failures in large-scale cluster scheduling

Authors:
Yanyong Zhang;Mark S. Squillante;Anand Sivasubramaniam;Ramendra K. Sahoo
Affiliations:
Department of Electrical and Computer Engineering, Rutgers University, Piscataway, NJ;Mathematical Sciences Department, IBM T.J. Watson Research Center, Yorktown Heights, NY;Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA;Exploratory Server Systems Department, IBM T.J. Watson Research Center, Yorktown Heights, NY
Venue:
JSSPP'04 Proceedings of the 10th international conference on Job Scheduling Strategies for Parallel Processing
Year:
2004

Abstract

As parallel systems grow in scale, many of them employing hundreds of computing engines in mission-critical roles, it is crucial to design them to anticipate and accommodate failures. Failures are commonplace in such large-scale systems and can no longer be treated as exceptions. Despite their current and growing importance, our understanding of how failures affect the performance of parallel computing environments is extremely limited. In this paper we develop a general failure modeling framework based on recent results from large-scale clusters, and we then use this framework to conduct a detailed analysis of the impact of failures on system performance for a wide range of scheduling policies. Our results demonstrate that failures can have a significant impact on mean job response time and mean job slowdown under existing scheduling policies that ignore failures. We therefore investigate different scheduling mechanisms and policies to address these performance issues. Our results show that periodic checkpointing of jobs appears to do little to ease this problem. On the other hand, we demonstrate that information about the spatial and temporal correlation of failure occurrences can be very useful in designing a scheduling (job allocation) strategy to enhance system performance, with spatial correlation information providing the greater benefit.
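The two metrics named in the abstract, mean job response time and mean job slowdown, can be illustrated with a minimal sketch. It is not the paper's failure model or simulator. It assumes, purely for illustration, that a failure striking a running job discards all of its work, that the job restarts immediately with no repair delay and no checkpointing, and that slowdown is response time divided by failure-free runtime. The function names and the job tuples are hypothetical.

```python
def completion_time(start, runtime, failure_times):
    """Return when a job finishes if every failure during execution
    forces a restart from scratch (assumed: no checkpointing and
    zero repair delay)."""
    attempt_start = start
    for f in sorted(failure_times):
        if f < attempt_start:
            continue  # failure happened before the current attempt began
        if f < attempt_start + runtime:
            attempt_start = f  # work is lost; restart at the failure time
        else:
            break  # the current attempt finishes before this failure
    return attempt_start + runtime


def mean_metrics(jobs):
    """jobs: list of (arrival, start, runtime, failure_times) tuples.
    Returns (mean response time, mean slowdown)."""
    responses, slowdowns = [], []
    for arrival, start, runtime, failure_times in jobs:
        response = completion_time(start, runtime, failure_times) - arrival
        responses.append(response)
        # Assumed definition: slowdown = response time / failure-free runtime.
        slowdowns.append(response / runtime)
    n = len(jobs)
    return sum(responses) / n, sum(slowdowns) / n


# Hypothetical workload: the first job is hit by one failure at t = 4.
jobs = [
    (0.0, 0.0, 10.0, [4.0]),
    (2.0, 5.0, 5.0, []),
]
mean_response, mean_slowdown = mean_metrics(jobs)
print(mean_response, mean_slowdown)
```

Without the failure, the first job would finish at time 10 with a slowdown of 1.0. Under the restart assumption above, the single failure at time 4 pushes its completion to time 14 and its slowdown to 1.4, which raises both means.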