We demonstrate a framework for improving the availability of cluster-based Internet services. Our approach models Internet services as a collection of interconnected components, each possessing well-defined interfaces and failure semantics. Such a decomposition allows designers to engineer high availability based on an understanding of the interconnections and the isolated fault behavior of each component, rather than relying on ad hoc methods. In this work, we focus on using the entire commodity workstation as a component because it possesses natural, fault-isolated interfaces. We define a failure event as a reboot, not only because a workstation is unavailable during a reboot, but also because reboots are symptomatic of a larger class of failures, such as configuration and operator errors. Our observations of three distinct clusters show that the time between reboots is best modeled by a Weibull distribution with shape parameters less than 1, implying that a workstation becomes more reliable the longer it has been operating. Leveraging this observed property, we design an allocation strategy that withholds recently rebooted workstations from active service, validating their stability before allowing them to return to service. We show via simulation that this policy leads to a 70-30 rule of thumb: at constant utilization, approximately 70% of workstation failures can be masked from end clients with 30% extra capacity added to the cluster, provided reboots are not strongly correlated. We also found that our technique is most sensitive to the burstiness of reboots rather than to the absolute lengths of workstation uptimes.
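The intuition behind the withholding policy can be illustrated with a short sketch. For a Weibull distribution with shape parameter below 1, the conditional probability of surviving an additional interval grows with elapsed uptime, so a freshly rebooted workstation is the least trustworthy and a long-running one the most. The shape and scale values below are illustrative placeholders, not parameters fitted from the paper's clusters:

```python
import math

def weibull_survival(t, shape, scale):
    """S(t) = P(time between reboots > t) for a Weibull distribution."""
    return math.exp(-((t / scale) ** shape))

def cond_survival(uptime, horizon, shape, scale):
    """P(node stays up `horizon` more hours | already up `uptime` hours)."""
    return weibull_survival(uptime + horizon, shape, scale) / \
           weibull_survival(uptime, shape, scale)

# Illustrative parameters only: shape < 1, as observed for the clusters
# in the paper; the scale (in hours) is an arbitrary choice for this demo.
SHAPE, SCALE = 0.6, 100.0

fresh = cond_survival(0.0, 24.0, SHAPE, SCALE)      # just rebooted
seasoned = cond_survival(100.0, 24.0, SHAPE, SCALE)  # up for 100 hours

# With shape < 1 the hazard rate decreases over time, so the seasoned
# node is more likely to survive the next 24 hours than the fresh one.
assert seasoned > fresh
```

This decreasing hazard rate is what makes it profitable to hold a rebooted workstation out of active service for a validation window: by the time it is readmitted, it has already survived its riskiest period.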