Using fault injection and modeling to evaluate the performability of cluster-based services

Authors:
Kiran Nagaraja; Xiaoyan Li; Ricardo Bianchini; Richard P. Martin; Thu D. Nguyen
Affiliations:
Department of Computer Science, Rutgers University, Piscataway, NJ (all authors)
Venue:
USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
Year:
2003

Abstract

We propose a two-phase methodology for quantifying the performability (performance and availability) of cluster-based Internet services. In the first phase, evaluators use a fault-injection infrastructure to measure the impact of faults on the server's performance. In the second phase, evaluators use an analytical model to combine an expected fault load with measurements from the first phase to assess the server's performability. Using this model, evaluators can study the server's sensitivity to different design decisions, fault rates, and environmental factors. To demonstrate our methodology, we study the performability of four versions of the PRESS Web server against five classes of faults, quantifying the effects of different design decisions on performance and availability. Finally, to further show the utility of our model, we quantify the impact of two hypothetical changes: reduced human operator response time and the use of RAID.
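The second phase described above can be illustrated with a minimal sketch. It is not the paper's model, which the abstract does not specify; it is a simplified stand-in that assumes at most one fault is active at a time, that each fault class is characterized by a mean time to failure and a mean time to repair, and that a single degraded throughput figure per class was measured during fault injection. The names `FaultClass`, `expected_throughput`, and `availability`, and all numbers in the usage note, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FaultClass:
    """One class of fault (hypothetical structure, not from the paper)."""
    name: str
    mttf_s: float         # assumed mean time to failure, in seconds (expected fault load)
    mttr_s: float         # assumed mean time the fault stays active, in seconds
    degraded_tput: float  # throughput measured while the fault is active (phase 1)


def fault_fraction(fault: FaultClass) -> float:
    """Fraction of time the fault is active, assuming alternating up/down periods."""
    return fault.mttr_s / (fault.mttf_s + fault.mttr_s)


def expected_throughput(normal_tput: float, faults: List[FaultClass]) -> float:
    """Time-weighted average throughput, assuming faults never overlap."""
    fractions = [fault_fraction(f) for f in faults]
    fault_free = 1.0 - sum(fractions)
    if fault_free < 0.0:
        raise ValueError("fault fractions exceed 100% of the time; "
                         "the non-overlap assumption does not hold")
    degraded = sum(frac * f.degraded_tput for frac, f in zip(fractions, faults))
    return fault_free * normal_tput + degraded


def availability(normal_tput: float, faults: List[FaultClass]) -> float:
    """Average throughput relative to fault-free throughput (a simple proxy)."""
    return expected_throughput(normal_tput, faults) / normal_tput
```

As a usage illustration with made-up numbers: a server that handles 1000 requests/s when fault-free, and one fault class with `mttf_s=900`, `mttr_s=100`, and `degraded_tput=500`, spends 10% of the time degraded under these assumptions. Changing `mttr_s` or `degraded_tput` and recomputing is one way to explore sensitivity to a hypothetical change, such as a faster operator response.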