Quantifying the Performability of Cluster-Based Services

Authors:
Kiran Nagaraja;Gustavo Gama;Ricardo Bianchini;Richard P. Martin;Wagner Meira Jr.;Thu D. Nguyen
Affiliations:
IEEE;-;IEEE;IEEE;-;IEEE Computer Society
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2005

Abstract

In this paper, we propose a two-phase methodology for systematically evaluating the performability (performance and availability) of cluster-based Internet services. In the first phase, evaluators use a fault-injection infrastructure to characterize the service's behavior in the presence of faults. In the second phase, evaluators use an analytical model to combine an expected fault load with measurements from the first phase to assess the service's performability. Using this model, evaluators can study the service's sensitivity to different design decisions, fault rates, and other environmental factors. To demonstrate our methodology, we study the performability of a multitier Internet service. In particular, we evaluate the performance and availability of three soft state maintenance strategies for an online bookstore service in the presence of seven classes of faults. Among other interesting results, we clearly isolate the effect of different faults, showing that the tier of Web servers is responsible for an often dominant fraction of the service unavailability. Our results also demonstrate that storing the soft state in a database achieves better performability than storing it in main memory (even when the state is efficiently replicated) when we weight performance and availability equally. Based on our results, we conclude that service designers may want an unbalanced system in which they heavily load highly available components and leave more spare capacity for components that are likely to fail more often.
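
The second phase described above, combining an expected fault load with per-fault measurements, can be illustrated with a minimal sketch. It is not the model from the paper: it assumes faults are independent and never overlap, that each fault class is summarized by a mean time to failure, a mean degraded duration, and one throughput measured under fault injection, and it takes availability as the ratio of average to fault-free throughput. The names `FaultClass`, `average_throughput`, `average_availability`, and `unavailability_by_class` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FaultClass:
    """One class of fault in the expected fault load (all fields are assumptions)."""
    name: str
    mttf_hours: float           # mean time from recovery to the next fault of this class
    degraded_hours: float       # mean time the service stays degraded per fault
    degraded_throughput: float  # throughput measured while the fault is active (req/s)


def degraded_fraction(fault: FaultClass) -> float:
    """Long-run fraction of time the service spends degraded by this fault class."""
    return fault.degraded_hours / (fault.mttf_hours + fault.degraded_hours)


def average_throughput(normal_throughput: float, faults: List[FaultClass]) -> float:
    """Time-weighted throughput, assuming fault classes never overlap."""
    weights = [degraded_fraction(f) for f in faults]
    normal_fraction = 1.0 - sum(weights)
    if normal_fraction < 0.0:
        raise ValueError("fault load exceeds 100% of time; non-overlap assumption broken")
    degraded = sum(w * f.degraded_throughput for w, f in zip(weights, faults))
    return normal_fraction * normal_throughput + degraded


def average_availability(normal_throughput: float, faults: List[FaultClass]) -> float:
    """Availability taken as average throughput relative to fault-free throughput."""
    return average_throughput(normal_throughput, faults) / normal_throughput


def unavailability_by_class(normal_throughput: float,
                            faults: List[FaultClass]) -> Dict[str, float]:
    """Share of lost throughput attributable to each fault class."""
    return {
        f.name: degraded_fraction(f) * (1.0 - f.degraded_throughput / normal_throughput)
        for f in faults
    }
```

Under these assumptions, changing a fault class's `mttf_hours` or `degraded_throughput` and recomputing gives a rough sensitivity study, and `unavailability_by_class` isolates how much each fault class contributes to the total unavailability.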