Performability Analysis: Measures, an Algorithm, and a Case Study
IEEE Transactions on Computers - Fault-Tolerant Computing
Cluster-based scalable network services
Proceedings of the sixteenth ACM symposium on Operating systems principles
Analysis of Preventive Maintenance in Transactions Based Software Systems
IEEE Transactions on Computers
Proceedings of the seventeenth ACM symposium on Operating systems principles
The Ninja architecture for robust Internet-scale systems and services373423
Computer Networks: The International Journal of Computer and Telecommunications Networking - pervasive computing
Efficiency vs. portability in cluster-based network servers
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Improving cluster availability using workstation validation
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Transaction Processing: Concepts and Techniques
Transaction Processing: Concepts and Techniques
Lessons from Giant-Scale Services
IEEE Internet Computing
An approach towards benchmarking of fault-tolerant commercial systems
FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Evaluating the Impact of Communication Architecture on the Performability of Cluster-Based Services
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Failure Data Analysis of a LAN of Windows NT Based Computers
SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Performability evaluation: where it is and what lies ahead
IPDS '95 Proceedings of the International Computer Performance and Dependability Symposium on Computer Performance and Dependability Symposium
Reducing the Cost of System Administration of a Disk Storage System
Reducing the Cost of System Administration of a Disk Storage System
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
Integrated resource management for cluster-based internet services
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Scalable, distributed data structures for internet service construction
OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Using fault injection and modeling to evaluate the performability of cluster-based services
USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
Conflict-aware scheduling for dynamic content applications
USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
Towards availability benchmarks: a case study of software raid systems
ATEC '00 Proceedings of the annual conference on USENIX Annual Technical Conference
Scalable content-aware request distribution in cluster-based networks servers
ATEC '00 Proceedings of the annual conference on USENIX Annual Technical Conference
Human-aware computer system design
HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Automatic configuration of internet services
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
IEEE Transactions on Parallel and Distributed Systems
Heartbeat based fault diagnosis for mobile ad-hoc network
ACST'07 Proceedings of the third conference on IASTED International Conference: Advances in Computer Science and Technology
Hi-index | 0.00 |
In this paper, we propose a two-phase methodology for systematically evaluating the performability (performance and availability) of cluster-based Internet services. In the first phase, evaluators use a fault-injection infrastructure to characterize the service's behavior in the presence of faults. In the second phase, evaluators use an analytical model to combine an expected fault load with measurements from the first phase to assess the service's performability. Using this model, evaluators can study the service's sensitivity to different design decisions, fault rates, and other environmental factors. To demonstrate our methodology, we study the performability of a multitier Internet service. In particular, we evaluate the performance and availability of three soft state maintenance strategies for an online bookstore service in the presence of seven classes of faults. Among other interesting results, we clearly isolate the effect of different faults, showing that the tier of Web servers is responsible for an often dominant fraction of the service unavailability. Our results also demonstrate that storing the soft state in a database achieves better performability than storing it in main memory (even when the state is efficiently replicated) when we weight performance and availability equally. Based on our results, we conclude that service designers may want an unbalanced system in which they heavily load highly available components and leave more spare capacity for components that are likely to fail more often.