Quantifying and Improving the Availability of High-Performance Cluster-Based Internet Services

Authors:
Kiran Nagaraja;Neeraj Krishnan;Ricardo Bianchini;Richard P. Martin;Thu D. Nguyen
Affiliations:
Rutgers University;Rutgers University;Rutgers University;Rutgers University;Rutgers University
Venue:
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Year:
2003

Abstract

Cluster-based servers can substantially increase performance when nodes cooperate to globally manage resources. However, in this paper we show that, in the absence of high-availability mechanisms, cooperation results in a substantial loss of availability. Specifically, we show that a sophisticated cluster-based Web server, which gains a factor of 3 in performance through cooperation, increases service unavailability by a factor of 10 over a non-cooperative version. We then show how to augment this Web server with software components embodying a small set of high-availability techniques to regain the lost availability. Among other interesting observations, we show that the application of multiple high-availability techniques, each implemented independently in its own subsystem, can lead to inconsistent recovery actions. We also show that a novel technique called Fault Model Enforcement can be used to resolve such inconsistencies. Augmenting the server with these techniques led to a final expected availability of close to 99.99%.
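
To put the availability figures in the abstract into more concrete terms, the following minimal sketch converts an availability fraction into expected downtime per year and shows what a tenfold increase in unavailability does to it. The function names and the 99.9% baseline are hypothetical and chosen only for illustration; the abstract reports only the factor of 10 and the final figure of close to 99.99%, not the baseline availability of either server version.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def unavailability(availability: float) -> float:
    """Fraction of time the service is down."""
    return 1.0 - availability


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year implied by an availability fraction."""
    return unavailability(availability) * MINUTES_PER_YEAR


def availability_after_scaling(availability: float, factor: float) -> float:
    """Availability after unavailability is multiplied by `factor`."""
    return 1.0 - unavailability(availability) * factor


if __name__ == "__main__":
    # Final expected availability reported in the abstract (close to 99.99%).
    final = 0.9999
    print(f"99.99% available: {downtime_minutes_per_year(final):.1f} min/year down")

    # Hypothetical baseline, NOT taken from the paper: a 99.9% available
    # non-cooperative server whose unavailability grows by a factor of 10.
    hypothetical_baseline = 0.999
    degraded = availability_after_scaling(hypothetical_baseline, 10)
    print(f"baseline {hypothetical_baseline:.3%} -> degraded {degraded:.3%}")
```

Under these assumptions, 99.99% availability corresponds to roughly 52.6 minutes of downtime per year, and a tenfold increase in unavailability costs one full "nine" of availability.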