Journal of Systems and Software
Cluster-based servers can substantially increase performance when nodes cooperate to manage resources globally. In this paper, however, we show that, in the absence of high-availability mechanisms, cooperation results in a substantial loss of availability. Specifically, we show that a sophisticated cluster-based Web server, which gains a factor of 3 in performance through cooperation, suffers a factor-of-10 increase in service unavailability relative to a non-cooperative version. We then show how to augment this Web server with software components embodying a small set of high-availability techniques to regain the lost availability. Among other observations, we show that applying multiple high-availability techniques, each implemented independently in its own subsystem, can lead to inconsistent recovery actions. We also show that a novel technique called Fault Model Enforcement can be used to resolve such inconsistencies. Augmenting the server with these techniques led to a final expected availability of close to 99.99%.
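To put the availability figures in perspective, the following minimal sketch converts an availability level into expected downtime per year and shows what a tenfold change in unavailability means numerically. The function names are hypothetical, and the 99.99% starting point is used purely as an illustration: the abstract does not state the availability of the non-cooperative or unaugmented servers, so the scaled value below is not a result from the paper.

```python
# Illustrative arithmetic only; names and the choice of inputs are assumptions.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def unavailability(availability: float) -> float:
    """Fraction of time the service is expected to be unavailable."""
    return 1.0 - availability


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year, in minutes, at a given availability."""
    return unavailability(availability) * MINUTES_PER_YEAR


def scale_unavailability(availability: float, factor: float) -> float:
    """Availability that results when unavailability is multiplied by `factor`."""
    return 1.0 - unavailability(availability) * factor


if __name__ == "__main__":
    target = 0.9999  # the "close to 99.99%" figure quoted in the abstract
    degraded = scale_unavailability(target, 10)  # hypothetical 10x unavailability
    for label, a in (("target", target), ("10x unavailability", degraded)):
        print(f"{label}: availability={a:.4%}, "
              f"downtime={downtime_minutes_per_year(a):.1f} min/year")
```

Under these assumptions, 99.99% availability corresponds to a little under an hour of downtime per year, and a tenfold increase in unavailability from that level would correspond to roughly 99.9% availability.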