High-Availability Computer Systems

Authors:
Jim Gray; Daniel P. Siewiorek
Venue:
Computer
Year:
1991

Abstract

The techniques used to build highly available computer systems are sketched. Historical background is provided, and terminology is defined. Empirical experience with computer failure is briefly discussed. Device improvements that have greatly increased the reliability of digital electronics are identified. Fault-tolerant design concepts and approaches to fault-tolerant hardware are outlined. The role of repair and maintenance and of design-fault tolerance is discussed. Software repair is considered. The use of pairs of computer systems at separate locations to guard against unscheduled outages due to outside sources (communication or power failures, earthquakes, etc.) is addressed.