The techniques used to build highly available computer systems are sketched. Historical background is provided, and terminology is defined. Empirical experience with computer failure is briefly discussed. Device improvements that have greatly increased the reliability of digital electronics are identified. Fault-tolerant design concepts and approaches to fault-tolerant hardware are outlined. The role of repair and maintenance and of design-fault tolerance is discussed. Software repair is considered. The use of pairs of computer systems at separate locations to guard against unscheduled outages due to outside sources (communication or power failures, earthquakes, etc.) is addressed.