Highly available systems for database applications
ACM Computing Surveys (CSUR)
Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
Recovery Techniques for Database Systems
ACM Computing Surveys (CSUR)
The Recovery Manager of the System R Database Manager
ACM Computing Surveys (CSUR)
Guardians and Actions: Linguistic Support for Robust, Distributed Programs
ACM Transactions on Programming Languages and Systems (TOPLAS)
Byzantine generals in action: implementing fail-stop processors
ACM Transactions on Computer Systems (TOCS)
SOSP '81 Proceedings of the eighth ACM symposium on Operating systems principles
A message system supporting fault tolerance
SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
Publishing: a reliable broadcast communication mechanism
SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
Phoenix: a safe in-memory file system
Communications of the ACM
Fault-tolerant computing based on Mach
ACM SIGOPS Operating Systems Review
Understanding fault-tolerant distributed systems
Communications of the ACM
An implementation for small databases with high availability
ACM SIGOPS Operating Systems Review
Transparent optimistic rollback recovery
ACM SIGOPS Operating Systems Review
Some ideas on support for fault tolerance in COMANDOS, an object oriented distributed system
ACM SIGOPS Operating Systems Review
Stable transactional memories and fault tolerant architectures
ACM SIGOPS Operating Systems Review
Restoring consistent global states of distributed computations
PADD '91 Proceedings of the 1991 ACM/ONR workshop on Parallel and distributed debugging
An annotated bibliography of dependable distributed computing
ACM SIGOPS Operating Systems Review
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems.
IEEE Transactions on Parallel and Distributed Systems
A highly available scalable ITV system
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
On the relevance of communication costs of rollback-recovery protocols
Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing
Progressive Retry for Software Failure Recovery in Message-Passing Applications
IEEE Transactions on Computers
Efficient transparent application recovery in client-server information systems
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Persistent messages in local transactions
PODC '98 Proceedings of the seventeenth annual ACM symposium on Principles of distributed computing
Support for Software Interrupts in Log-Based Rollback-Recovery
IEEE Transactions on Computers
Fast cluster failover using virtual memory-mapped communication
ICS '99 Proceedings of the 13th international conference on Supercomputing
Reconfiguration Models and Algorithms for Stateful Interactive Processes
IEEE Transactions on Software Engineering
Replicated condition monitoring
Proceedings of the twentieth annual ACM symposium on Principles of distributed computing
Transparent optimistic rollback recovery
EW 4 Proceedings of the 4th workshop on ACM SIGOPS European workshop
Some ideas on support for fault tolerance in COMANDOS, an object oriented distributed system
EW 4 Proceedings of the 4th workshop on ACM SIGOPS European workshop
Stable transactional memories and fault tolerant architectures
EW 4 Proceedings of the 4th workshop on ACM SIGOPS European workshop
ACM Transactions on Computer Systems (TOCS)
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
ACM Transactions on Computer Systems (TOCS)
ickp: A Consistent Checkpointer for Multicomputers
IEEE Parallel & Distributed Technology: Systems & Technology
Operation Shipping for Mobile File Systems
IEEE Transactions on Computers
Efficient Rollback-Recovery Technique in Distributed Computing Systems
IEEE Transactions on Parallel and Distributed Systems
Performance Evaluation of Fault Tolerance for Parallel Applications in Networked Environments
ICPP '97 Proceedings of the international Conference on Parallel Processing
Fault tolerant matrix operations using checksum and reverse computation
FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Supporting nondeterministic execution in fault-tolerant systems
FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Fault Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing
HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Algorithm-Based Diskless Checkpointing for Fault-Tolerant Matrix Operations
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Completely Asynchronous Optimistic Recovery with Minimal Rollbacks
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Why Optimistic Message Logging Has Not Been Used in Telecommunications Systems
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Fault Tolerance for Off-the-Shelf Applications and Hardware
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
An algorithm for Supporting Fault Tolerant Objects in Distributed Object-Oriented Operating Systems
IWOOOS '95 Proceedings of the 4th International Workshop on Object-Orientation in Operating Systems
Distributed recovery with K-optimistic logging
Journal of Parallel and Distributed Computing
Improving Logging and Recovery Performance in Phoenix/App
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Recovery guarantees for Internet applications
ACM Transactions on Internet Technology (TOIT)
Finding and preventing run-time error handling mistakes
OOPSLA '04 Proceedings of the 19th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Rx: treating bugs as allergies---a safe method to survive software failures
Proceedings of the twentieth ACM symposium on Operating systems principles
Active Replication of Multithreaded Applications
IEEE Transactions on Parallel and Distributed Systems
Rewind, repair, replay: three R's to dependability
EW 10 Proceedings of the 10th workshop on ACM SIGOPS European workshop
Finding a suitable checkpoint and recovery protocol for a distributed application
Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
ACM Transactions on Computer Systems (TOCS)
Undo for operators: building an undoable e-mail store
ATEC '03 Proceedings of the annual conference on USENIX Annual Technical Conference
Flashback: a lightweight extension for rollback and deterministic replay for software debugging
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Treating bugs as allergies: a safe method for surviving software failures
HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Exploring failure transparency and the limits of generic recovery
OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Automated response using system-call delays
SSYM'00 Proceedings of the 9th conference on USENIX Security Symposium - Volume 9
Operation-based update propagation in a mobile file system
ATEC '99 Proceedings of the annual conference on USENIX Annual Technical Conference
Sweeper: a lightweight end-to-end system for defending against fast worms
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Rx: Treating bugs as allergies—a safe method to survive software failures
ACM Transactions on Computer Systems (TOCS)
Exceptional situations and program reliability
ACM Transactions on Programming Languages and Systems (TOPLAS)
Proceedings of the 6th ACM conference on Computing frontiers
Unstoppable stateful PHP web services
WISE'06 Proceedings of the 7th international conference on Web Information Systems
Research: Designing a system infrastructure for distributed programs
Computer Communications
Ensuring reliability in B2B services: Fault tolerant inter-organizational workflows
Information Systems Frontiers
Hi-index | 0.04 |
The initial design for a distributed, fault-tolerant version of UNIX based on three-way atomic message transmission was presented in an earlier paper [3]. The implementation effort then moved from Auragen Systems1 to Nixdorf Computer where it was completed. This paper describes the working system, now known as the TARGON/32.The original design left open questions in at least two areas: fault tolerance for server processes and recovery after a crash were briefly and inaccurately sketched, rebackup after recovery was not discussed at all. The fundamental design involving three-way message transmission has remained unchanged. However, in addition to important changes in the implementation, server backup has been redesigned and is now more consistent with that of normal user processes. Recovery and rebackup have been completed in a less centralized and thus more efficient manner than previously envisioned.In this paper we review important aspects of the original design and note how the implementation differs from our original ideas. We then focus on the backup and recovery for server processes and the changes and additions in the design and implementation of recovery and rebackup.