Checkpointing and Rollback-Recovery for Distributed Systems

Authors:
Richard Koo;Sam Toueg
Affiliations:
-;Cornell Univ., Ithaca, NY
Venue:
IEEE Transactions on Software Engineering - Special issue on distributed systems
Year:
1987

Citing 0
Cited 172

Computing Optimal Checkpointing Strategies for Rollback and Recovery Systems

IEEE Transactions on Computers - Fault-Tolerant Computing
Concurrent common knowledge: a new definition of agreement for asynchronous systems

PODC '88 Proceedings of the seventh annual ACM Symposium on Principles of distributed computing
Efficient distributed recovery using message logging

Proceedings of the eighth annual ACM Symposium on Principles of distributed computing
Sufficient Condition for a Communication Deadlock and Distributed Deadlock Detection

IEEE Transactions on Software Engineering
Modeling of Hierarchical Distributed Systems with Fault-Tolerance

IEEE Transactions on Software Engineering
Fault-tolerant computing based on Mach

ACM SIGOPS Operating Systems Review
The inhibition spectrum and the achievement of causal consistency

PODC '90 Proceedings of the ninth annual ACM symposium on Principles of distributed computing
Shortest paths and loop-free routing in dynamic networks

SIGCOMM '90 Proceedings of the ACM symposium on Communications architectures & protocols
Understanding fault-tolerant distributed systems

Communications of the ACM
Transparent optimistic rollback recovery

ACM SIGOPS Operating Systems Review
Restoring consistent global states of distributed computations

PADD '91 Proceedings of the 1991 ACM/ONR workshop on Parallel and distributed debugging
Adapting to asynchronous dynamic networks (extended abstract)

STOC '92 Proceedings of the twenty-fourth annual ACM symposium on Theory of computing
An abstract model of rollback recovery control in distributed systems

ACM SIGOPS Operating Systems Review
A checkpointing recovery approach in a distributed system on the CSMA/CD network

SAC '92 Proceedings of the 1992 ACM/SIGAPP Symposium on Applied computing: technological challenges of the 1990's
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit

IEEE Transactions on Computers - Special issue on fault-tolerant computing
Use of Common Time Base for Checkpointing and Rollback Recovery in a Distributed System

IEEE Transactions on Software Engineering
A checkpoint protocol for an entry consistent shared memory system

PODC '94 Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing
Supporting Fault-Tolerant Parallel Programming in Linda

IEEE Transactions on Parallel and Distributed Systems
Necessary and Sufficient Conditions for Consistent Global Snapshots

IEEE Transactions on Parallel and Distributed Systems
Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems

IEEE Transactions on Parallel and Distributed Systems
On distributed object checkpointing and recovery

Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing
A Dynamic Coherence Protocol for Distributed Shared Memory Enforcing High Data Availability at Low Costs

IEEE Transactions on Parallel and Distributed Systems
Automatic incremental state saving

PADS '96 Proceedings of the tenth workshop on Parallel and distributed simulation
Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems

IEEE Transactions on Parallel and Distributed Systems
An Architecture for Tolerating Processor Failures in Shared-Memory Multiprocessors

IEEE Transactions on Computers
Optimistic Crash Recovery without Changing Application Messages

IEEE Transactions on Parallel and Distributed Systems
A Survey of Distributed Database Checkpointing

Distributed and Parallel Databases
A Survey of Recoverable Distributed Shared Virtual Memory Systems

IEEE Transactions on Parallel and Distributed Systems
Persistent messages in local transactions

PODC '98 Proceedings of the seventeenth annual ACM symposium on Principles of distributed computing
Damage Assessment for Optimal Rollback Recovery

IEEE Transactions on Computers
Theoretical Analysis for Communication-Induced Checkpointing Protocols with Rollback-Dependency Trackability

IEEE Transactions on Parallel and Distributed Systems
On Coordinated Checkpointing in Distributed Systems

IEEE Transactions on Parallel and Distributed Systems
An Index-Based Checkpointing Algorithm for Autonomous Distributed Systems

IEEE Transactions on Parallel and Distributed Systems
Fast cluster failover using virtual memory-mapped communication

ICS '99 Proceedings of the 13th international conference on Supercomputing
Staggered Consistent Checkpointing

IEEE Transactions on Parallel and Distributed Systems
Quasi-Synchronous Checkpointing: Models, Characterization, and Classification

IEEE Transactions on Parallel and Distributed Systems
Communication-Induced Determination of Consistent Snapshots

IEEE Transactions on Parallel and Distributed Systems
Checkpointing and rollback-recovery for distributed systems

ACM '86 Proceedings of 1986 ACM Fall joint computer conference
Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations

The Journal of Supercomputing
Mutable Checkpoints: A New Checkpointing Approach for Mobile Computing Systems

IEEE Transactions on Parallel and Distributed Systems
Transparent optimistic rollback recovery

EW 4 Proceedings of the 4th workshop on ACM SIGOPS European workshop
Fault-tolerant parallel computing

EW 4 Proceedings of the 4th workshop on ACM SIGOPS European workshop
A Roll-Forward Recovery Scheme for Solving the Problem of Coasting Forward for Distributed Systems

ACM SIGOPS Operating Systems Review
A survey of rollback-recovery protocols in message-passing systems

ACM Computing Surveys (CSUR)
Efficient Garbage Collection Schemes for Causal Message Logging with Independent Checkpointing

The Journal of Supercomputing
A sliding-agent-group communication model for constructing a robust roaming environment over internet

Mobile Networks and Applications
Complete Process Recovery: Using Vector Time to Handle Multiple Failures in Distributed Systems

IEEE Parallel & Distributed Technology: Systems & Technology
Overview of multidatabase transaction management

The VLDB Journal — The International Journal on Very Large Data Bases
Speed Log: A Generic Log Service Supporting Efficient Node-Crash Recovery

IEEE Micro
Replica Management for Fault-Tolerant Systems

IEEE Micro
Nest: A Nested-Predicate Scheme for Fault Tolerance

IEEE Transactions on Computers
An Adaptive Checkpointing Scheme for Distributed Databases with Mixed Types of Transactions

IEEE Transactions on Knowledge and Data Engineering
Error Recovery in Shared Memory Multiprocessors Using Private Caches

IEEE Transactions on Parallel and Distributed Systems
Rollback Recovery in Distributed Systems Using Loosely Synchronized Clocks

IEEE Transactions on Parallel and Distributed Systems
Checkpointing for Distributed Databases: Starting from the Basics

IEEE Transactions on Parallel and Distributed Systems
An Efficient Protocol for Checkpointing Recovery in Distributed Systems

IEEE Transactions on Parallel and Distributed Systems
Low-Latency, Concurrent Checkpointing for Parallel Programs

IEEE Transactions on Parallel and Distributed Systems
Efficient Rollback-Recovery Technique in Distributed Computing Systems

IEEE Transactions on Parallel and Distributed Systems
Finding Consistent Global Checkpoints in a Distributed Computation

IEEE Transactions on Parallel and Distributed Systems
Checkpointing with mutable checkpoints

Theoretical Computer Science - Dependable computing
Asynchronous recovery without using vector timestamps

Journal of Parallel and Distributed Computing
Local stabilizer

Journal of Parallel and Distributed Computing - Self-stabilizing distributed systems
Interval consistency of asynchronous distributed computations

Journal of Computer and System Sciences
Performance Evaluation of Fault Tolerance for Parallel Applications in Networked Environments

ICPP '97 Proceedings of the international Conference on Parallel Processing
Efficient Garbage Collection Schemes for Causal Message Logging with Independent Checkpointing in Message Passing Systems

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Checkpointing and Rollback of Wide-area Distributed Applications using Mobile Agents

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
QoS based Checkpoint Protocol in Multimedia Network Systems

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Guaranteed Mutually Consistent Checkpointing in Distributed Computations

ASIAN '98 Proceedings of the 4th Asian Computing Science Conference on Advances in Computing Science
An Efficient Coordinated Checkpointing Scheme Based on PWD Model

ICOIN '02 Revised Papers from the International Conference on Information Networking, Wireless Communications Technologies and Network Applications-Part II
Checkpoint-Recovery for Mobile Intelligent Networks

Proceedings of the 14th International conference on Industrial and engineering applications of artificial intelligence and expert systems: engineering of intelligent systems
A Recovery Technique Using Multi-agent in Distributed Computing Systems

COORDINATION '02 Proceedings of the 5th International Conference on Coordination Models and Languages
A Fault-Tolerant Scheme of Multi-agent System for Worker Agents

AMT '01 Proceedings of the 6th International Computer Science Conference on Active Media Technology
The Design and Use of Persistent Memory on the DNCP Hardware Fault-Tolerant Platform

DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Distributed Checkpointing Mechanism for a Parallel File System

Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
QoS-Based Checkpoint Protocol for Multimedia Network Systems

PCM '01 Proceedings of the Second IEEE Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing
Protocol for Taking Object-Based Checkpoints

DEXA '00 Proceedings of the 11th International Conference on Database and Expert Systems Applications
An Efficient Optimistic Message Logging Scheme for Recoverable Mobile Computing Systems

IEEE Transactions on Mobile Computing
Supporting fault-tolerance in heterogeneous distributed applications

HCW '97 Proceedings of the 6th Heterogeneous Computing Workshop (HCW '97)
Implementation and performance of a stable-storage service in Unix

SRDS '96 Proceedings of the 15th Symposium on Reliable Distributed Systems
Minimizing timestamp size for completely asynchronous optimistic recovery with minimal rollback

SRDS '96 Proceedings of the 15th Symposium on Reliable Distributed Systems
Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques

SRDS '96 Proceedings of the 15th Symposium on Reliable Distributed Systems
An Efficient Checkpointing Algorithm for Distributed Systems Implementing Reliable Communication Channels

SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Deadlocks in fully uncoordinated checkpointing rollback recovery systems

WORDS '97 Proceedings of the 3rd Workshop on Object-Oriented Real-Time Dependable Systems - (WORDS '97)
Object-Based Checkpoints in Distributed Systems

WORDS '97 Proceedings of the 3rd Workshop on Object-Oriented Real-Time Dependable Systems - (WORDS '97)
Checkpoint and Rollback in Asynchronous Distributed Systems

INFOCOM '97 Proceedings of the INFOCOM '97. Sixteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Driving the Information Revolution
User-Triggered Checkpointing: System-Independent and Scalable Application Recovery

ISCC '97 Proceedings of the 2nd IEEE Symposium on Computers and Communications (ISCC '97)
Micro-Checkpointing: Checkpointing for Multithreaded Applications

IOLTW '00 Proceedings of the 6th IEEE International On-Line Testing Workshop (IOLTW)
Algorithm-Based Diskless Checkpointing for Fault-Tolerant Matrix Operations

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Completely Asynchronous Optimistic Recovery with Minimal Rollbacks

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Checkpointing and Its Applications

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Fault Tolerance for Off-the-Shelf Applications and Hardware

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Software Schemes of Reconfiguration and Recovery in Distributed Memory Multicomputers Using the Actor Model

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Selective Checkpointing and Rollbacks in Multithreaded Distributed Systems

ICDCS '01 Proceedings of the 21st International Conference on Distributed Computing Systems
An algorithm for Supporting Fault Tolerant Objects in Distributed Object-Oriented Operating Systems

IWOOOS '95 Proceedings of the 4th International Workshop on Object-Orientation in Operating Systems
Checkpointing and Recovery for Distributed Shared Memory Applications

IWOOOS '95 Proceedings of the 4th International Workshop on Object-Orientation in Operating Systems
On Properties of RDT Communication-Induced Checkpointing Protocols

IEEE Transactions on Parallel and Distributed Systems
A comparative analysis of the reliability of simple and two-level checkpointing techniques in two different distributed industrial control system architectures

Systems Analysis Modelling Simulation
An efficient time-based checkpointing protocol for mobile computing systems over mobile IP

Mobile Networks and Applications - Mobile networking through IP
Overview of multidatabase transaction management

CASCON '92 Proceedings of the 1992 conference of the Centre for Advanced Studies on Collaborative research - Volume 2
On designing direct dependency-based fast recovery algorithms for distributed systems

ACM SIGOPS Operating Systems Review
Finding a Recovery Line in Uncoordinated Checkpointing

ICDCSW '04 Proceedings of the 24th International Conference on Distributed Computing Systems Workshops - W7: EC (ICDCSW'04) - Volume 7
Recovery in the Mobile Wireless Environment Using Mobile Agents

IEEE Transactions on Mobile Computing
A causal message logging protocol for mobile nodes in mobile computing systems

Future Generation Computer Systems - Special issue: Advanced services for clusters and internet computing
Concurrent checkpoint initiation and recovery algorithms on asynchronous ring networks

Journal of Parallel and Distributed Computing
Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery

IEEE Transactions on Dependable and Secure Computing
Communication-based prevention of useless checkpoints in distributed computations

Distributed Computing
Event Logging: Portable and Efficient Checkpointing in Heterogeneous Environments with Non-FIFO Communication Platforms

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 1 - Volume 02
A novel min-process checkpointing scheme for mobile computing systems

Journal of Systems Architecture: the EUROMICRO Journal
Efficient algorithms for optimistic crash recovery

Distributed Computing
Concurrent common knowledge: defining agreement for asynchronous systems

Distributed Computing
The inhibition spectrum and the achievement of causal consistency

Distributed Computing
Fault tolerance for internet agent systems: in cases of stop failure and byzantine failure

Proceedings of the fourth international joint conference on Autonomous agents and multiagent systems
An Efficient Index-Based Checkpointing Protocol with Constant-Size Control Information on Messages

IEEE Transactions on Dependable and Secure Computing
Performance analysis of different checkpointing and recovery schemes using stochastic model

Journal of Parallel and Distributed Computing
Finding a suitable checkpoint and recovery protocol for a distributed application

Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
Design, Analysis and Performance Evaluation of a New Algorithm for Developing a Fault Tolerant Distributed System

ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
Declarative failure recovery for sensor networks

Proceedings of the 6th international conference on Aspect-oriented software development
Quasi-atomic recovery for distributed agents

Parallel Computing
Exploring failure transparency and the limits of generic recovery

OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Self-stabilizing algorithm for checkpointing in a distributed system

Journal of Parallel and Distributed Computing
A synchronous checkpointing protocol for mobile distributed systems: probabilistic approach

International Journal of Information and Computer Security
A novel non-block synchronous checkpointing scheme for distributed systems

ICS'05 Proceedings of the 9th WSEAS International Conference on Systems
A low-cost hybrid coordinated checkpointing protocol for mobile distributed systems

Mobile Information Systems
A quasi-synchronous checkpointing algorithm that prevents contention for stable storage

Information Sciences: an International Journal
Lightweight log management algorithm for removing logged messages of sender processes with little overhead

WSEAS Transactions on Computers
An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems

Journal of Parallel and Distributed Computing
FINE: A Fully Informed aNd Efficient communication-induced checkpointing protocol for distributed systems

Journal of Parallel and Distributed Computing
Prompt damage identification for system survivability

International Journal of Information and Computer Security
DTR: Distributed Transaction Routing in a Large Scale Network

High Performance Computing for Computational Science - VECPAR 2008
Novel Crash Recovery Approach for Concurrent Failures in Cluster Federation

GPC '09 Proceedings of the 4th International Conference on Advances in Grid and Pervasive Computing
A novel low-overhead recovery approach for distributed systems

Journal of Computer Systems, Networks, and Communications
Database replication in large scale systems: optimizing the number of replicas

Proceedings of the 2009 EDBT/ICDT Workshops
A weighted checkpointing protocol for mobile distributed systems

International Journal of Ad Hoc and Ubiquitous Computing
A novel recovery approach for cluster federations

GPC'07 Proceedings of the 2nd international conference on Advances in grid and pervasive computing
Domino-effect free crash recovery for concurrent failures in cluster federation

GPC'08 Proceedings of the 3rd international conference on Advances in grid and pervasive computing
On-line error detection and fast recover techniques for dependable embedded processors

On-line error detection and fast recover techniques for dependable embedded processors
Overview of multidatabase transaction management

CASCON First Decade High Impact Papers
Peers-for-peers (P4P): an efficient and reliable fault-tolerance strategy for cycle-stealing P2P applications

International Journal of Communication Networks and Distributed Systems
FRASystem: fault tolerant system using agents in distributed computing systems

Cluster Computing
Theoretical and experimental evaluation of communication-induced checkpointing protocols in FE and FLazy-E families

Performance Evaluation
New & efficient low overheads algorithm for mobile distributed systems

Proceedings of the International Conference & Workshop on Emerging Trends in Technology
Rebound: scalable checkpointing for coherent shared memory

Proceedings of the 38th annual international symposium on Computer architecture
Distributed middleware reliability and fault tolerance support in system S

Proceedings of the 5th ACM international conference on Distributed event-based system
Brief announcement: a concurrent partial snapshot algorithm for large-scale and dynamic distributed systems

SSS'11 Proceedings of the 13th international conference on Stabilization, safety, and security of distributed systems
A global snapshot collection algorithm with concurrent initiators with non-FIFO channel

ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part I
A proxy based efficient checkpointing scheme for fault recovery in mobile grid system

HiPC'06 Proceedings of the 13th international conference on High Performance Computing
An efficient and scalable checkpointing and recovery algorithm for distributed systems

ICDCN'06 Proceedings of the 8th international conference on Distributed Computing and Networking
An efficient computing-checkpoint based coordinated checkpoint algorithm

EUC'06 Proceedings of the 2006 international conference on Embedded and Ubiquitous Computing
An asynchronous recovery algorithm based on a staggered quasi-synchronous checkpointing algorithm

IWDC'05 Proceedings of the 7th international conference on Distributed Computing
Garbage collection in a causal message logging protocol

HPCC'05 Proceedings of the First international conference on High Performance Computing and Communications
A communication-induced checkpointing and asynchronous recovery algorithm for multithreaded distributed systems

PDCAT'04 Proceedings of the 5th international conference on Parallel and Distributed Computing: applications and Technologies
Using computing checkpoints implement consistent low-cost non-blocking coordinated checkpointing

PDCAT'04 Proceedings of the 5th international conference on Parallel and Distributed Computing: applications and Technologies
Optimizing performance and reliability on heterogeneous parallel systems: Approximation algorithms and heuristics

Journal of Parallel and Distributed Computing
A fault-tolerant multi-agent development framework

ISPA'04 Proceedings of the Second international conference on Parallel and Distributed Processing and Applications
An efficient protocol for checkpoint-based failure recovery in distributed systems

ICDCIT'04 Proceedings of the First international conference on Distributed Computing and Internet Technology
Implementing rollback-recovery coordinated checkpoints

ISSADS'05 Proceedings of the 5th international conference on Advanced Distributed Systems
Mobile agent based fault-tolerance support for the reliable mobile computing systems

COORDINATION'05 Proceedings of the 7th international conference on Coordination Models and Languages
Energy efficient configuration for qos in reliable parallel servers

EDCC'05 Proceedings of the 5th European conference on Dependable Computing
A low-overhead non-block checkpointing algorithm for mobile computing environment

GPC'06 Proceedings of the First international conference on Advances in Grid and Pervasive Computing
An efficient algorithm for removing useless logged messages in SBML protocols

ICDCIT'05 Proceedings of the Second international conference on Distributed Computing and Internet Technology
Analysis of interval-based global state detection

ICDCIT'05 Proceedings of the Second international conference on Distributed Computing and Internet Technology
Recovery approach to the design of stabilizing communication protocols

Computer Communications
Research: Debugging tool for distributed Estelle programs

Computer Communications
Research: Modified distributed snapshots algorithm for protocol stabilization

Computer Communications
Optimal checkpointing interval of a communication system with rollback recovery

Mathematical and Computer Modelling: An International Journal
Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Achieving high job execution reliability using underutilized resources in a computational economy

Future Generation Computer Systems
Software health management with Bayesian networks

Innovations in Systems and Software Engineering
Orphan-Free Consistent Condition for Log-Based Checkpointing and Rollback Recovery Scheme

International Journal of Advanced Pervasive and Ubiquitous Computing
Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems

Scientific Programming - Selected Papers from Super Computing 2012

Quantified Score

Hi-index	0.02

Abstract

We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consistent state. In contrast to previous algorithms, ours tolerate failures that occur during their execution. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a minimal number of additional processes are forced to roll back with it. Our algorithms require each process to store at most two checkpoints in stable storage. This storage requirement is shown to be minimal under general assumptions.
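
The abstract does not spell out how the two-checkpoint bound is met. The sketch below is one way to picture it, assuming a two-phase scheme in which a process first writes a tentative checkpoint and later either promotes it to permanent or discards it. The class `StableStorage` and its method names are hypothetical, chosen for illustration, and the coordination messages between processes are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StableStorage:
    """Hypothetical per-process store holding at most two checkpoints."""
    permanent: Optional[dict] = None  # last committed checkpoint
    tentative: Optional[dict] = None  # checkpoint awaiting a global decision

    def take_tentative(self, state: dict) -> None:
        # Save a copy so later changes to the live state do not alter it.
        self.tentative = dict(state)

    def commit(self) -> None:
        # Promote the tentative checkpoint; the old permanent one is dropped.
        if self.tentative is not None:
            self.permanent = self.tentative
            self.tentative = None

    def abort(self) -> None:
        # Discard the tentative checkpoint; the permanent one stays valid.
        self.tentative = None

    def count(self) -> int:
        # Number of checkpoints currently held (never more than two).
        return sum(c is not None for c in (self.permanent, self.tentative))


# Usage: a committed checkpoint survives an aborted later attempt.
store = StableStorage()
store.take_tentative({"x": 1})
store.commit()
store.take_tentative({"x": 2})   # two checkpoints held at this point
store.abort()                    # rollback target is still {"x": 1}
```

Under this assumption, a failure during checkpointing leaves the permanent checkpoint untouched, so recovery always has a committed state to return to.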