We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints and a rollback-recovery algorithm to restore the system to a consistent state. Unlike previous algorithms, both tolerate failures that occur during their own execution. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a minimal number of additional processes are forced to roll back with it. Our algorithms require each process to store at most two checkpoints in stable storage. This storage requirement is shown to be minimal under general assumptions.
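To make the idea of forcing only the necessary processes to checkpoint concrete, here is a minimal sketch. It is not the algorithm from the paper. It assumes a hypothetical bookkeeping map, `received_from`, that records for each process the senders whose messages it has received since its last checkpoint. Under that assumption, a process that checkpoints forces those senders to checkpoint as well, so that every recorded receive has a recorded send, and the requirement propagates transitively. The process names and the function `forced_checkpoint_set` are illustrative only.

```python
from collections import deque


def forced_checkpoint_set(initiator, received_from):
    """Return the processes that checkpoint together with `initiator`.

    received_from[p] is the set of processes whose messages p has received
    since p's last checkpoint (an assumed data structure for this sketch).
    """
    forced = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        # Each sender p heard from must also checkpoint, so that the
        # sends matching p's recorded receives are recorded too.
        for q in received_from.get(p, set()):
            if q not in forced:
                forced.add(q)
                queue.append(q)
    return forced


# Hypothetical dependency snapshot: P1 received from P2, P2 from P3,
# and P4 received from P1.
deps = {"P1": {"P2"}, "P2": {"P3"}, "P3": set(), "P4": {"P1"}}
print(sorted(forced_checkpoint_set("P1", deps)))
```

In this sketch P4 is not forced when P1 initiates, because P1's checkpoint records the send while P4's earlier checkpoint simply does not record the receive. A real protocol may prune the set further, and a rollback would follow the dependencies in the opposite direction, from a sender to the processes that received its undone messages.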