Message-logging protocols are an integral part of a popular technique for implementing processes that can recover from crash failures. All message-logging protocols require that, when recovery is complete, there be no orphan processes, which are surviving processes whose states are inconsistent with the recovered state of a crashed process. We give a precise specification of the consistency property "no orphan processes." From this specification, we describe how different existing classes of message-logging protocols (namely optimistic, pessimistic, and a class that we call causal) implement this property. We then propose a set of metrics to evaluate the performance of message-logging protocols, and characterize the protocols that are optimal with respect to these metrics. Finally, starting from a protocol that relies on causal delivery order, we show how to derive optimal causal protocols that tolerate f overlapping failures and recoveries for a parameter f, where 1 ≤ f ≤ n.
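
The orphan condition described in the abstract can be illustrated with a small sketch. This is one possible reading, not the paper's own specification: it assumes each non-deterministic event (for example, a message delivery) carries a set of processes whose state depends on it and a set of processes holding a copy of the information needed to replay it. The names `Event`, `depend`, `log`, `stable`, and `find_orphans` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """A non-deterministic event, e.g. the delivery of a message (illustrative model)."""
    name: str
    depend: set = field(default_factory=set)  # processes whose state depends on the event
    log: set = field(default_factory=set)     # processes holding a volatile copy of its replay data
    stable: bool = False                      # True if the replay data is on stable storage


def find_orphans(events, crashed):
    """Return surviving processes whose state depends on an event that can no longer be replayed.

    Assumption: an event is lost when it is not on stable storage and every
    process that logged it has crashed.
    """
    orphans = set()
    for e in events:
        lost = not e.stable and e.log <= crashed
        if lost:
            # Survivors that depend on a lost event are inconsistent with any recovered state.
            orphans |= e.depend - crashed
    return orphans


events = [
    Event("deliver m1 at p1", depend={"p1", "p2"}, log={"p1"}),
    Event("deliver m2 at p2", depend={"p2", "p3"}, log={"p2", "p3"}),
]

print(find_orphans(events, crashed={"p1"}))        # {'p2'}
print(find_orphans(events, crashed={"p1", "p2"}))  # set()
```

In the first call, only the crashed process p1 had logged the delivery of m1, so the surviving process p2 that depends on it is an orphan. In the second call p2 has crashed as well, and the delivery of m2 is still logged at the survivor p3, so no surviving process is an orphan. Under this model, a protocol guarantees "no orphan processes" if `find_orphans` returns the empty set for every crash set it is meant to tolerate.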