Optimistic recovery in distributed systems

Authors:
Rob Strom;Shaula Yemini
Affiliations:
IBM Thomas J. Watson Research Center, Yorktown Heights, NY;IBM Thomas J. Watson Research Center, Yorktown Heights, NY
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
1985

Abstract

Optimistic Recovery is a new technique supporting application-independent transparent recovery from processor failures in distributed systems. In optimistic recovery, communication, computation, and checkpointing proceed asynchronously. Synchronization is replaced by causal dependency tracking, which enables a posteriori reconstruction of a consistent distributed system state following a failure using process rollback and message replay. Because there is no synchronization among computation, communication, and checkpointing, optimistic recovery can tolerate the failure of an arbitrary number of processors and yields better throughput and response time than other general recovery techniques whenever failures are infrequent.
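
The idea of causal dependency tracking can be illustrated with a small sketch. It is not the authors' algorithm: it assumes a simplified model in which every received message starts a new state interval, messages carry the sender's dependency vector, and logging to stable storage lags behind computation. The names Process, flush_log, and rollback_point are hypothetical, and incarnation numbers and message replay are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process model: one state interval per received message."""
    name: str
    interval: int = 0                             # index of the current state interval
    deps: dict = field(default_factory=dict)      # process name -> latest interval depended on
    history: list = field(default_factory=list)   # dependency vector of each interval, by index
    logged: int = 0                               # highest interval whose input reached stable storage

    def __post_init__(self):
        self.deps[self.name] = 0
        self.history.append(dict(self.deps))

    def send(self):
        """A message carries a copy of the sender's dependency vector."""
        return dict(self.deps)

    def receive(self, msg_deps):
        """Start a new state interval and merge the incoming dependencies."""
        self.interval += 1
        for proc, idx in msg_deps.items():
            self.deps[proc] = max(self.deps.get(proc, 0), idx)
        self.deps[self.name] = self.interval
        self.history.append(dict(self.deps))

    def flush_log(self):
        """Asynchronous logging catches up with computation (assumed to be instantaneous here)."""
        self.logged = self.interval


def rollback_point(proc, failed_name, recoverable):
    """Latest interval of proc that does not depend on a lost interval of the failed process."""
    for idx in range(proc.interval, -1, -1):
        if proc.history[idx].get(failed_name, 0) <= recoverable:
            return idx
    return 0


# Scenario: A's first interval is logged, its second is not, and then A fails.
a, b, c = Process("A"), Process("B"), Process("C")
a.receive({})           # external input starts interval 1 of A
c.receive(a.send())     # C now depends on interval 1 of A
a.flush_log()           # interval 1 of A is recoverable by replay
a.receive({})           # interval 2 of A, not yet logged
b.receive(a.send())     # B now depends on interval 2 of A

b_target = rollback_point(b, "A", a.logged)
c_target = rollback_point(c, "A", a.logged)
```

In this scenario B depends on an interval of A that cannot be replayed, so B has to return to its initial interval, while C depends only on logged state and keeps its current interval. No process waited for another during normal operation; the dependency vectors alone were enough to decide afterwards who must roll back.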