Optimistic Recovery is a new technique supporting application-independent transparent recovery from processor failures in distributed systems. In optimistic recovery, communication, computation, and checkpointing proceed asynchronously. Synchronization is replaced by causal dependency tracking, which enables a posteriori reconstruction of a consistent distributed system state following a failure, using process rollback and message replay.

Because there is no synchronization among computation, communication, and checkpointing, optimistic recovery can tolerate the failure of an arbitrary number of processors and yields better throughput and response time than other general recovery techniques whenever failures are infrequent.
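To make the idea of causal dependency tracking concrete, the following is a minimal toy sketch in Python. It is not the algorithm from the paper: the `Process` class, its methods, and the three-process scenario are all hypothetical, and message replay and real checkpointing are left out. It assumes that each message receipt starts a new "state interval", that every message piggybacks the sender's dependency vector, and that after a failure only a prefix of the failed process's intervals is recoverable from stable storage. Any process whose state depends on a lost interval is rolled back until it no longer does.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy process that tracks causal dependencies (illustrative only)."""
    name: str
    interval: int = 0                            # index of the current state interval
    deps: dict = field(default_factory=dict)     # process name -> highest interval depended on
    history: list = field(default_factory=list)  # history[i] = dependency vector of interval i

    def __post_init__(self):
        self.deps[self.name] = 0
        self.history.append(dict(self.deps))

    def send(self):
        # A message carries a copy of the sender's current dependency vector.
        return dict(self.deps)

    def receive(self, msg_deps):
        # Each receipt starts a new state interval and merges the dependencies.
        self.interval += 1
        for proc, idx in msg_deps.items():
            self.deps[proc] = max(self.deps.get(proc, 0), idx)
        self.deps[self.name] = self.interval
        self.history.append(dict(self.deps))

    def roll_back(self, failed, last_recoverable):
        # Discard every interval that depends on a lost interval of `failed`.
        # Interval 0 depends on nothing, so the loop always terminates.
        while self.history[-1].get(failed, 0) > last_recoverable:
            self.history.pop()
        self.deps = dict(self.history[-1])
        self.interval = self.deps[self.name]


# Hypothetical scenario: A's second interval is lost in a crash.
a, b, c = Process("A"), Process("B"), Process("C")
a.receive({})          # external input: A enters interval 1
b.receive(a.send())    # B now depends on A's interval 1
a.receive({})          # external input: A enters interval 2 (never logged)
c.receive(a.send())    # C now depends on A's interval 2

# A fails; assume only its interval 1 reached stable storage.
for p in (a, b, c):
    p.roll_back(failed="A", last_recoverable=1)

print({p.name: p.interval for p in (a, b, c)})
```

In this scenario B keeps its state because it depends only on a recoverable interval of A, while C is an orphan and rolls back. No process had to synchronize with another before the failure; the consistent state is reconstructed afterwards from the tracked dependencies.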