A survey of rollback-recovery protocols in message-passing systems

Authors:
E. N. (Mootaz) Elnozahy;Lorenzo Alvisi;Yi-Min Wang;David B. Johnson
Affiliations:
IBM Research, Austin, TX;The University of Texas at Austin, Austin, TX;Microsoft Research, Redmond, WA;Rice University, Houston, TX
Venue:
ACM Computing Surveys (CSUR)
Year:
2002

Citing 42
Cited 290

Optimistic recovery in distributed systems

ACM Transactions on Computer Systems (TOCS)
Checkpointing and Rollback-Recovery for Distributed Systems

IEEE Transactions on Software Engineering - Special issue on distributed systems
On distributed snapshots

Information Processing Letters
Fault tolerance under UNIX

ACM Transactions on Computer Systems (TOCS)
A software instruction counter

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
IGOR: a system for program debugging via reversible execution

PADD '88 Proceedings of the 1988 ACM SIGPLAN and SIGOPS workshop on Parallel and distributed debugging
Efficient distributed recovery using message logging

Proceedings of the eighth annual ACM Symposium on Principles of distributed computing
Recovery in distributed systems using optimistic message logging and checkpointing

Journal of Algorithms
Efficient checkpointing on MIMD architectures

Efficient checkpointing on MIMD architectures
Space reclamation for uncoordinated checkpointing in message-passing systems

Space reclamation for uncoordinated checkpointing in message-passing systems
Necessary and Sufficient Conditions for Consistent Global Snapshots

IEEE Transactions on Parallel and Distributed Systems
Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems.

IEEE Transactions on Parallel and Distributed Systems
Distributed snapshots: determining global states of distributed systems

ACM Transactions on Computer Systems (TOCS)
Hypervisor-based fault tolerance

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Understanding the message logging paradigm for masking process crashes

Understanding the message logging paradigm for masking process crashes
Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems

IEEE Transactions on Parallel and Distributed Systems
Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints

IEEE Transactions on Computers
Application level fault tolerance in heterogeneous networks of workstations

Journal of Parallel and Distributed Computing
A Survey of Recoverable Distributed Shared Virtual Memory Systems

IEEE Transactions on Parallel and Distributed Systems
Support for Software Interrupts in Log-Based Rollback-Recovery

IEEE Transactions on Computers
Fail-stop processors: an approach to designing fault-tolerant computing systems

ACM Transactions on Computer Systems (TOCS)
Time, clocks, and the ordering of events in a distributed system

Communications of the ACM
Rollback Recovery in Distributed Systems Using Loosely Synchronized Clocks

IEEE Transactions on Parallel and Distributed Systems
Message Logging: Pessimistic, Optimistic, Causal, and Optimal

IEEE Transactions on Software Engineering
Performance of Consistent Checkpointing in a Modular Operating System: Results of the FTM Experiment

EDCC-1 Proceedings of the First European Dependable Computing Conference on Dependable Computing
Ensuring Data Security and Integrity with a Fast Stable Storage

Proceedings of the Fourth International Conference on Data Engineering
Experimental Evaluation of Concurrent Checkpointing and Rollback-Recovery Algorithms

Proceedings of the Sixth International Conference on Data Engineering
Virtual Precedence in Asynchronous Systems: Concept and Applications

WDAG '97 Proceedings of the 11th International Workshop on Distributed Algorithms
Probabilistic Checkpointing

FTCS '97 Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS '97)
How Safe is Probabilistic Checkpointing?

FTCS '98 Proceedings of the Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
An Analysis of Communication-Induced Checkpointing

FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
A NonStop kernel

SOSP '81 Proceedings of the eighth ACM symposium on Operating systems principles
Converting a swap-based system to do paging in an architecture lacking page-referenced bits

SOSP '81 Proceedings of the eighth ACM symposium on Operating systems principles
Minimizing timestamp size for completely asynchronous optimistic recovery with minimal rollback

SRDS '96 Proceedings of the 15th Symposium on Reliable Distributed Systems
Preventing Useless Checkpoints in Distributed Computations

SRDS '97 Proceedings of the 16th Symposium on Reliable Distributed Systems
A VP-Accordant Checkpointing Protocol Preventing Useless Checkpoints

SRDS '98 Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems
The Cost of Recovery in Message Logging Protocols

SRDS '98 Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems
Why Optimistic Message Logging Has Not Been Used in Telecommunications Systems

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Adding input and output to the transactional model

Adding input and output to the transactional model
Distributed system fault tolerance using message logging and checkpointing

Distributed system fault tolerance using message logging and checkpointing
Manetho: fault tolerance in distributed systems using rollback-recovery and process replication

Manetho: fault tolerance in distributed systems using rollback-recovery and process replication
Libckpt: transparent checkpointing under Unix

TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings

Checkpoint-Recovery for Mobile Intelligent Networks

Proceedings of the 14th International conference on Industrial and engineering applications of artificial intelligence and expert systems: engineering of intelligent systems
MPICH-CM: A Communication Library Design for a P2P MPI Implementation

Proceedings of the 9th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
ReVirt: enabling intrusion analysis through virtual-machine logging and replay

ACM SIGOPS Operating Systems Review - OSDI '02: Proceedings of the 5th symposium on Operating systems design and implementation
On Properties of RDT Communication-Induced Checkpointing Protocols

IEEE Transactions on Parallel and Distributed Systems
Distributed recovery with K-optimistic logging

Journal of Parallel and Distributed Computing
Unveiling the transport

ACM SIGCOMM Computer Communication Review
Improving Logging and Recovery Performance in Phoenix/App

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Energy-aware deterministic fault tolerance in distributed real-time embedded systems

Proceedings of the 41st annual Design Automation Conference
Brief announcement: optimal asynchronous garbage collection for checkpointing protocols with rollback-dependency trackability

Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing
Recovery guarantees for Internet applications

ACM Transactions on Internet Technology (TOIT)
Concurrent checkpoint initiation and recovery algorithms on asynchronous ring networks

Journal of Parallel and Distributed Computing
Checkpointing-based rollback recovery for parallel applications on the InteGrade grid middleware

MGC '04 Proceedings of the 2nd workshop on Middleware for grid computing
Recovery-Oriented Computing: Building Multitier Dependability

Computer
Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery

IEEE Transactions on Dependable and Secure Computing
Replication for web hosting systems

ACM Computing Surveys (CSUR)
High-Availability Algorithms for Distributed Stream Processing

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Event Logging: Portable and Efficient Checkpointing in Heterogeneous Environments with Non-FIFO Communication Platforms

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 1 - Volume 02
Fault-Tolerant Parallel Applications with Dynamic Parallel Schedules

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 16 - Volume 17
ReVirt: enabling intrusion analysis through virtual-machine logging and replay

OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation
Fault-tolerance in the Borealis distributed stream processing system

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
BugNet: Continuously Recording Program Execution for Deterministic Replay Debugging

Proceedings of the 32nd annual international symposium on Computer Architecture
A channel memory based fault tolerance for MPI applications

Future Generation Computer Systems - Special issue: Parallel computing technologies
Surviving Errors in Component-Based Software

EUROMICRO '05 Proceedings of the 31st EUROMICRO Conference on Software Engineering and Advanced Applications
Vigilante: end-to-end containment of internet worms

Proceedings of the twentieth ACM symposium on Operating systems principles
Speculative execution in a distributed file system

Proceedings of the twentieth ACM symposium on Operating systems principles
Rx: treating bugs as allergies---a safe method to survive software failures

Proceedings of the twentieth ACM symposium on Operating systems principles
Using Consistent Global Checkpoints to Synchronize Processes in Distributed Simulation

DS-RT '05 Proceedings of the 9th IEEE International Symposium on Distributed Simulation and Real-Time Applications
An Efficient Index-Based Checkpointing Protocol with Constant-Size Control Information on Messages

IEEE Transactions on Dependable and Secure Computing
Strategies for storage of checkpointing data using non-dedicated repositories on Grid systems

MGC '05 Proceedings of the 3rd international workshop on Middleware for grid computing
Design and Implementation of Multiple Fault-Tolerant MPI over Myrinet (M^3)

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
A Faster Checkpointing and Recovery Algorithm with a Hierarchical Storage Approach

HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
A virtual machine monitor for utilizing non-dedicated clusters

Proceedings of the twentieth ACM symposium on Operating systems principles
Fast and transparent recovery for continuous availability of cluster-based servers

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Transparent State Management for Optimistic Synchronization in the High Level Architecture

Simulation
Evaluation of a Fault-Tolerance Mechanism for HLA-Based Distributed Simulations

Proceedings of the 20th Workshop on Principles of Advanced and Distributed Simulation
Performance analysis of different checkpointing and recovery schemes using stochastic model

Journal of Parallel and Distributed Computing
Finding a suitable checkpoint and recovery protocol for a distributed application

Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
Design, Analysis and Performance Evaluation of a New Algorithm for Developing a Fault Tolerant Distributed System

ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
Stabilizers: a modular checkpointing abstraction for concurrent functional programs

Proceedings of the eleventh ACM SIGPLAN international conference on Functional programming
In-network fault tolerance in networked sensor systems

DIWANS '06 Proceedings of the 2006 workshop on Dependability issues in wireless ad hoc networks and sensor networks
EOS2: unstoppable stateful PHP

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Checkpointing and rollback-recovery protocol integrated with VsSG protocol for RYW session guarantee

PDCN'06 Proceedings of the 24th IASTED international conference on Parallel and distributed computing and networks
Distributed data storage for opportunistic grids

Proceedings of the 3rd international Middleware doctoral symposium
Strategies for Checkpoint Storage on Opportunistic Grids

IEEE Distributed Systems Online
Implementing fault-tolerance in real-time systems by automatic program transformations

EMSOFT '06 Proceedings of the 6th ACM & IEEE International conference on Embedded software
ExecRecorder: VM-based full-system replay for attack analysis and system recovery

Proceedings of the 1st workshop on Architectural and system support for improving software dependability
Toward real-time image guided neurosurgery using distributed and grid computing

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Speculative execution in a distributed file system

ACM Transactions on Computer Systems (TOCS)
Efficient hardware checkpointing: concepts, overhead analysis, and implementation

Proceedings of the 2007 ACM/SIGDA 15th international symposium on Field programmable gate arrays
Declarative failure recovery for sensor networks

Proceedings of the 6th international conference on Aspect-oriented software development
Quasi-atomic recovery for distributed agents

Parallel Computing
Framework for instruction-level tracing and analysis of program executions

Proceedings of the 2nd international conference on Virtual execution environments
Flashback: a lightweight extension for rollback and deterministic replay for software debugging

ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Log-based recovery for middleware servers

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Static analysis meets distributed fault-tolerance: enabling state-machine replication with nondeterminism

HOTDEP'06 Proceedings of the 2nd conference on Hot Topics in System Dependability - Volume 2
WiDS: an integrated toolkit for distributed system development

HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Detecting targeted attacks using shadow honeypots

SSYM'05 Proceedings of the 14th conference on USENIX Security Symposium - Volume 14
On the Complexity of Removing Z-Cycles from a Checkpoints and Communication Pattern

IEEE Transactions on Computers
Self-stabilizing algorithm for checkpointing in a distributed system

Journal of Parallel and Distributed Computing
Rethink the sync

OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Modular Checkpointing for Atomicity

Electronic Notes in Theoretical Computer Science (ENTCS)
Rx: Treating bugs as allergies—a safe method to survive software failures

ACM Transactions on Computer Systems (TOCS)
Efficient checkpointing of Java software using context-sensitive capture and replay

Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering
Modeling and design of fault-tolerant and self-adaptive reconfigurable networked embedded systems

EURASIP Journal on Embedded Systems
Bouncer: securing software by blocking bad input

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Transactions with isolation and cooperation

Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications
Rethink the sync

OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
Automated Rule-Based Diagnosis through a Distributed Monitor System

IEEE Transactions on Dependable and Secure Computing
Fault-tolerance in the borealis distributed stream processing system

ACM Transactions on Database Systems (TODS)
A Lightweight Heuristic-based Mechanism for Collecting Committed Consistent Global States in Optimistic Simulation

DS-RT '07 Proceedings of the 11th IEEE International Symposium on Distributed Simulation and Real-Time Applications
Execution replay of multiprocessor virtual machines

Proceedings of the fourth ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Better bug reporting with better privacy

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
A survey of linguistic structures for application-level fault tolerance

ACM Computing Surveys (CSUR)
Information Assurance: Dependability and Security in Networked Systems

Information Assurance: Dependability and Security in Networked Systems
Model-based performance evaluation of distributed checkpointing protocols

Performance Evaluation
A synchronous checkpointing protocol for mobile distributed systems: probabilistic approach

International Journal of Information and Computer Security
Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Data-stream-based global event monitoring using pairwise interactions

Journal of Parallel and Distributed Computing
Dependability evaluation of dedicated server group orphan detection method

ICS'05 Proceedings of the 9th WSEAS International Conference on Systems
Preventing of burst traffic in DSG method

ICS'05 Proceedings of the 9th WSEAS International Conference on Systems
Improvement of DSG method

AMCOS'05 Proceedings of the 4th WSEAS International Conference on Applied Mathematics and Computer Science
A low-cost hybrid coordinated checkpointing protocol for mobile distributed systems

Mobile Information Systems
Communication analysis of distributed programs

Scientific Programming - Parallel/High-Performance Object-Oriented Scientific Computing (POOSC '05), Glasgow, UK, 25 July 2005
Designing and implementing malicious hardware

LEET'08 Proceedings of the 1st Usenix Workshop on Large-Scale Exploits and Emergent Threats
Rethink the sync

ACM Transactions on Computer Systems (TOCS)
Synthesis of fault-tolerant embedded systems

Proceedings of the conference on Design, automation and test in Europe
2-step algorithm for enhancing effectiveness of sender-based message logging

SpringSim '07 Proceedings of the 2007 spring simulation multiconference - Volume 2
Novel log management for sender-based message logging

ICAI'08 Proceedings of the 9th WSEAS International Conference on Automation and Information
Handling Emergent Nondeterminism in Replicated Services

Architecting Dependable Systems V
Providing Non-stop Service for Message-Passing Based Parallel Applications with RADIC

Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Fault-tolerant stream processing using a distributed, replicated file system

Proceedings of the VLDB Endowment
Vigilante: End-to-end containment of Internet worm epidemics

ACM Transactions on Computer Systems (TOCS)
Lightweight log management algorithm for removing logged messages of sender processes with little overhead

WSEAS Transactions on Computers
The implementation and evaluation of a recovery system for workflows

Journal of Network and Computer Applications
An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems

Journal of Parallel and Distributed Computing
FINE: A Fully Informed aNd Efficient communication-induced checkpointing protocol for distributed systems

Journal of Parallel and Distributed Computing
Engineering of Software-Intensive Systems: State of the Art and Research Challenges

Software-Intensive Systems and New Computing Paradigms
A novel fault-tolerant execution model by using of mobile agents

Journal of Network and Computer Applications
Reconfiguration Strategies for Environmentally Powered Devices: Theoretical Analysis and Experimental Validation

Transactions on High-Performance Embedded Architectures and Compilers I
Sensornet Checkpointing: Enabling Repeatability in Testbeds and Realism in Simulations

EWSN '09 Proceedings of the 6th European Conference on Wireless Sensor Networks
Recovery domains: an organizing principle for recoverable operating systems

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Numerical computation algorithms for sequential checkpoint placement

Performance Evaluation
Algorithm-based fault tolerance applied to high performance computing

Journal of Parallel and Distributed Computing
Transparent checkpoints of closed distributed systems in Emulab

Proceedings of the 4th ACM European conference on Computer systems
A Checkpointing Method with Small Checkpoint Latency

IEICE - Transactions on Information and Systems
A systematic approach to system state restoration during storage controller micro-recovery

FAST '09 Proceedings of the 7th conference on File and storage technologies
RT-replayer: a record-replay architecture for embedded real-time software debugging

Proceedings of the 2009 ACM symposium on Applied Computing
FlashBox: a system for logging non-deterministic events in deployed embedded systems

Proceedings of the 2009 ACM symposium on Applied Computing
Dependability, Abstraction, and Programming

DASFAA '09 Proceedings of the 14th International Conference on Database Systems for Advanced Applications
Practical and low-overhead masking of failures of TCP-based servers

ACM Transactions on Computer Systems (TOCS)
Interconnect agnostic checkpoint/restart in open MPI

Proceedings of the 18th ACM international symposium on High performance distributed computing
In-field healing of integration problems with COTS components

ICSE '09 Proceedings of the 31st International Conference on Software Engineering
Characterizing fault tolerance in genetic programming

BADS '09 Proceedings of the 2009 workshop on Bio-inspired algorithms for distributed systems
Tolerating latency in replicated state machines through client speculation

NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
FlashLogging: exploiting flash devices for synchronous logging performance

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities

International Journal of High Performance Computing Applications
Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing

CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Active Optimistic Message Logging for Reliable Execution of MPI Applications

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Message fragment based causal message logging

Journal of Parallel and Distributed Computing
Adapting grid applications to safety using fault-tolerant methods: Design, implementation and evaluations

Future Generation Computer Systems
Towards Zero-Delay Recovery of Agents in Production Automation Systems

WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 02
Toward Exascale Resilience

International Journal of High Performance Computing Applications
Design optimization of time- and cost-constrained fault-tolerant embedded systems with checkpointing and replication

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Customizable execution environments with virtual desktop grid computing

PDCS '07 Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems
PLFS: a checkpoint filesystem for parallel applications

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
VGrADS: enabling e-Science workflows on grids and clouds with fault tolerance

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
FALCON: a system for reliable checkpoint recovery in shared grid environments

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Performance analysis of mobile agent failure recovery in e-service applications

Computer Standards & Interfaces
R-ECS: reliable elastic computing services for building virtual computing environment

Proceedings of the 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human
DSF: a common platform for distributed systems research and development

Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
An empirical study of high availability in stream processing systems

Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
A fault-tolerant strategy for virtualized HPC clusters

The Journal of Supercomputing
Autonomic fault mitigation in embedded systems

Engineering Applications of Artificial Intelligence
A Channel Memory based fault tolerance for MPI applications

Future Generation Computer Systems - Special issue: Parallel computing technologies
A load balancing fault-tolerant algorithm for heterogeneous cluster environments

Neural, Parallel & Scientific Computations
A weighted checkpointing protocol for mobile distributed systems

International Journal of Ad Hoc and Ubiquitous Computing
Failure-aware resource management for high-availability computing clusters with distributed virtual machines

Journal of Parallel and Distributed Computing
JaceV: a programming and execution environment for asynchronous iterative computations on volatile nodes

VECPAR'06 Proceedings of the 7th international conference on High performance computing for computational science
A pattern-based approach for modeling and analyzing error recovery

Architecting dependable systems IV
Characterizing fault tolerance in genetic programming

Future Generation Computer Systems
An efficient handoff strategy for mobile computing checkpoint system

EUC'07 Proceedings of the 2007 international conference on Embedded and ubiquitous computing
Schedulable online testing framework for real-time embedded applications in VM

EUC'07 Proceedings of the 2007 international conference on Embedded and ubiquitous computing
A scalable asynchronous replication-based strategy for fault tolerant MPI applications

HiPC'07 Proceedings of the 14th international conference on High performance computing
A novel fault-tolerant parallel algorithm

APPT'07 Proceedings of the 7th international conference on Advanced parallel processing technologies
CPPC-G: fault-tolerant applications on the grid

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
An uncoordinated asynchronous checkpointing model for hierarchical scientific workflows

Journal of Computer and System Sciences
DSF: a common platform for distributed systems research and development

Middleware'09 Proceedings of the ACM/IFIP/USENIX 10th international conference on Middleware
Performance evaluation of an application-level checkpointing solution on grids

Future Generation Computer Systems
Enabling replication in the ASSISTANT programming model

Proceedings of the 6th International Wireless Communications and Mobile Computing Conference
Analysis of service availability for time-triggered rejuvenation policies

Journal of Systems and Software
Lightweight checkpointing for Concurrent ML

Journal of Functional Programming
A cost model for autonomic reconfigurations in high-performance pervasive applications

Proceedings of the 4th ACM International Workshop on Context-Awareness for Self-Managing Systems
Automatic workarounds for web applications

Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
DOM transactions for testing JavaScript

TAIC PART'10 Proceedings of the 5th international academic and industrial conference on Testing - practice and research techniques
Communicating transactions

CONCUR'10 Proceedings of the 21st international conference on Concurrency theory
Improving message logging protocols scalability through distributed event logging

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Checkpoint/restart-enabled parallel debugging

EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
Recent advances in checkpoint/recovery systems

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Checkpointing and rollback-recovery protocol for mobile systems with MW session guarantee

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Behavioral simulations in MapReduce

Proceedings of the VLDB Endowment
Compiler-support for robust multi-core computing

ISoLA'10 Proceedings of the 4th international conference on Leveraging applications of formal methods, verification, and validation - Volume Part I
Reliable distributed data stream management in mobile environments

Information Systems
Theoretical and experimental evaluation of communication-induced checkpointing protocols in FE and FLazy-E families

Performance Evaluation
Hybrid checkpointing using emerging nonvolatile memories for future exascale systems

ACM Transactions on Architecture and Code Optimization (TACO)
Static analysis meets distributed fault-tolerance: enabling state-machine replication with nondeterminism

HotDep'06 Proceedings of the Second conference on Hot topics in system dependability
New & efficient low overheads algorithm for mobile distributed systems

Proceedings of the International Conference & Workshop on Emerging Trends in Technology
Architecting dependable systems with proactive fault management

Architecting dependable systems VII
A latency and fault-tolerance optimizer for online parallel query plans

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Fast checkpoint recovery algorithms for frequently consistent applications

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
RAFT at work: speeding-up mapreduce applications under task and node failures

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A Petri net model for service availability in redundant computing systems

Winter Simulation Conference
Algorithm-based recovery for iterative methods without checkpointing

Proceedings of the 20th international symposium on High performance distributed computing
From session guarantees to contract guarantees for consistency of SOA-compliant processing

ACIIDS'11 Proceedings of the Third international conference on Intelligent information and database systems - Volume Part I
Log-based middleware server recovery with transaction support

The VLDB Journal — The International Journal on Very Large Data Bases
Rebound: scalable checkpointing for coherent shared memory

Proceedings of the 38th annual international symposium on Computer architecture
On the design of perturbation-resilient atomic commit protocols for mobile transactions

ACM Transactions on Computer Systems (TOCS)
Federate Fault Tolerance in HLA-Based Simulation

PADS '10 Proceedings of the 2010 IEEE Workshop on Principles of Advanced and Distributed Simulation
Tolerating correlated failures for generalized Cartesian distributions via bipartite matching

Proceedings of the 8th ACM International Conference on Computing Frontiers
Causal cycle based communication pattern matching

ICDCN'10 Proceedings of the 11th international conference on Distributed computing and networking
MAHEVE: an efficient reliable mapping of asynchronous iterative applications on volatile and heterogeneous environments

Euro-Par 2010 Proceedings of the 2010 conference on Parallel processing
On the use of cluster-based partial message logging to improve fault tolerance for MPI HPC applications

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Application-specific fault tolerance via data access characterization

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Controlling reversibility in higher-order Pi

CONCUR'11 Proceedings of the 22nd international conference on Concurrency theory
ReServE service: an approach to increase reliability in service oriented systems

PaCT'11 Proceedings of the 11th international conference on Parallel computing technologies
libhashckpt: hash-based incremental checkpointing using GPU's

EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
Checkpointing strategies for parallel jobs

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
BlobCR: efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Evaluating the viability of process replication reliability for exascale systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Unstoppable stateful PHP web services

WISE'06 Proceedings of the 7th international conference on Web Information Systems
Robust web services via interaction contracts

TES'04 Proceedings of the 5th international conference on Technologies for E-Services
An efficient and scalable checkpointing and recovery algorithm for distributed systems

ICDCN'06 Proceedings of the 8th international conference on Distributed Computing and Networking
A hybrid fault tolerance scheme for EasyGrid MPI applications

Proceedings of the 9th International Workshop on Middleware for Grids, Clouds and e-Science
Safety of rollback-recovery protocol maintaining WFR session guarantee

ISCIS'06 Proceedings of the 21st international conference on Computer and Information Sciences
An intelligent management of fault tolerance in cluster using RADICMPI

EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
Extended mpijava for distributed checkpointing and recovery

EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
Checkpointing and communication pattern-neutral algorithm for removing messages logged by senders

HPCC'06 Proceedings of the Second international conference on High Performance Computing and Communications
An operating system infrastructure for fault-tolerant reconfigurable networks

ARCS'06 Proceedings of the 19th international conference on Architecture of Computing Systems
An efficient computing-checkpoint based coordinated checkpoint algorithm

EUC'06 Proceedings of the 2006 international conference on Embedded and Ubiquitous Computing
Self-refined fault tolerance in HPC using dynamic dependent process groups

IWDC'05 Proceedings of the 7th international conference on Distributed Computing
A framework for automatic identification of the best checkpoint and recovery protocol

IWDC'04 Proceedings of the 6th international conference on Distributed Computing
Garbage collection in a causal message logging protocol

HPCC'05 Proceedings of the First international conference on High Performance Computing and Communications
Using computing checkpoints implement consistent low-cost non-blocking coordinated checkpointing

PDCAT'04 Proceedings of the 5th international conference on Parallel and Distributed Computing: applications and Technologies
Novel recovery mechanism for the restoration of image contents in teleconsultation sessions

Computer Methods and Programs in Biomedicine
Group communication: from practice to theory

SOFSEM'06 Proceedings of the 32nd conference on Current Trends in Theory and Practice of Computer Science
Bounding recovery time in rollback-recovery protocol for mobile systems preserving session guarantees

DAIS'06 Proceedings of the 6th IFIP WG 6.1 international conference on Distributed Applications and Interoperable Systems
Safety of recovery protocol preserving MW session guarantee in mobile systems

ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part IV
Rollback-recovery protocol guarantying MR session guarantee in distributed systems with mobile clients

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
A checkpoint/recovery model for heterogeneous dataflow computations using work-stealing

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Checkpointing and migration of communication channels in heterogeneous grid environments

ICA3PP'10 Proceedings of the 10th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Determining consistent states of distributed objects participating in a remote method call

ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part I
Implementing rollback-recovery coordinated checkpoints

ISSADS'05 Proceedings of the 5th international conference on Advanced Distributed Systems
Performance evaluation of consistent recovery protocols using MPICH-GF

EDCC'05 Proceedings of the 5th European conference on Dependable Computing
Parallel checkpointing on a grid-enabled java platform

EGC'05 Proceedings of the 2005 European conference on Advances in Grid Computing
Dynamic failure management for parallel applications on grids

EGC'05 Proceedings of the 2005 European conference on Advances in Grid Computing
Efficient and coordinated checkpointing for reliable distributed data stream management

ADBIS'06 Proceedings of the 10th East European conference on Advances in Databases and Information Systems
Dependable systems

Dependable Systems
Fault-tolerant parallel applications with dynamic parallel schedules: a programmer's perspective

Dependable Systems
Formal service-oriented development of fault tolerant communicating systems

Rigorous Development of Complex Fault-Tolerant Systems
Application-Level checkpointing techniques for parallel programs

ICDCIT'06 Proceedings of the Third international conference on Distributed Computing and Internet Technology
Hardware instruction counting for log-based rollback recovery on x86-family processors

ISAS'06 Proceedings of the Third international conference on Service Availability
A dead-lock free self-healing algorithm for distributed transactional processes

ICISS'06 Proceedings of the Second international conference on Information Systems Security
An efficient algorithm for removing useless logged messages in SBML protocols

ICDCIT'05 Proceedings of the Second international conference on Distributed Computing and Internet Technology
Analysis of interval-based global state detection

ICDCIT'05 Proceedings of the Second international conference on Distributed Computing and Internet Technology
Specification and synthesis of hardware checkpointing and rollback mechanisms

Proceedings of the 49th Annual Design Automation Conference
Consistent rollback protocols for autonomic ASSISTANT applications

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing
Cooperative Application/OS DRAM fault recovery

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
On the viability of checkpoint compression for extreme scale fault tolerance

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Impact of over-decomposition on coordinated checkpoint/rollback protocol

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
HOPE: A Hybrid Optimistic checkpointing and selective Pessimistic mEssage logging protocol for large scale distributed systems

Future Generation Computer Systems
Checkpoint scheduling model for optimality

Information Processing Letters
Massively-parallel stream processing under QoS constraints with Nephele

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Data-driven fault tolerance for work stealing computations

Proceedings of the 26th ACM international conference on Supercomputing
Independent checkpointing in a heterogeneous grid environment

Future Generation Computer Systems
Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Deriving a unified fault taxonomy for event-based systems

Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems
Composable reliability for asynchronous systems

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
A multi-cycle checkpointing protocol that ensures strict 1-rollback

Information Processing Letters
NV-process: a fault-tolerance process model based on non-volatile memory

Proceedings of the Asia-Pacific Workshop on Systems
Transparent optimistic synchronization in the high-level architecture via time-management conversion

ACM Transactions on Modeling and Computer Simulation (TOMACS)
NV-process: a fault-tolerance process model based on non-volatile memory

APSys'12 Proceedings of the Third ACM SIGOPS Asia-Pacific conference on Systems
Automatic undo for cloud management via AI planning

HotDep'12 Proceedings of the Eighth USENIX conference on Hot Topics in System Dependability
McrEngine: a scalable checkpointing system using data-aware aggregation and compression

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Alleviating scalability issues of checkpointing protocols

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Themis: an I/O-efficient MapReduce

Proceedings of the Third ACM Symposium on Cloud Computing
enhancing fault-tolerance of large-scale MPI scientific applications

PaCT'07 Proceedings of the 9th international conference on Parallel Computing Technologies
Fault tolerance: case study

Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology
On the optimality of rollback-recovery protocol preserving session guarantees

ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
Low cost self-healing in MPI applications

PVM/MPI'07 Proceedings of the 14th European conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
D-reserve: distributed reliable service environment

ADBIS'12 Proceedings of the 16th East European conference on Advances in Databases and Information Systems
Fault Tolerant Architecture to Cloud Computing Using Adaptive Checkpoint

International Journal of Cloud Applications and Computing
Abstractions and Middleware for Petascale Computing and Beyond

International Journal of Distributed Systems and Technologies
Soft-Checkpointing Based Hybrid Synchronous Checkpointing Protocol for Mobile Distributed Systems

International Journal of Distributed Systems and Technologies
Checkpointing SystemC-Based Virtual Platforms

International Journal of Embedded and Real-Time Communication Systems
A mid-career review of teaching computer science I

Proceeding of the 44th ACM technical symposium on Computer science education
The viability of using compression to decrease message log sizes

Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
AI-Ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing

Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Failure-atomic msync(): a simple and efficient mechanism for preserving the integrity of durable data

Proceedings of the 8th ACM European Conference on Computer Systems
Consistency guarantees for recovery of service-oriented distributed processing

International Journal of Intelligent Information and Database Systems
Evaluating the feasibility of using memory content similarity to improve system resilience

Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers
Memory array protection: check on read or check on write?

Proceedings of the Conference on Design, Automation and Test in Europe
Consistent and efficient output-streams management in optimistic simulation platforms

Proceedings of the 2013 ACM SIGSIM conference on Principles of advanced discrete simulation
BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds

Journal of Parallel and Distributed Computing
Automatic recovery from runtime failures

Proceedings of the 2013 International Conference on Software Engineering
A framework for self-healing software systems

Proceedings of the 2013 International Conference on Software Engineering
Rollback-recovery without checkpoints in distributed event processing systems

Proceedings of the 7th ACM international conference on Distributed event-based systems
Performance comparison under failures of MPI and MapReduce: An analytical approach

Future Generation Computer Systems
Arrakis: a case for the end of the empire

HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
"All roads lead to Rome": optimistic recovery for distributed iterative data processing

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Customizable execution environments for evolutionary computation using BOINC + virtualization

Natural Computing: an international journal
Program transformation techniques applied to languages used in high performance computing

Proceedings of the 2013 companion publication for conference on Systems, programming, & applications: software for humanity
A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

The Journal of Supercomputing
Post-failure recovery of MPI communication capability: Design and rationale

International Journal of High Performance Computing Applications
Exception handlers for healing component-based systems

ACM Transactions on Software Engineering and Methodology (TOSEM) - Testing, debugging, and error handling, formal methods, lifecycle concerns, evolution and maintenance
Online checkpointing with improved worst-case guarantees

ICALP'13 Proceedings of the 40th international conference on Automata, Languages, and Programming - Volume Part I
Multi-criteria checkpointing strategies: response-time versus resource utilization

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Evaluating energy savings for checkpoint/restart

E2SC '13 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing
HotSnap: a hot distributed snapshot system for virtual machine cluster

LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Supporting undoability in systems operations

LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Software health management with Bayesian networks

Innovations in Systems and Software Engineering
Compiler-Assisted Checkpointing of Parallel Codes: The Cetus and LLVM Experience

International Journal of Parallel Programming
Accelerating incremental checkpointing for extreme-scale computing

Future Generation Computer Systems
X10-FT: Transparent fault tolerance for APGAS language and runtime

Parallel Computing
Orphan-Free Consistent Condition for Log-Based Checkpointing and Rollback Recovery Scheme

International Journal of Advanced Pervasive and Ubiquitous Computing
McrEngine: A scalable checkpointing system using data-aware aggregation and compression

Scientific Programming - Selected Papers from Super Computing 2012
Nephele streaming: stream processing under QoS constraints at scale

Cluster Computing

Abstract

This survey covers rollback-recovery techniques that do not require special language constructs. In the first part of the survey we classify rollback-recovery protocols into checkpoint-based and log-based. Checkpoint-based protocols rely solely on checkpointing for system state restoration. Checkpointing can be coordinated, uncoordinated, or communication-induced. Log-based protocols combine checkpointing with logging of nondeterministic events, encoded in tuples called determinants. Depending on how determinants are logged, log-based protocols can be pessimistic, optimistic, or causal. Throughout the survey, we highlight the research issues that are at the core of rollback-recovery and present the solutions that currently address them. We also compare the performance of different rollback-recovery protocols with respect to a series of desirable properties and discuss the issues that arise in the practical implementations of these protocols.
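As a rough illustration of the log-based idea summarized above, the following minimal sketch shows a single process that takes a checkpoint, logs a determinant for each message delivery, and replays the logged determinants after a simulated failure. It is not taken from the survey: the names `Determinant`, `Process`, `deliver`, `take_checkpoint`, and `recover` are hypothetical, the in-memory list merely stands in for stable storage, and the determinant is logged before the state changes, which loosely resembles the pessimistic style.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Determinant:
    """Hypothetical record of one nondeterministic event (a message delivery)."""
    sender: str
    send_seq: int   # sender-side sequence number of the message
    recv_seq: int   # position of the delivery in the receiver's order
    payload: int


@dataclass
class Process:
    state: int = 0
    delivered: int = 0
    checkpoint: tuple = (0, 0)                # (state, delivered) at last checkpoint
    log: list = field(default_factory=list)   # assumption: stands in for stable storage

    def deliver(self, sender, send_seq, payload, logging=True):
        # The delivery order is the nondeterministic event; record it first.
        if logging:
            self.log.append(Determinant(sender, send_seq, self.delivered, payload))
        # Deterministic, order-sensitive state update.
        self.state = self.state * 2 + payload
        self.delivered += 1

    def take_checkpoint(self):
        self.checkpoint = (self.state, self.delivered)
        # Determinants that precede the checkpoint are no longer needed here.
        self.log.clear()

    def recover(self):
        # Restore the checkpoint, then replay deliveries in their logged order.
        self.state, self.delivered = self.checkpoint
        for d in sorted(self.log, key=lambda d: d.recv_seq):
            self.deliver(d.sender, d.send_seq, d.payload, logging=False)
```

In this sketch, calling `recover()` after a crash rebuilds the pre-failure state because the only nondeterminism, the delivery order, was captured in the determinants. A checkpoint-based protocol would instead stop at restoring `checkpoint` and rely on coordination among processes to keep the restored global state consistent.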