Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit

Authors:
Elmootazbellah N. Elnozahy; Willy Zwaenepoel
Venue:
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Year:
1992

Abstract

Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message logging, namely limited rollback and fast output commit, and the advantage of optimistic message logging, namely low failure-free overhead. These advantages come at the expense of a complex recovery scheme.
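
To make two of the ingredients named in the abstract concrete, the following is a minimal toy sketch in Python of sender-based message logging combined with a piggybacked antecedence graph. It is not the paper's implementation. The class `Process`, the function `recover`, the node format `(pid, interval)`, and the message identifier `(sender_pid, ssn)` are all hypothetical names chosen for illustration. The sketch assumes a single failure, delivers messages by direct method call instead of over a network, piggybacks the whole graph on every message, and omits checkpointing, stable storage, and output commit entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy process (illustrative only; names are assumptions, not from the paper)."""
    pid: str
    interval: int = 0                              # index of the current state interval
    next_ssn: int = 0                              # send sequence number
    send_log: dict = field(default_factory=dict)   # (dest_pid, ssn) -> payload, kept at the sender
    graph: dict = field(default_factory=dict)      # node -> (predecessor nodes, message id or None)
    delivered: list = field(default_factory=list)  # payloads delivered so far (the "state")

    def node(self):
        return (self.pid, self.interval)

    def send(self, dest, payload):
        # Sender-based logging: the sender keeps a copy of every message it sends.
        self.next_ssn += 1
        msg_id = (self.pid, self.next_ssn)
        self.send_log[(dest.pid, self.next_ssn)] = payload
        # Piggyback the sender's view of the antecedence graph on the message.
        piggyback = dict(self.graph)
        piggyback.setdefault(self.node(), (frozenset(), None))
        dest.receive(self.node(), msg_id, piggyback, payload)

    def receive(self, sender_node, msg_id, piggyback, payload):
        prev = self.node()
        self.graph.setdefault(prev, (frozenset(), None))
        # Merge the piggybacked graph into the local one.
        for n, entry in piggyback.items():
            self.graph.setdefault(n, entry)
        # A receive starts a new state interval that depends on the previous
        # local interval and on the sender's interval at send time.
        self.interval += 1
        self.graph[self.node()] = (frozenset({prev, sender_node}), msg_id)
        self.delivered.append(payload)


def recover(failed_pid, survivors):
    """Rebuild the failed process by replaying logged messages in recorded order."""
    # 1. Find the receive order of the failed process in the survivors' graphs.
    order = {}
    for p in survivors:
        for (pid, idx), (_preds, msg_id) in p.graph.items():
            if pid == failed_pid and msg_id is not None:
                order[idx] = msg_id
    # 2. Fetch each message from its sender's log and replay it in that order.
    by_pid = {p.pid: p for p in survivors}
    fresh = Process(failed_pid)
    for idx in sorted(order):
        sender_pid, ssn = order[idx]
        fresh.delivered.append(by_pid[sender_pid].send_log[(failed_pid, ssn)])
        fresh.interval = idx
    return fresh


# Usage: C receives from A and B, then sends to A, so A learns C's receive order.
a, b, c = Process("A"), Process("B"), Process("C")
a.send(c, "m1")
b.send(c, "m2")
c.send(a, "m3")
recovered_c = recover("C", [a, b])
print(recovered_c.delivered)
```

In this toy model, the survivors can only reconstruct those receive events of the failed process that had already propagated to them through piggybacked graph information. Receives that no surviving process depends on are simply not replayed here.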