Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message logging, namely limited rollback and fast output commit, and the advantage of optimistic message logging, namely low failure-free overhead. These advantages come at the expense of a complex recovery scheme.
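
To make two of the ingredients named above more concrete, the following is a minimal sketch of how uncoordinated checkpointing and sender-based message logging can work together in general. It is not Manetho's algorithm: it leaves out antecedence graph maintenance and output commit entirely, and the `Process` class, its fields, and the toy state update are all hypothetical names chosen for illustration. The assumption is that each sender keeps a log of the messages it sent together with the order in which the receiver delivered them, so that a crashed receiver can restart from its own checkpoint and have the lost messages replayed in the original order.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy process (hypothetical) with a local checkpoint and a sender-side log."""
    name: str
    state: int = 0                                  # toy application state
    received: int = 0                               # messages delivered so far
    send_log: list = field(default_factory=list)    # (dest name, receive order, payload)
    checkpoint: tuple = (0, 0)                      # (state, received) at last checkpoint

    def send(self, dest, payload):
        # The receiver reports the position at which it delivered the message;
        # the sender records that position next to the payload in its own log.
        order = dest.deliver(payload)
        self.send_log.append((dest.name, order, payload))

    def deliver(self, payload):
        self.received += 1
        # Order-sensitive update, so replay must follow the original order.
        self.state = self.state * 2 + payload
        return self.received

    def take_checkpoint(self):
        # Uncoordinated: each process checkpoints on its own schedule.
        self.checkpoint = (self.state, self.received)

    def crash_and_recover(self, peers):
        # Roll back to the last local checkpoint only; peers do not roll back.
        self.state, self.received = self.checkpoint
        # Collect messages delivered after the checkpoint from the senders' logs.
        lost = [
            (order, payload)
            for peer in peers
            for (dest, order, payload) in peer.send_log
            if dest == self.name and order > self.received
        ]
        # Replay them in the original receive order.
        for _, payload in sorted(lost):
            self.deliver(payload)
```

In this sketch only the failed process rolls back, which illustrates the limited-rollback property in a simplified setting. A real protocol must also handle a sender failing before its log is saved, which is part of what makes recovery complex.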