Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
Checkpointing and Rollback-Recovery for Distributed Systems
IEEE Transactions on Software Engineering - Special issue on distributed systems
Information Processing Letters
ACM Transactions on Computer Systems (TOCS)
A software instruction counter
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
IGOR: a system for program debugging via reversible execution
PADD '88 Proceedings of the 1988 ACM SIGPLAN and SIGOPS workshop on Parallel and distributed debugging
Efficient distributed recovery using message logging
Proceedings of the eighth annual ACM Symposium on Principles of distributed computing
Recovery in distributed systems using optimistic message logging and check-pointing
Journal of Algorithms
Efficient checkpointing on MIMD architectures
Efficient checkpointing on MIMD architectures
Space reclamation for uncoordinated checkpointing in message-passing systems
Space reclamation for uncoordinated checkpointing in message-passing systems
Necessary and Sufficient Conditions for Consistent Global Snapshots
IEEE Transactions on Parallel and Distributed Systems
Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems
IEEE Transactions on Parallel and Distributed Systems
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Hypervisor-based fault tolerance
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Understanding the message logging paradigm for masking process crashes
Understanding the message logging paradigm for masking process crashes
Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems
IEEE Transactions on Parallel and Distributed Systems
Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints
IEEE Transactions on Computers
Application level fault tolerance in heterogeneous networks of workstations
Journal of Parallel and Distributed Computing
A Survey of Recoverable Distributed Shared Virtual Memory Systems
IEEE Transactions on Parallel and Distributed Systems
Support for Software Interrupts in Log-Based Rollback-Recovery
IEEE Transactions on Computers
Fail-stop processors: an approach to designing fault-tolerant computing systems
ACM Transactions on Computer Systems (TOCS)
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
Rollback Recovery in Distributed Systems Using Loosely Synchronized Clocks
IEEE Transactions on Parallel and Distributed Systems
Message Logging: Pessimistic, Optimistic, Causal, and Optimal
IEEE Transactions on Software Engineering
Performance of Consistent Checkpointing in a Modular Operating System: Results of the FTM Experiment
EDCC-1 Proceedings of the First European Dependable Computing Conference on Dependable Computing
Ensuring Data Security and Integrity with a Fast Stable Storage
Proceedings of the Fourth International Conference on Data Engineering
Experimental Evaluation of Concurrency Checkpointing and Rollback-Recovery Algorithms
Proceedings of the Sixth International Conference on Data Engineering
Virtual Precedence in Asynchronous Systems: Concept and Applications
WDAG '97 Proceedings of the 11th International Workshop on Distributed Algorithms
FTCS '97 Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS '97)
How Safe is Probabilistic Checkpointing?
FTCS '98 Proceedings of the Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
An Analysis of Communication-Induced Checkpointing
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
SOSP '81 Proceedings of the eighth ACM symposium on Operating systems principles
Converting a swap-based system to do paging in an architecture lacking page-referenced bits
SOSP '81 Proceedings of the eighth ACM symposium on Operating systems principles
Minimizing timestamp size for completely asynchronous optimistic recovery with minimal rollback
SRDS '96 Proceedings of the 15th Symposium on Reliable Distributed Systems
Preventing Useless Checkpoints in Distributed Computations
SRDS '97 Proceedings of the 16th Symposium on Reliable Distributed Systems
A VP-Accordant Checkpointing Protocol Preventing Useless Checkpoints
SRDS '98 Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems
The Cost of Recovery in Message Logging Protocols
SRDS '98 Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems
Why Optimistic Message Logging Has Not Been Used in Telecommunications Systems
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Adding input and output to the transactional model
Adding input and output to the transactional model
Distributed system fault tolerance using message logging and checkpointing
Distributed system fault tolerance using message logging and checkpointing
Manetho: fault tolerance in distributed systems using rollback-recovery and process replication
Manetho: fault tolerance in distributed systems using rollback-recovery and process replication
Libckpt: transparent checkpointing under Unix
TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings
Checkpoint-Recovery for Mobile Intelligent Networks
Proceedings of the 14th International conference on Industrial and engineering applications of artificial intelligence and expert systems: engineering of intelligent systems
MPICH-CM: A Communication Library Design for a P2P MPI Implementation
Proceedings of the 9th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
ReVirt: enabling intrusion analysis through virtual-machine logging and replay
ACM SIGOPS Operating Systems Review - OSDI '02: Proceedings of the 5th symposium on Operating systems design and implementation
On Properties of RDT Communication-Induced Checkpointing Protocols
IEEE Transactions on Parallel and Distributed Systems
Distributed recovery with K-optimistic logging
Journal of Parallel and Distributed Computing
ACM SIGCOMM Computer Communication Review
Improving Logging and Recovery Performance in Phoenix/App
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Energy-aware deterministic fault tolerance in distributed real-time embedded systems
Proceedings of the 41st annual Design Automation Conference
Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing
Recovery guarantees for Internet applications
ACM Transactions on Internet Technology (TOIT)
Concurrent checkpoint initiation and recovery algorithms on asynchronous ring networks
Journal of Parallel and Distributed Computing
Checkpointing-based rollback recovery for parallel applications on the InteGrade grid middleware
MGC '04 Proceedings of the 2nd workshop on Middleware for grid computing
Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery
IEEE Transactions on Dependable and Secure Computing
Replication for web hosting systems
ACM Computing Surveys (CSUR)
High-Availability Algorithms for Distributed Stream Processing
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 1 - Volume 02
Fault-Tolerant Parallel Applications with Dynamic Parallel Schedules
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 16 - Volume 17
ReVirt: enabling intrusion analysis through virtual-machine logging and replay
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation
Fault-tolerance in the Borealis distributed stream processing system
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
BugNet: Continuously Recording Program Execution for Deterministic Replay Debugging
Proceedings of the 32nd annual international symposium on Computer Architecture
A channel memory based fault tolerance for MPI applications
Future Generation Computer Systems - Special issue: Parallel computing technologies
Surviving Errors in Component-Based Software
EUROMICRO '05 Proceedings of the 31st EUROMICRO Conference on Software Engineering and Advanced Applications
Vigilante: end-to-end containment of internet worms
Proceedings of the twentieth ACM symposium on Operating systems principles
Speculative execution in a distributed file system
Proceedings of the twentieth ACM symposium on Operating systems principles
Rx: treating bugs as allergies---a safe method to survive software failures
Proceedings of the twentieth ACM symposium on Operating systems principles
Using Consistent Global Checkpoints to Synchronize Processes in Distributed Simulation
DS-RT '05 Proceedings of the 9th IEEE International Symposium on Distributed Simulation and Real-Time Applications
An Efficient Index-Based Checkpointing Protocol with Constant-Size Control Information on Messages
IEEE Transactions on Dependable and Secure Computing
Strategies for storage of checkpointing data using non-dedicated repositories on Grid systems
MGC '05 Proceedings of the 3rd international workshop on Middleware for grid computing
Design and Implementation of Multiple Fault-Tolerant MPI over Myrinet (M^3)
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
A Faster Checkpointing and Recovery Algorithm with a Hierarchical Storage Approach
HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
A virtual machine monitor for utilizing non-dedicated clusters
Proceedings of the twentieth ACM symposium on Operating systems principles
Fast and transparent recovery for continuous availability of cluster-based servers
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Evaluation of a Fault-Tolerance Mechanism for HLA-Based Distributed Simulations
Proceedings of the 20th Workshop on Principles of Advanced and Distributed Simulation
Performance analysis of different checkpointing and recovery schemes using stochastic model
Journal of Parallel and Distributed Computing
Finding a suitable checkpoint and recovery protocol for a distributed application
Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
Stabilizers: a modular checkpointing abstraction for concurrent functional programs
Proceedings of the eleventh ACM SIGPLAN international conference on Functional programming
In-network fault tolerance in networked sensor systems
DIWANS '06 Proceedings of the 2006 workshop on Dependability issues in wireless ad hoc networks and sensor networks
EOS2: unstoppable stateful PHP
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Checkpointing and rollback-recovery protocol integrated with VsSG protocol for RYW session guarantee
PDCN'06 Proceedings of the 24th IASTED international conference on Parallel and distributed computing and networks
Distributed data storage for opportunistic grids
Proceedings of the 3rd international Middleware doctoral symposium
Strategies for Checkpoint Storage on Opportunistic Grids
IEEE Distributed Systems Online
Implementing fault-tolerance in real-time systems by automatic program transformations
EMSOFT '06 Proceedings of the 6th ACM & IEEE International conference on Embedded software
ExecRecorder: VM-based full-system replay for attack analysis and system recovery
Proceedings of the 1st workshop on Architectural and system support for improving software dependability
Toward real-time image guided neurosurgery using distributed and grid computing
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Speculative execution in a distributed file system
ACM Transactions on Computer Systems (TOCS)
Efficient hardware checkpointing: concepts, overhead analysis, and implementation
Proceedings of the 2007 ACM/SIGDA 15th international symposium on Field programmable gate arrays
Declarative failure recovery for sensor networks
Proceedings of the 6th international conference on Aspect-oriented software development
Quasi-atomic recovery for distributed agents
Parallel Computing
Framework for instruction-level tracing and analysis of program executions
Proceedings of the 2nd international conference on Virtual execution environments
Flashback: a lightweight extension for rollback and deterministic replay for software debugging
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Log-based recovery for middleware servers
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
HOTDEP'06 Proceedings of the 2nd conference on Hot Topics in System Dependability - Volume 2
WiDS: an integrated toolkit for distributed system development
HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Detecting targeted attacks using shadow honeypots
SSYM'05 Proceedings of the 14th conference on USENIX Security Symposium - Volume 14
On the Complexity of Removing Z-Cycles from a Checkpoints and Communication Pattern
IEEE Transactions on Computers
Self-stabilizing algorithm for checkpointing in a distributed system
Journal of Parallel and Distributed Computing
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Modular Checkpointing for Atomicity
Electronic Notes in Theoretical Computer Science (ENTCS)
Rx: Treating bugs as allergies—a safe method to survive software failures
ACM Transactions on Computer Systems (TOCS)
Efficient checkpointing of java software using context-sensitive capture and replay
Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering
Modeling and design of fault-tolerant and self-adaptive reconfigurable networked embedded systems
EURASIP Journal on Embedded Systems
Bouncer: securing software by blocking bad input
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Transactions with isolation and cooperation
Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications
OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
Automated Rule-Based Diagnosis through a Distributed Monitor System
IEEE Transactions on Dependable and Secure Computing
Fault-tolerance in the borealis distributed stream processing system
ACM Transactions on Database Systems (TODS)
DS-RT '07 Proceedings of the 11th IEEE International Symposium on Distributed Simulation and Real-Time Applications
Execution replay of multiprocessor virtual machines
Proceedings of the fourth ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Better bug reporting with better privacy
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
A survey of linguistic structures for application-level fault tolerance
ACM Computing Surveys (CSUR)
Information Assurance: Dependability and Security in Networked Systems
Information Assurance: Dependability and Security in Networked Systems
Model-based performance evaluation of distributed checkpointing protocols
Performance Evaluation
A synchronous checkpointing protocol for mobile distributed systems: probabilistic approach
International Journal of Information and Computer Security
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Data-stream-based global event monitoring using pairwise interactions
Journal of Parallel and Distributed Computing
Dependability evaluation of dedicated server group orphan detection method
ICS'05 Proceedings of the 9th WSEAS International Conference on Systems
Preventing of burst traffic in DSG method
ICS'05 Proceedings of the 9th WSEAS International Conference on Systems
AMCOS'05 Proceedings of the 4th WSEAS International Conference on Applied Mathematics and Computer Science
A low-cost hybrid coordinated checkpointing protocol for mobile distributed systems
Mobile Information Systems
Communication analysis of distributed programs
Scientific Programming - Parallel/High-Performance Object-Oriented Scientific Computing (POOSC '05), Glasgow, UK, 25 July 2005
Designing and implementing malicious hardware
LEET'08 Proceedings of the 1st Usenix Workshop on Large-Scale Exploits and Emergent Threats
ACM Transactions on Computer Systems (TOCS)
Synthesis of fault-tolerant embedded systems
Proceedings of the conference on Design, automation and test in Europe
2-step algorithm for enhancing effectiveness of sender-based message logging
SpringSim '07 Proceedings of the 2007 spring simulation multiconference - Volume 2
Novel log management for sender-based message logging
ICAI'08 Proceedings of the 9th WSEAS International Conference on Automation and Information
Handling Emergent Nondeterminism in Replicated Services
Architecting Dependable Systems V
Providing Non-stop Service for Message-Passing Based Parallel Applications with RADIC
Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Fault-tolerant stream processing using a distributed, replicated file system
Proceedings of the VLDB Endowment
Vigilante: End-to-end containment of Internet worm epidemics
ACM Transactions on Computer Systems (TOCS)
WSEAS Transactions on Computers
The implementation and evaluation of a recovery system for workflows
Journal of Network and Computer Applications
Journal of Parallel and Distributed Computing
Journal of Parallel and Distributed Computing
Engineering of Software-Intensive Systems: State of the Art and Research Challenges
Software-Intensive Systems and New Computing Paradigms
A novel fault-tolerant execution model by using of mobile agents
Journal of Network and Computer Applications
Transactions on High-Performance Embedded Architectures and Compilers I
Sensornet Checkpointing: Enabling Repeatability in Testbeds and Realism in Simulations
EWSN '09 Proceedings of the 6th European Conference on Wireless Sensor Networks
Recovery domains: an organizing principle for recoverable operating systems
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Numerical computation algorithms for sequential checkpoint placement
Performance Evaluation
Algorithm-based fault tolerance applied to high performance computing
Journal of Parallel and Distributed Computing
Transparent checkpoints of closed distributed systems in Emulab
Proceedings of the 4th ACM European conference on Computer systems
A Checkpointing Method with Small Checkpoint Latency
IEICE - Transactions on Information and Systems
A systematic approach to system state restoration during storage controller micro-recovery
FAST '09 Proceedings of the 7th conference on File and storage technologies
RT-replayer: a record-replay architecture for embedded real-time software debugging
Proceedings of the 2009 ACM symposium on Applied Computing
FlashBox: a system for logging non-deterministic events in deployed embedded systems
Proceedings of the 2009 ACM symposium on Applied Computing
Dependability, Abstraction, and Programming
DASFAA '09 Proceedings of the 14th International Conference on Database Systems for Advanced Applications
Practical and low-overhead masking of failures of TCP-based servers
ACM Transactions on Computer Systems (TOCS)
Interconnect agnostic checkpoint/restart in open MPI
Proceedings of the 18th ACM international symposium on High performance distributed computing
In-field healing of integration problems with COTS components
ICSE '09 Proceedings of the 31st International Conference on Software Engineering
Characterizing fault tolerance in genetic programming
BADS '09 Proceedings of the 2009 workshop on Bio-inspired algorithms for distributed systems
Tolerating latency in replicated state machines through client speculation
NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
FlashLogging: exploiting flash devices for synchronous logging performance
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
International Journal of High Performance Computing Applications
CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Active Optimistic Message Logging for Reliable Execution of MPI Applications
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Message fragment based causal message logging
Journal of Parallel and Distributed Computing
Future Generation Computer Systems
Towards Zero-Delay Recovery of Agents in Production Automation Systems
WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 02
International Journal of High Performance Computing Applications
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Customizable execution environments with virtual desktop grid computing
PDCS '07 Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems
PLFS: a checkpoint filesystem for parallel applications
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
VGrADS: enabling e-Science workflows on grids and clouds with fault tolerance
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
FALCON: a system for reliable checkpoint recovery in shared grid environments
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Performance analysis of mobile agent failure recovery in e-service applications
Computer Standards & Interfaces
R-ECS: reliable elastic computing services for building virtual computing environment
Proceedings of the 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human
DSF: a common platform for distributed systems research and development
Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
An empirical study of high availability in stream processing systems
Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
A fault-tolerant strategy for virtualized HPC clusters
The Journal of Supercomputing
Autonomic fault mitigation in embedded systems
Engineering Applications of Artificial Intelligence
A load balancing fault-tolerant algorithm for heterogeneous cluster environments
Neural, Parallel & Scientific Computations
A weighted checkpointing protocol for mobile distributed systems
International Journal of Ad Hoc and Ubiquitous Computing
Journal of Parallel and Distributed Computing
VECPAR'06 Proceedings of the 7th international conference on High performance computing for computational science
A pattern-based approach for modeling and analyzing error recovery
Architecting dependable systems IV
Characterizing fault tolerance in genetic programming
Future Generation Computer Systems
An efficient handoff strategy for mobile computing checkpoint system
EUC'07 Proceedings of the 2007 international conference on Embedded and ubiquitous computing
Schedulable online testing framework for real-time embedded applications in VM
EUC'07 Proceedings of the 2007 international conference on Embedded and ubiquitous computing
A scalable asynchronous replication-based strategy for fault tolerant MPI applications
HiPC'07 Proceedings of the 14th international conference on High performance computing
A novel fault-tolerant parallel algorithm
APPT'07 Proceedings of the 7th international conference on Advanced parallel processing technologies
CPPC-G: fault-tolerant applications on the grid
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
An uncoordinated asynchronous checkpointing model for hierarchical scientific workflows
Journal of Computer and System Sciences
DSF: a common platform for distributed systems research and development
Middleware'09 Proceedings of the ACM/IFIP/USENIX 10th international conference on Middleware
Performance evaluation of an application-level checkpointing solution on grids
Future Generation Computer Systems
Enabling replication in the ASSISTANT programming model
Proceedings of the 6th International Wireless Communications and Mobile Computing Conference
Analysis of service availability for time-triggered rejuvenation policies
Journal of Systems and Software
Lightweight checkpointing for Concurrent ML
Journal of Functional Programming
A cost model for autonomic reconfigurations in high-performance pervasive applications
Proceedings of the 4th ACM International Workshop on Context-Awareness for Self-Managing Systems
Automatic workarounds for web applications
Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
DOM transactions for testing JavaScript
TAIC PART'10 Proceedings of the 5th international academic and industrial conference on Testing - practice and research techniques
CONCUR'10 Proceedings of the 21st international conference on Concurrency theory
Improving message logging protocols scalability through distributed event logging
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Checkpoint/restart-enabled parallel debugging
EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
Recent advances in checkpoint/recovery systems
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Checkpointing and rollback-recovery protocol for mobile systems with MW session guarantee
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Behavioral simulations in MapReduce
Proceedings of the VLDB Endowment
Compiler-support for robust multi-core computing
ISoLA'10 Proceedings of the 4th international conference on Leveraging applications of formal methods, verification, and validation - Volume Part I
Reliable distributed data stream management in mobile environments
Information Systems
Hybrid checkpointing using emerging nonvolatile memories for future exascale systems
ACM Transactions on Architecture and Code Optimization (TACO)
HotDep'06 Proceedings of the Second conference on Hot topics in system dependability
New & efficient low overheads algorithm for mobile distributed systems
Proceedings of the International Conference & Workshop on Emerging Trends in Technology
Architecting dependable systems with proactive fault management
Architecting dependable systems VII
A latency and fault-tolerance optimizer for online parallel query plans
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Fast checkpoint recovery algorithms for frequently consistent applications
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
RAFT at work: speeding-up mapreduce applications under task and node failures
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A Petri net model for service availability in redundant computing systems
Winter Simulation Conference
Algorithm-based recovery for iterative methods without checkpointing
Proceedings of the 20th international symposium on High performance distributed computing
From session guarantees to contract guarantees for consistency of SOA-compliant processing
ACIIDS'11 Proceedings of the Third international conference on Intelligent information and database systems - Volume Part I
Log-based middleware server recovery with transaction support
The VLDB Journal — The International Journal on Very Large Data Bases
Rebound: scalable checkpointing for coherent shared memory
Proceedings of the 38th annual international symposium on Computer architecture
On the design of perturbation-resilient atomic commit protocols for mobile transactions
ACM Transactions on Computer Systems (TOCS)
Federate Fault Tolerance in HLA-Based Simulation
PADS '10 Proceedings of the 2010 IEEE Workshop on Principles of Advanced and Distributed Simulation
Tolerating correlated failures for generalized Cartesian distributions via bipartite matching
Proceedings of the 8th ACM International Conference on Computing Frontiers
Causal cycle based communication pattern matching
ICDCN'10 Proceedings of the 11th international conference on Distributed computing and networking
Euro-Par 2010 Proceedings of the 2010 conference on Parallel processing
This survey covers rollback-recovery techniques that do not require special language constructs. In the first part of the survey we classify rollback-recovery protocols into checkpoint-based and log-based. Checkpoint-based protocols rely solely on checkpointing for system state restoration. Checkpointing can be coordinated, uncoordinated, or communication-induced. Log-based protocols combine checkpointing with logging of nondeterministic events, encoded in tuples called determinants. Depending on how determinants are logged, log-based protocols can be pessimistic, optimistic, or causal. Throughout the survey, we highlight the research issues that are at the core of rollback-recovery and present the solutions that currently address them. We also compare the performance of different rollback-recovery protocols with respect to a series of desirable properties and discuss the issues that arise in the practical implementations of these protocols.
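As an illustration of the log-based idea summarized above, the following minimal Python sketch shows a single process that checkpoints its state and logs a determinant for each message delivery before acting on it (a pessimistic flavor). It is not taken from the survey: the names `Determinant`, `Process`, `crash`, `recover`, and `run_demo`, the tuple fields, and the toy state update are all assumptions chosen for illustration, and in-memory fields stand in for stable storage.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Determinant:
    """Hypothetical tuple describing one nondeterministic event (a delivery)."""
    sender: str
    send_seq: int   # sequence number assigned by the sender
    recv_seq: int   # position in the receiver's delivery order
    payload: int


@dataclass
class Process:
    state: int = 0                 # volatile application state
    recv_count: int = 0            # volatile delivery counter
    checkpoint: tuple = (0, 0)     # stands in for stable storage
    stable_log: list = field(default_factory=list)  # stands in for stable storage

    def _apply(self, payload: int) -> None:
        # Order-sensitive update, so replay order matters.
        self.state = self.state * 2 + payload

    def deliver(self, sender: str, send_seq: int, payload: int) -> None:
        # Pessimistic flavor (assumed here): log the determinant first,
        # then let the delivery affect the state.
        self.recv_count += 1
        self.stable_log.append(
            Determinant(sender, send_seq, self.recv_count, payload)
        )
        self._apply(payload)

    def take_checkpoint(self) -> None:
        self.checkpoint = (self.state, self.recv_count)

    def crash(self) -> None:
        # Lose volatile state only; checkpoint and log survive.
        self.state = -1
        self.recv_count = 0

    def recover(self) -> None:
        # Restore the checkpoint, then replay logged deliveries made after it,
        # in their original delivery order.
        self.state, self.recv_count = self.checkpoint
        for det in sorted(self.stable_log, key=lambda d: d.recv_seq):
            if det.recv_seq > self.recv_count:
                self.recv_count = det.recv_seq
                self._apply(det.payload)


def run_demo() -> tuple:
    p = Process()
    p.deliver("q", 1, 3)
    p.deliver("r", 1, 5)
    p.take_checkpoint()
    p.deliver("q", 2, 7)
    p.deliver("r", 2, 1)
    before = p.state
    p.crash()
    p.recover()
    return before, p.state
```

In this sketch, recovery reaches the pre-failure state because every delivery after the checkpoint has a logged determinant. A checkpoint-only scheme would instead resume from the checkpointed state and lose the later deliveries, and optimistic or causal variants would differ in when and where the determinants are made stable.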