Abstract: The paper describes our experience with the implementation and applications of the Unix checkpointing library libckp, and identifies two concepts that have proven to be the key to making checkpointing a powerful tool. First, including all persistent states, i.e., user files, as part of the process state that can be checkpointed and recovered provides a truly transparent and consistent rollback. Second, excluding part of the persistent state from the process state allows user programs to process future inputs from a desirable state, which leads to interesting new applications of checkpointing. We use real-life examples to demonstrate the use of libckp for bypassing premature software exits, for fast initialization and for memory rejuvenation.
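The two concepts in the abstract can be illustrated with a toy sketch. This is not libckp's interface, which the abstract does not describe; the class `ToyCheckpoint`, the function `demo`, and the file names are all hypothetical, and the "process state" is just a Python dictionary. The sketch only shows the idea that files included in a checkpoint are rolled back together with in-memory state, while files deliberately excluded keep whatever was written after the checkpoint.

```python
import copy
import os
import tempfile


class ToyCheckpoint:
    """Hypothetical snapshot of in-memory state plus selected files."""

    def __init__(self, state, included_files):
        # Copy the volatile state so later mutations do not affect the snapshot.
        self.state = copy.deepcopy(state)
        # Save the contents of every file treated as part of the process state.
        self.files = {}
        for path in included_files:
            with open(path, "rb") as f:
                self.files[path] = f.read()

    def rollback(self):
        # Restore included files to their checkpointed contents.
        for path, data in self.files.items():
            with open(path, "wb") as f:
                f.write(data)
        # Hand back a fresh copy of the checkpointed in-memory state.
        return copy.deepcopy(self.state)


def demo():
    workdir = tempfile.mkdtemp()
    output_path = os.path.join(workdir, "output.txt")  # included in checkpoint
    input_path = os.path.join(workdir, "input.txt")    # deliberately excluded

    with open(output_path, "w") as f:
        f.write("partial result\n")
    with open(input_path, "w") as f:
        f.write("old input\n")

    state = {"step": 1}
    ckpt = ToyCheckpoint(state, [output_path])

    # Work done after the checkpoint: memory, output file, and input all change.
    state["step"] = 2
    with open(output_path, "a") as f:
        f.write("output from a run that later fails\n")
    with open(input_path, "w") as f:
        f.write("new input\n")

    # Roll back: memory and the included file revert, the excluded file does not.
    state = ckpt.rollback()
    with open(output_path) as f:
        output_text = f.read()
    with open(input_path) as f:
        input_text = f.read()
    return state, output_text, input_text
```

After `demo()` the in-memory state and `output.txt` are back at their checkpointed values, so the rollback is consistent, while `input.txt` still holds the newer input, so the restored state can go on to process future inputs.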