Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file, from which it can be recovered after a failure. Although recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint files, checkpointing remains unavailable to most application developers. In this paper we describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations reported in the literature. Although libckpt can be used in a mode that is almost totally transparent to the programmer, it also lets the programmer supply directives that guide how checkpoints are created. This user-directed checkpointing is an innovation unique to our work.
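The basic idea of saving state periodically and resuming from the last saved copy can be illustrated with a minimal application-level sketch. It is not libckpt and does not use libckpt's interface: the function names `save_checkpoint`, `load_checkpoint`, and `run`, the pickle file format, and the `fail_at` crash simulation are all assumptions made for illustration. A transparent checkpointer would save the whole process image rather than a hand-picked dictionary.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically: temp file first, then rename.

    Hypothetical helper for illustration; not part of libckpt.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Replace the old checkpoint in one step, so a crash during the write
    # never leaves a half-written checkpoint file behind.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    fail_at (an assumption of this sketch) simulates a failure by raising
    just before that step would execute.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run(path, 10, 3, fail_at=7)` raises after the checkpoint taken at step 6. A second call, `run(path, 10, 3)`, resumes from that checkpoint instead of from step 0 and returns 45, the same result as an uninterrupted run. Only the work done since the last checkpoint is repeated, which is why the checkpoint interval trades overhead during normal execution against the amount of work lost on failure.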