Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
Checkpointing and Rollback-Recovery for Distributed Systems
IEEE Transactions on Software Engineering - Special issue on distributed systems
Efficient distributed recovery using message logging
Proceedings of the eighth annual ACM Symposium on Principles of distributed computing
Efficient checkpointing on MIMD architectures
Efficient checkpointing on MIMD architectures
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Application level fault tolerance in heterogeneous networks of workstations
Journal of Parallel and Distributed Computing
Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme
IEEE Transactions on Computers
Fail-stop processors: an approach to designing fault-tolerant computing systems
ACM Transactions on Computer Systems (TOCS)
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
An Analysis of Communication-Induced Checkpointing
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Converting a swap-based system to do paging in an architecture lacking page-referenced bits
SOSP '81 Proceedings of the eighth ACM symposium on Operating systems principles
SPDP '96 Proceedings of the 8th IEEE Symposium on Parallel and Distributed Processing (SPDP '96)
Preventing Useless Checkpoints in Distributed Computations
SRDS '97 Proceedings of the 16th Symposium on Reliable Distributed Systems
Why Optimistic Message Logging Has Not Been Used in Telecommunications Systems
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Libckpt: transparent checkpointing under Unix
TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings
Cooperative checkpointing: a robust approach to large-scale systems reliability
Proceedings of the 20th annual international conference on Supercomputing
Automated application-level checkpointing based on live-variable analysis in MPI programs
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
A quasi-synchronous checkpointing algorithm that prevents contention for stable storage
Information Sciences: an International Journal
A quasi-synchronous checkpointing algorithm that prevents contention for stable storage
Information Sciences: an International Journal
Experimental Assessment of the Practicality of a Fault-Tolerant System
SOFSEM '07 Proceedings of the 33rd conference on Current Trends in Theory and Practice of Computer Science
Analytical study of migration-enhanced fault tolerance for long-running applications in IFR systems
International Journal of Parallel, Emergent and Distributed Systems
Algorithm-based fault tolerance applied to high performance computing
Journal of Parallel and Distributed Computing
A systematic approach to system state restoration during storage controller micro-recovery
FAST '09 Proccedings of the 7th conference on File and storage technologies
Modeling and Analysis of Checkpoint I/O Operations
ASMTA '09 Proceedings of the 16th International Conference on Analytical and Stochastic Modeling Techniques and Applications
ICS'08 Proceedings of the 12th WSEAS international conference on Systems
PLFS: a checkpoint filesystem for parallel applications
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
A study of dynamic meta-learning for failure prediction in large-scale systems
Journal of Parallel and Distributed Computing
Performance evaluation of an application-level checkpointing solution on grids
Future Generation Computer Systems
A flexible checkpoint/restart model in distributed systems
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Cooperative checkpointing theory
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Application-specific fault tolerance via data access characterization
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Adaptive event prediction strategy with dynamic time window for large-scale HPC systems
SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Evaluating the viability of process replication reliability for exascale systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Simulating application resilience at exascale
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
On the viability of checkpoint compression for extreme scale fault tolerance
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Future Generation Computer Systems
Checkpoint scheduling model for optimality
Information Processing Letters
Data-driven fault tolerance for work stealing computations
Proceedings of the 26th ACM international conference on Supercomputing
Ensuring reliability in B2B services: Fault tolerant inter-organizational workflows
Information Systems Frontiers
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
A multi-cycle checkpointing protocol that ensures strict 1-rollback
Information Processing Letters
Tuple switching network-When slower may be better
Journal of Parallel and Distributed Computing
McrEngine: a scalable checkpointing system using data-aware aggregation and compression
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Improving Bandwidth Efficiency for Consistent Multistream Storage
ACM Transactions on Storage (TOS)
A 1 PB/s file system to checkpoint three million MPI tasks
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
When is multi-version checkpointing needed?
Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Evaluating the feasibility of using memory content similarity to improve system resilience
Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers
McrEngine: A scalable checkpointing system using data-aware aggregation and compression
Scientific Programming - Selected Papers from Super Computing 2012
Hi-index | 0.00 |
Over the past two decades, rollback-recovery via checkpoint-restart has been used with reasonable success for long-running applications, such as scientific workloads that take from few hours to few months to complete. Currently, several commercial systems and publicly available libraries exist to support various flavors of checkpointing. Programmers typically use these systems if they are satisfactory or otherwise embed checkpointing support themselves within the application. In this paper, we project the performance and functionality of checkpointing algorithms and systems as we know them today into the future. We start by surveying the current technology roadmap and particularly how Peta-Flop capable systems may be plausibly constructed in the next few years. We consider how rollback-recovery as practiced today will fare when systems may have to be constructed out of thousands of nodes. Our projections predict that, unlike current practice, the effect of rollback-recovery may play a more prominent role in how systems may be configured to reach the desired performance level. System planners may have to devote additional resources to enable rollback-recovery and the current practice of using "cheap commodity驴 systems to form large-scale clusters may face serious obstacles. We suggest new avenues for research to react to these trends.