Failure correction techniques for large disk arrays
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
IGOR: a system for program debugging via reversible execution
PADD '88 Proceedings of the 1988 ACM SIGPLAN and SIGOPS workshop on Parallel and distributed debugging
Demonic memory for process histories
PLDI '89 Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation
Virtual memory primitives for user programs
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Redundant disk arrays: reliable, parallel secondary storage
Redundant disk arrays: reliable, parallel secondary storage
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Introduction to parallel computing: design and analysis of algorithms
Introduction to parallel computing: design and analysis of algorithms
Parallelization of the fast multipole algorithm: algorithm and architecture design
Parallelization of the fast multipole algorithm: algorithm and architecture design
RAID: high-performance, reliable secondary storage
ACM Computing Surveys (CSUR)
EVENODD: an optimal scheme for tolerating double disk failures in RAID architectures
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing
PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing
A case for two-level distributed recovery schemes
Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Lightweight logging for lazy release consistent distributed shared memory
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
Fault-tolerant matrix operations for networks of workstations using diskless checkpointing
Journal of Parallel and Distributed Computing
Application level fault tolerance in heterogeneous networks of workstations
Journal of Parallel and Distributed Computing
A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems
Software—Practice & Experience
Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme
IEEE Transactions on Computers
ickp: A Consistent Checkpointer for Multicomputers
IEEE Parallel & Distributed Technology: Systems & Technology
Low-Latency, Concurrent Checkpointing for Parallel Programs
IEEE Transactions on Parallel and Distributed Systems
Evaluation of checkpoint mechanisms for massively parallel machines
FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Fault Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing
HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
A longitudinal survey of Internet host reliability
SRDS '95 Proceedings of the 14TH Symposium on Reliable Distributed Systems
SRDS '96 Proceedings of the 15th Symposium on Reliable Distributed Systems
Checkpointing and Its Applications
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Fault Tolerance for Off-the-Shelf Applications and Hardware
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Transparent fault tolerance for parallel applications on networks of workstations
ATEC '96 Proceedings of the 1996 annual conference on USENIX Annual Technical Conference
A Variational Calculus Approach to Optimal Checkpoint Placement
IEEE Transactions on Computers
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Distributed Checkpointing on Clusters with Dynamic Striping and Staggering
ASIAN '02 Proceedings of the7th Asian Computing Science Conference on Advances in Computing Science: Internet Computing and Modeling, Grid Computing, Peer-to-Peer Computing, and Cluster
Dynamic Data Replication: An Approach to Providing Fault-Tolerant Shared Memory Clusters
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Quantifying rollback propagation in distributed checkpointing
Journal of Parallel and Distributed Computing
Adaptive incremental checkpointing for massively parallel systems
Proceedings of the 18th annual international conference on Supercomputing
Fault tolerant high performance computing by a coding approach
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Space-efficient page-level incremental checkpointing
Proceedings of the 2005 ACM symposium on Applied computing
Rx: treating bugs as allergies---a safe method to survive software failures
Proceedings of the twentieth ACM symposium on Operating systems principles
Strategies for storage of checkpointing data using non-dedicated repositories on Grid systems
MGC '05 Proceedings of the 3rd international workshop on Middleware for grid computing
Reliability challenges in large systems
Future Generation Computer Systems
A new approach to real-time checkpointing
Proceedings of the 2nd international conference on Virtual execution environments
Adaptive page-level incremental checkpointing based on expected recovery time
Proceedings of the 2006 ACM symposium on Applied computing
Strategies for Checkpoint Storage on Opportunistic Grids
IEEE Distributed Systems Online
Architecture of a Self-Checkpointing Microprocessor that Incorporates Nanomagnetic Devices
IEEE Transactions on Computers
Flashback: a lightweight extension for rollback and deterministic replay for software debugging
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Failure-aware checkpointing in fine-grained cycle sharing systems
Proceedings of the 16th international symposium on High performance distributed computing
Rx: Treating bugs as allergies—a safe method to survive software failures
ACM Transactions on Computer Systems (TOCS)
Analytical study of migration-enhanced fault tolerance for long-running applications in IFR systems
International Journal of Parallel, Emergent and Distributed Systems
Numerical computation algorithms for sequential checkpoint placement
Performance Evaluation
Self-recovery in server programs
Proceedings of the 2009 international symposium on Memory management
International Journal of High Performance Computing Applications
In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
International Journal of High Performance Computing Applications
PLFS: a checkpoint filesystem for parallel applications
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Optimal real number codes for fault tolerant matrix operations
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Reliability challenges in large systems
Future Generation Computer Systems
Design and performance evaluation of enhanced two-level recovery scheme
PDCN '08 Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks
ICCS'03 Proceedings of the 2003 international conference on Computational science: PartII
Diet: new developments and recent results
Euro-Par'06 Proceedings of the CoreGRID 2006, UNICORE Summit 2006, Petascale Computational Biology and Bioinformatics conference on Parallel processing
On checkpoint overhead in distributed systems providing session guarantees
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Relax: an architectural framework for software recovery of hardware faults
Proceedings of the 37th annual international symposium on Computer architecture
Distributed Diskless Checkpoint for Large Scale Systems
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Adaptive and Speculative Slack Simulations of CMPs on CMPs
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Algorithm-based recovery for HPL
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Hybrid checkpointing using emerging nonvolatile memories for future exascale systems
ACM Transactions on Architecture and Code Optimization (TACO)
High performance linpack benchmark: a fault tolerant implementation without checkpointing
Proceedings of the international conference on Supercomputing
Algorithm-based recovery for iterative methods without checkpointing
Proceedings of the 20th international symposium on High performance distributed computing
Tolerating correlated failures for generalized Cartesian distributions via bipartite matching
Proceedings of the 8th ACM International Conference on Computing Frontiers
FTI: high performance fault tolerance interface for hybrid systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Analyzing fault aware collective performance in a process fault tolerant MPI
Parallel Computing
Adaptive mobile checkpointing facility for wireless sensor networks
ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part II
Robust distributed orthogonalization based on randomized aggregation
Proceedings of the second workshop on Scalable algorithms for large-scale systems
Fault tolerant matrix-matrix multiplication: correcting soft errors on-line
Proceedings of the second workshop on Scalable algorithms for large-scale systems
Algorithm-based fault tolerance for dense matrix factorizations
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Performance evaluation of consistent recovery protocols using MPICH-GF
EDCC'05 Proceedings of the 5th European conference on Dependable Computing
On the viability of checkpoint compression for extreme scale fault tolerance
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Energy-aware I/O optimization for checkpoint and restart on a NAND flash memory system
Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Parallel reduction to hessenberg form with algorithm-based fault tolerance
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Evaluating energy savings for checkpoint/restart
E2SC '13 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing
X10-FT: Transparent fault tolerance for APGAS language and runtime
Parallel Computing
Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems
Scientific Programming - Selected Papers from Super Computing 2012
Hi-index | 0.01 |
Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.