Diskless Checkpointing

Authors:
James S. Plank; Kai Li; Michael A. Puening
Affiliations:
Univ. of Tennessee, Knoxville; Princeton Univ., Princeton, NJ; Cardinal Solutions Group, Inc., Cincinnati, OH
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1998

Abstract

Diskless checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. Because it avoids stable storage, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.
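
The abstract does not say how checkpoints survive a failure without stable storage. One way to picture it is a RAID-style parity encoding: each application process keeps its checkpoint in memory, and one extra process holds the bitwise XOR of all of them. The Python sketch below assumes that encoding and assumes equal-length checkpoints. The names `encode_parity`, `recover`, and the sample `states` are hypothetical and are not taken from the paper.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("checkpoints must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints: list[bytes]) -> bytes:
    """Parity held by the (assumed) extra checkpoint process."""
    return reduce(xor_bytes, checkpoints)


def recover(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one lost checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical in-memory checkpoints of three application processes.
states = [b"proc0-st", b"proc1-st", b"proc2-st"]
parity = encode_parity(states)

# Simulate losing process 1 and rebuilding its checkpoint.
lost = 1
survivors = [s for i, s in enumerate(states) if i != lost]
recovered = recover(survivors, parity)
```

With a single parity block this sketch tolerates the loss of one process at a time. Tolerating more simultaneous failures would need a stronger code, which is outside what this example shows.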