Distributed Checkpointing on Clusters with Dynamic Striping and Staggering

Authors:
Hai Jin;Kai Hwang
Affiliations:
-;-
Venue:
ASIAN '02 Proceedings of the 7th Asian Computing Science Conference on Advances in Computing Science: Internet Computing and Modeling, Grid Computing, Peer-to-Peer Computing, and Cluster
Year:
2002


Abstract

This paper presents a new striped and staggered checkpointing (SSC) scheme for multicomputer clusters. We consider serverless clusters, where local disks attached to cluster nodes collectively form a distributed RAID (redundant array of inexpensive disks) with a single I/O space. The distributed RAID is used to save the checkpoint files periodically. Striping enables parallel I/O on distributed disks. Staggering avoids the network bottleneck in distributed disk I/O operations. With a fixed cluster size, we reveal the tradeoffs between these two speedup techniques. Our SSC approach allows dynamic reconfiguration to minimize message-logging requirements among concurrent software processes. We demonstrate how to reduce the checkpointing overhead by striping and staggering dynamically. For communication-intensive programs, our SSC scheme can significantly reduce the checkpointing overhead. Benchmark results show the benefits of trading between stripe parallelism and distributed staggering. These results are useful for designing efficient checkpointing schemes for fast rollback recovery from any single node (disk) failure in a cluster of computers.
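The tradeoff between stripe parallelism and staggering at a fixed cluster size can be illustrated with a small sketch. This is not the authors' implementation. It assumes, purely for illustration, that the nodes are split into equal-sized groups, that all nodes in one group write their checkpoints in parallel (the stripe width), and that the groups take turns in successive time steps (the stagger depth), so that stripe width times stagger depth equals the number of nodes. The function names `stripe_stagger_configs` and `staggered_groups` are hypothetical.

```python
def stripe_stagger_configs(num_nodes):
    """List every (stripe_width, stagger_depth) pair whose product is num_nodes.

    Assumption for illustration only: equal-sized groups, so the stripe
    width must divide the number of nodes.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be positive")
    return [(w, num_nodes // w)
            for w in range(1, num_nodes + 1)
            if num_nodes % w == 0]


def staggered_groups(num_nodes, stripe_width):
    """Split node ids 0..num_nodes-1 into groups of stripe_width nodes.

    Group k is assumed to write its checkpoints during time step k.
    """
    if stripe_width < 1 or num_nodes % stripe_width != 0:
        raise ValueError("stripe_width must be a positive divisor of num_nodes")
    return [list(range(start, start + stripe_width))
            for start in range(0, num_nodes, stripe_width)]


if __name__ == "__main__":
    # Enumerate the possible configurations for a 12-node cluster.
    for width, depth in stripe_stagger_configs(12):
        print(f"stripe width {width:2d} x stagger depth {depth:2d}")
    # Three groups of four nodes, one group per time step.
    print(staggered_groups(12, 4))
```

Under these assumptions, a stripe width equal to the cluster size means every node writes at once (maximum parallel I/O, but the most simultaneous network traffic), while a stripe width of 1 means nodes write strictly one after another (no contention, but the longest schedule). Intermediate configurations trade one against the other.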