FALCON: a system for reliable checkpoint recovery in shared grid environments

Authors:
Tanzima Zerin Islam;Saurabh Bagchi;Rudolf Eigenmann
Affiliations:
Purdue University, West Lafayette, IN;Purdue University, West Lafayette, IN;Purdue University, West Lafayette, IN
Venue:
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Year:
2009

Citing 14
Cited 2

A survey of rollback-recovery protocols in message-passing systems

ACM Computing Surveys (CSUR)
Collective operations in application-level fault-tolerant MPI

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
A measurement study of available bandwidth estimation tools

Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement
Resource Policing to Support Fine-Grain Cycle Stealing in Networks of Workstations

IEEE Transactions on Parallel and Distributed Systems
Distributed computing in practice: the Condor experience: Research Articles

Concurrency and Computation: Practice & Experience - Grid Performance
Using Erasure Codes Efficiently for Storage in a Distributed System

DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
Awarded Best Student Paper! - Pond: The OceanStore Prototype

FAST '03 Proceedings of the 2nd USENIX Conference on File and Storage Technologies
Strategies for storage of checkpointing data using non-dedicated repositories on Grid systems

MGC '05 Proceedings of the 3rd international workshop on Middleware for grid computing
Empirical Studies on the Behavior of Resource Availability in Fine-Grained Cycle Sharing Systems

ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
Failure-aware checkpointing in fine-grained cycle sharing systems

Proceedings of the 16th international symposium on High performance distributed computing
BioBench: A Benchmark Suite of Bioinformatics Applications

ISPASS '05 Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2005
Multi-state grid resource availability characterization

GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
Scheduling on the Grid via multi-state resource availability prediction

GRID '08 Proceedings of the 2008 9th IEEE/ACM International Conference on Grid Computing
High availability in DHTs: erasure coding vs. replication

IPTPS'05 Proceedings of the 4th international conference on Peer-to-Peer Systems

McrEngine: a scalable checkpointing system using data-aware aggregation and compression

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
McrEngine: A scalable checkpointing system using data-aware aggregation and compression

Scientific Programming - Selected Papers from Super Computing 2012

Quantified Score

Hi-index	0.00

Visualization

Abstract

In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as their performance degradation is tolerable. However, unpredictable evictions of guest jobs lead to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such "failures". Today's FGCS systems often use expensive, high-performance dedicated checkpoint servers. However, in geographically distributed clusters, this may incur high checkpoint transfer latencies. In this paper we present a system called Falcon that uses available disk resources of the FGCS machines as shared checkpoint repositories. However, an unavailable storage host may lead to loss of checkpoint data. Therefore, we model failures of storage hosts and develop a prediction algorithm for choosing reliable checkpoint repositories. We experiment with Falcon in the university-wide Condor testbed at Purdue and show improved and consistent performance for guest jobs in the presence of irregular resource availability.