Failure-aware checkpointing in fine-grained cycle sharing systems

Authors:
Xiaojuan Ren;Rudolf Eigenmann;Saurabh Bagchi
Affiliations:
Purdue University;Purdue University;Purdue University
Venue:
Proceedings of the 16th international symposium on High performance distributed computing
Year:
2007

Citing 20
Cited 4

Combinatorial optimization: algorithms and complexity

Combinatorial optimization: algorithms and complexity
Efficient dispersal of information for security, load balancing, and fault tolerance

Journal of the ACM (JACM)
Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme

IEEE Transactions on Computers
Diskless Checkpointing

IEEE Transactions on Parallel and Distributed Systems
The network weather service: a distributed resource performance forecasting service for metacomputing

Future Generation Computer Systems - Special issue on metacomputing
A Variational Calculus Approach to Optimal Checkpoint Placement

IEEE Transactions on Computers
Sun Grid Engine: Towards Creating a Compute Power Grid

CCGRID '01 Proceedings of the 1st International Symposium on Cluster Computing and the Grid
Experimental Assessment of Workstation Failures and Their Impact on Checkpointing Systems

FTCS '98 Proceedings of the Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
An Evaluation of Linear Models for Host Load Prediction

HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing
Managing Network Resources in Condor

HPDC '00 Proceedings of the 9th IEEE International Symposium on High Performance Distributed Computing
The Kangaroo Approach to Data Movement on the Grid

HPDC '01 Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing
Resource Policing to Support Fine-Grain Cycle Stealing in Networks of Workstations

IEEE Transactions on Parallel and Distributed Systems
Distributed computing in practice: the Condor experience: Research Articles

Concurrency and Computation: Practice & Experience - Grid Performance
Using Erasure Codes Efficiently for Storage in a Distributed System

DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
Strategies for storage of checkpointing data using non-dedicated repositories on Grid systems

MGC '05 Proceedings of the 3rd international workshop on Middleware for grid computing
Optimal Resilience for Erasure-Coded Byzantine Distributed Storage

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Empirical Studies on the Behavior of Resource Availability in Fine-Grained Cycle Sharing Systems

ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
Performance implications of failures in large-scale cluster scheduling

JSSPP'04 Proceedings of the 10th international conference on Job Scheduling Strategies for Parallel Processing
Wave scheduler: scheduling for faster turnaround time in peer-based desktop grid systems

JSSPP'05 Proceedings of the 11th international conference on Job Scheduling Strategies for Parallel Processing
iShare – open internet sharing built on peer-to-peer and web

EGC'05 Proceedings of the 2005 European conference on Advances in Grid Computing

Grid workflow scheduling based on reliability cost

Proceedings of the 2nd international conference on Scalable information systems
Taking snapshots of virtual networked environments

VTDC '07 Proceedings of the 2nd international workshop on Virtualization technology in distributed computing
An analysis of clustered failures on large supercomputing systems

Journal of Parallel and Distributed Computing
FALCON: a system for reliable checkpoint recovery in shared grid environments

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

Abstract

Fine-Grained Cycle Sharing (FGCS) systems aim at utilizing the large amount of idle computational resources available on the Internet. Such systems allow guest jobs to run on a host if they do not significantly impact the local users of the host. Since the hosts are typically provided voluntarily, their availability fluctuates greatly. To provide fault tolerance to guest jobs without adding significant computational overhead, we propose failure-aware checkpointing techniques that apply knowledge of resource availability to select checkpoint repositories and to determine checkpoint intervals. We present schemes for selecting reliable and efficient repositories from the non-dedicated hosts that contribute their disk storage. These schemes are formulated as 0/1 programming problems that minimize the network overhead of transferring checkpoints and the work lost when a storage host is unavailable at the time it is needed to recover a guest job. We determine the checkpoint interval by comparing the cost of checkpointing immediately with the cost of delaying the checkpoint to a later time, which is a function of the resource availability. We evaluate these techniques on an FGCS system called iShare, using trace-based simulation. The results show that they achieve better application performance than the prevalent methods, which use checkpointing with a fixed periodicity on dedicated checkpoint servers.
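
The two decisions described in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation, which the abstract does not give. The cost model, the function names, and all parameters are assumptions made for illustration: each checkpoint is replicated on exactly k hosts, hosts become unavailable independently, and the cost of delaying a checkpoint is approximated by the expected loss of unsaved work.

```python
from itertools import combinations


def select_repositories(transfer_cost, unavail_prob, lost_work, k):
    """Pick k storage hosts by exhaustive 0/1 search (hypothetical cost model).

    transfer_cost[i]: assumed network cost of sending a checkpoint to host i.
    unavail_prob[i]:  assumed probability that host i is unavailable at recovery.
    lost_work:        assumed work lost if no chosen host can serve the checkpoint.
    """
    best_cost, best_set = float("inf"), None
    for subset in combinations(range(len(transfer_cost)), k):
        network = sum(transfer_cost[i] for i in subset)
        # The checkpoint is unrecoverable only if every chosen host is down
        # (independence between hosts is an assumption of this sketch).
        p_all_down = 1.0
        for i in subset:
            p_all_down *= unavail_prob[i]
        cost = network + p_all_down * lost_work
        if cost < best_cost:
            best_cost, best_set = cost, subset
    return best_set, best_cost


def should_checkpoint_now(checkpoint_cost, unsaved_work, p_fail_during_delay):
    """Compare checkpointing immediately with delaying (simplified rule).

    Delaying risks losing the unsaved work with probability
    p_fail_during_delay; checkpoint now if that expected loss exceeds
    the cost of taking the checkpoint.
    """
    expected_loss_if_delayed = p_fail_during_delay * unsaved_work
    return expected_loss_if_delayed > checkpoint_cost
```

Exhaustive enumeration is only practical for a small number of candidate hosts; a real 0/1 programming solver would replace the loop for larger instances. The failure probability passed to `should_checkpoint_now` stands in for the resource-availability prediction that the abstract refers to.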