Fine-Grained Cycle Sharing (FGCS) systems aim to utilize the large amount of idle computational resources available on the Internet. Such systems allow guest jobs to run on a host as long as they do not significantly impact the host's local users. Because hosts are typically provided voluntarily, their availability fluctuates greatly. To provide fault tolerance to guest jobs without adding significant computational overhead, we propose failure-aware checkpointing techniques that apply knowledge of resource availability to select checkpoint repositories and to determine checkpoint intervals. We present schemes for selecting reliable and efficient repositories from among the non-dedicated hosts that contribute their disk storage. These schemes are formulated as 0/1 programming problems that minimize the network overhead of transferring checkpoints and the work lost when a storage host is unavailable at the time it is needed to recover a guest job. We determine the checkpoint interval by comparing the cost of checkpointing immediately with the cost of delaying the checkpoint to a later time, which is a function of resource availability. We evaluate these techniques on an FGCS system called iShare, using trace-based simulation. The results show that they achieve better application performance than the prevalent methods, which checkpoint at a fixed period to dedicated checkpoint servers.
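The two decisions described above can be illustrated with a minimal Python sketch. It is not the paper's formulation: the cost model, the function names (`select_repositories`, `checkpoint_now`), and all parameters are hypothetical. The sketch assumes that each chosen repository stores a full checkpoint replica, that hosts fail independently, and that delaying a checkpoint risks losing all unsaved work with a single given probability. The 0/1 problem is solved by brute-force enumeration, which is practical only for a handful of candidate hosts.

```python
from itertools import combinations
from math import prod


def select_repositories(transfer_cost, availability, k, lost_work):
    """Choose k storage hosts minimizing overhead plus expected lost work.

    Each host is either selected (1) or not (0); this enumerates every
    0/1 assignment with exactly k ones. Assumed model (hypothetical):
    full replicas and independent failures, so recovery fails only when
    every selected host is unavailable at the same time.
    """
    hosts = range(len(transfer_cost))
    best_cost, best_set = float("inf"), None
    for chosen in combinations(hosts, k):
        # Network overhead of shipping the checkpoint to each chosen host.
        overhead = sum(transfer_cost[h] for h in chosen)
        # Probability that no chosen host is available at recovery time.
        p_all_down = prod(1.0 - availability[h] for h in chosen)
        cost = overhead + p_all_down * lost_work
        if cost < best_cost:
            best_cost, best_set = cost, chosen
    return best_set, best_cost


def checkpoint_now(checkpoint_cost, unsaved_work, p_fail_if_delayed):
    """Return True when checkpointing immediately is cheaper than delaying.

    Toy comparison (hypothetical): delaying risks losing the unsaved work
    with probability p_fail_if_delayed, which in practice would come from
    a resource availability prediction.
    """
    expected_loss_if_delayed = p_fail_if_delayed * unsaved_work
    return checkpoint_cost < expected_loss_if_delayed


if __name__ == "__main__":
    repos, cost = select_repositories(
        transfer_cost=[1.0, 2.0, 5.0],   # per-host transfer overhead
        availability=[0.5, 0.9, 0.99],   # predicted availability per host
        k=2,
        lost_work=100.0,                 # work lost if recovery fails
    )
    print("selected hosts:", repos, "expected cost:", cost)
    print("checkpoint now?", checkpoint_now(2.0, 100.0, 0.05))
```

In this toy model, a host that is cheap to reach but often unavailable can still be worth selecting when it is paired with a highly available one, which is the trade-off between network overhead and lost work that the selection schemes address.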