In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as the performance degradation they experience is tolerable. Unpredictable evictions of guest jobs, however, lead to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such "failures". Today's FGCS systems often use expensive, high-performance dedicated checkpoint servers, which in geographically distributed clusters may incur high checkpoint transfer latencies. In this paper we present a system called Falcon that instead uses the available disk resources of the FGCS machines as shared checkpoint repositories. Because an unavailable storage host may lead to loss of checkpoint data, we model failures of storage hosts and develop a prediction algorithm for choosing reliable checkpoint repositories. We experiment with Falcon in the university-wide Condor testbed at Purdue and show improved and consistent performance for guest jobs in the presence of irregular resource availability.
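The idea of choosing reliable checkpoint repositories by predicting storage-host failures can be illustrated with a minimal sketch. This is not Falcon's actual failure model or prediction algorithm, which the abstract does not describe. It assumes a much simpler stand-in: each host has a recorded history of uninterrupted availability intervals, and a host's reliability is estimated as the empirical fraction of those intervals that lasted at least as long as the checkpoint must be kept. The names `StorageHost`, `survival_probability`, `choose_repositories`, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StorageHost:
    name: str
    # Hypothetical history: lengths (in hours) of past uninterrupted
    # availability intervals observed for this host.
    uptime_hours: List[float]


def survival_probability(host: StorageHost, horizon_hours: float) -> float:
    """Empirical fraction of past intervals lasting at least horizon_hours."""
    if not host.uptime_hours:
        return 0.0  # assumption: a host with no history is treated as unreliable
    survived = sum(1 for u in host.uptime_hours if u >= horizon_hours)
    return survived / len(host.uptime_hours)


def choose_repositories(hosts: List[StorageHost],
                        horizon_hours: float,
                        k: int) -> List[StorageHost]:
    """Return the k hosts with the highest estimated survival probability."""
    ranked = sorted(hosts,
                    key=lambda h: survival_probability(h, horizon_hours),
                    reverse=True)
    return ranked[:k]


# Made-up sample data for illustration only.
hosts = [
    StorageHost("A", [1.0, 2.0, 10.0, 12.0]),
    StorageHost("B", [8.0, 9.0, 10.0, 11.0]),
    StorageHost("C", [0.5, 1.0, 1.5, 20.0]),
]

# Pick two repositories for a checkpoint that must survive 6 hours.
selected = choose_repositories(hosts, horizon_hours=6.0, k=2)
print([h.name for h in selected])
```

With this sample data, host B is ranked first because all of its recorded intervals exceed the 6-hour horizon, followed by host A with half of them. A realistic predictor would also need to account for factors this sketch ignores, such as time-of-day patterns, available disk space, and the network cost of transferring the checkpoint.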