Strategies for storage of checkpointing data using non-dedicated repositories on Grid systems

Authors: Raphael Y. de Camargo; Renato Cerqueira; Fabio Kon
Affiliations: University of São Paulo, Brazil; PUC-Rio, Brazil; University of São Paulo, Brazil
Venue: MGC '05 Proceedings of the 3rd international workshop on Middleware for grid computing
Year: 2005

Abstract

Dealing with the large amounts of data generated by long-running parallel applications is one of the most challenging aspects of Grid Computing. Periodic checkpoints might be taken to guarantee application progression, producing even more data. The classical approach is to employ high-throughput checkpoint servers connected to the computational nodes by high-speed networks. In the case of Opportunistic Grid Computing, we do not want to be forced to rely on such dedicated hardware. Instead, we want to use the shared Grid nodes to store application data in a distributed fashion.

In this work, we evaluate several strategies to store checkpoints on distributed non-dedicated repositories. We consider the tradeoff among computational overhead, storage overhead, and degree of fault-tolerance of these strategies. We compare the use of replication, parity information, and information dispersal (IDA). We used InteGrade, an object-oriented Grid middleware, to implement the storage strategies and perform evaluation experiments.
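To make the storage-overhead side of this tradeoff concrete, the following Python sketch may help. It is not the InteGrade implementation, and all function and parameter names are hypothetical. It assumes a checkpoint split into m equal fragments, shows how a single XOR parity fragment can rebuild one lost fragment, and computes the ratio of stored bytes to checkpoint size for each strategy. The IDA ratio assumes the usual formulation in which any m of the m + k stored fragments suffice to recover the data.

```python
from functools import reduce
from typing import List, Optional


def split(data: bytes, m: int) -> List[bytes]:
    """Split data into m equal-size fragments, zero-padding the tail."""
    size = -(-len(data) // m)  # ceiling division
    padded = data.ljust(size * m, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(m)]


def xor_parity(fragments: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length fragments."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*fragments))


def recover(fragments: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild at most one missing fragment (marked None) using the parity."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) > 1:
        raise ValueError("a single parity fragment tolerates only one loss")
    restored = list(fragments)
    if missing:
        survivors = [f for f in fragments if f is not None]
        # XOR of all surviving fragments and the parity yields the lost one.
        restored[missing[0]] = xor_parity(survivors + [parity])
    return restored


# Stored bytes divided by checkpoint size (padding ignored).
def stored_ratio_replication(copies: int) -> float:
    """Each of the `copies` repositories holds a full checkpoint."""
    return float(copies)


def stored_ratio_parity(m: int) -> float:
    """m data fragments plus one parity fragment of the same size."""
    return (m + 1) / m


def stored_ratio_ida(m: int, k: int) -> float:
    """m + k fragments, each 1/m of the checkpoint; tolerates k losses."""
    return (m + k) / m
```

With m = 4, for instance, parity stores 1.25 times the checkpoint size and tolerates one lost fragment, while two full replicas store 2 times the size and tolerate one lost replica. The sketch leaves out computational overhead, which the abstract names as the other axis of the comparison.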