Strategies for Checkpoint Storage on Opportunistic Grids
IEEE Distributed Systems Online
Dealing with the large amounts of data generated by long-running parallel applications is one of the most challenging aspects of Grid Computing. Periodic checkpoints may be taken to guarantee application progress, producing even more data. The classical approach is to employ high-throughput checkpoint servers connected to the computational nodes by high-speed networks. In Opportunistic Grid Computing, we do not want to be forced to rely on such dedicated hardware. Instead, we want to use the shared Grid nodes to store application data in a distributed fashion.

In this work, we evaluate several strategies for storing checkpoints on distributed, non-dedicated repositories. We consider the tradeoff among the computational overhead, storage overhead, and degree of fault tolerance of these strategies. We compare the use of replication, parity information, and an information dispersal algorithm (IDA). We used InteGrade, an object-oriented Grid middleware, to implement the storage strategies and perform the evaluation experiments.
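
The storage and fault-tolerance sides of this tradeoff can be illustrated with a small sketch. It is not taken from the paper or from InteGrade: the `Strategy` class, the helper functions, and all parameter values are hypothetical and chosen only for illustration. The sketch assumes that replication stores r full copies, that parity splits a checkpoint into m fragments plus one XOR parity fragment, and that IDA stores n fragments of which any m suffice for reconstruction. Computational overhead is not modelled, and the actual IDA encoding (arithmetic over a finite field) is omitted; only the fragment counts are used.

```python
from __future__ import annotations

from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Strategy:
    """Hypothetical model: a checkpoint is cut into fragments of equal size."""
    name: str
    stored_fragments: int   # fragments written to the repositories
    needed_fragments: int   # fragments required to rebuild the checkpoint

    @property
    def storage_overhead(self) -> float:
        # Each fragment holds 1/needed_fragments of the checkpoint, so the
        # total stored size relative to the checkpoint is stored/needed.
        return self.stored_fragments / self.needed_fragments

    @property
    def tolerated_failures(self) -> int:
        # Repositories that may fail while recovery is still possible.
        return self.stored_fragments - self.needed_fragments


def replication(copies: int) -> Strategy:
    return Strategy(f"replication x{copies}", copies, 1)


def parity(data_fragments: int) -> Strategy:
    return Strategy(f"parity {data_fragments}+1", data_fragments + 1, data_fragments)


def ida(total: int, needed: int) -> Strategy:
    return Strategy(f"IDA({total},{needed})", total, needed)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(data: bytes, m: int) -> list[bytes]:
    """Split data into m zero-padded fragments and append one XOR parity fragment."""
    size = -(-len(data) // m)  # ceiling division
    padded = data.ljust(size * m, b"\0")
    fragments = [padded[i * size:(i + 1) * size] for i in range(m)]
    return fragments + [reduce(xor_bytes, fragments)]


def recover_missing(fragments: list[bytes | None]) -> bytes:
    """Rebuild the single missing fragment (marked None) by XOR-ing the rest."""
    return reduce(xor_bytes, [f for f in fragments if f is not None])


if __name__ == "__main__":
    for s in (replication(3), parity(4), ida(6, 4)):
        print(f"{s.name}: overhead {s.storage_overhead:.2f}x, "
              f"tolerates {s.tolerated_failures} failure(s)")
```

Under these assumptions, replication buys fault tolerance at the highest storage cost, parity is the cheapest but survives only one lost fragment, and IDA sits in between with a tunable ratio. The price IDA pays is the extra encoding and decoding computation, which this sketch does not measure.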