In recent years, grid technology has emerged as an important tool for solving compute-intensive problems within the scientific community and in industry. To further the development and adoption of this technology, researchers and practitioners from different disciplines have collaborated to produce standard specifications for implementing large-scale, interoperable grid systems. The focus of this activity has been the Open Grid Forum, but other standards development organizations have also produced specifications that are used in grid systems. To date, these specifications have provided the basis for a growing number of operational grid systems used in scientific and industrial applications. However, if the growth of grid technology is to continue, grid systems must also provide high reliability. In particular, it will be critical to ensure that grid systems remain reliable as they grow in scale, exhibit greater dynamism, and become more heterogeneous in composition. Ensuring grid system reliability in turn requires that the specifications used to build these systems fully support reliable grid services. This study surveys work on grid reliability done in recent years and reviews progress made toward achieving these goals. The survey identifies important issues and problems that researchers are working to overcome in order to develop reliability methods for large-scale, heterogeneous, dynamic environments. The survey also illuminates reliability issues relating to standard specifications used in grid systems, identifying existing specifications that may need to evolve and areas where new specifications are needed to better support reliability. Published in 2009 by John Wiley & Sons, Ltd. This article is a U.S. Government work and is in the public domain in the U.S.A.