Data federation systems virtualize access to enterprise data resources by integrating data from disparate, heterogeneous operational data sources on an on-demand, real-time basis. The key challenge of low-latency data access in such real-time data federation systems can be addressed by a grid-based scale-out architecture. However, failure of resources in the grid can pose serious challenges for data federation, because query processing is federated over multiple grid nodes. In such real-time systems, it is often desirable to recover from a failure and continue operation rather than repeat the entire process. This paper proposes a decentralized failure-recovery protocol for data federation systems using a data-spaces-based architecture. The generic nature of the protocol makes it extensible to applications other than data federation. Moreover, the protocol does not assume the availability of any central repository for recovering from failure. We implement the proposed failure-recovery protocol in a simulation environment and present the key findings of the study.