The generic, heterogeneous, and dynamic nature of the Grid requires a new form of failure recovery mechanism that addresses its unique requirements: support for diverse failure handling strategies, separation of failure handling strategies from application code, and user-defined exception handling. We propose the Grid Workflow System (Grid-WFS), a flexible failure handling framework for the Grid that addresses these Grid-specific failure recovery requirements. Central to the framework is flexibility in handling failures. We describe how to achieve this flexibility by using workflow structure as a high-level recovery policy specification. We show how this use of high-level workflow structure allows users to achieve failure recovery in a variety of ways, depending on the requirements and constraints of their applications. We also demonstrate that it enables users not only to rapidly prototype and investigate failure handling strategies, but also to change them easily by modifying the encompassing workflow structure while the application code remains intact. Finally, we present an experimental evaluation of our framework using a simulation, demonstrating the value of supporting multiple failure recovery techniques in Grid systems to achieve high performance in the presence of failures.
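The idea of expressing the recovery policy in the workflow structure rather than in the application code can be illustrated with a minimal sketch. It is not the Grid-WFS implementation or its specification language. The `Task` class, the `execute` function, the `make_flaky` helper, and the two policies shown (retry and alternative task) are hypothetical names and choices made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    """Hypothetical workflow node: application code plus a recovery policy kept outside it."""
    name: str
    run: Callable[[], str]                 # application code, unaware of recovery
    max_tries: int = 1                     # assumed policy: retry the same task
    alternative: Optional["Task"] = None   # assumed policy: fall back to another task


def execute(task: Task, log: list) -> str:
    """Run a task, applying its retry policy first, then its alternative if one exists."""
    last_error = None
    for attempt in range(1, task.max_tries + 1):
        try:
            result = task.run()
            log.append(f"{task.name}: ok on attempt {attempt}")
            return result
        except Exception as exc:
            last_error = exc
            log.append(f"{task.name}: failed on attempt {attempt}")
    if task.alternative is not None:
        log.append(f"{task.name}: switching to {task.alternative.name}")
        return execute(task.alternative, log)
    raise RuntimeError(f"{task.name} failed with no recovery left") from last_error


def make_flaky(failures: int, value: str) -> Callable[[], str]:
    """Stand-in application code that fails a fixed number of times, then succeeds."""
    state = {"left": failures}

    def run() -> str:
        if state["left"] > 0:
            state["left"] -= 1
            raise OSError("simulated resource failure")
        return value

    return run


log: list = []

# Policy A: retry the same task up to three times.
retry_result = execute(Task("fast_solver", make_flaky(2, "fast"), max_tries=3), log)

# Policy B: the same application code, but the structure now says
# "try once, then run an alternative task".
backup = Task("slow_solver", make_flaky(0, "slow"))
alt_result = execute(Task("fast_solver", make_flaky(2, "fast"), alternative=backup), log)

print(retry_result, alt_result)
```

In this sketch, switching from policy A to policy B changes only the `Task` definitions; the function returned by `make_flaky` is built the same way in both cases, which mirrors the claim that strategies can be changed while the application code remains intact.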