Architecture-based fault tolerance support for grid applications

Authors:
Iman I. Yusuf;Heinz W. Schmidt;Ian D. Peake
Affiliations:
RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia
Venue:
Proceedings of the joint ACM SIGSOFT conference -- QoSA and ACM SIGSOFT symposium -- ISARCS on Quality of software architectures -- QoSA and architecting critical systems -- ISARCS
Year:
2011

Abstract

Failure in long-running grid applications is arguably inevitable and costly. Therefore, fault tolerance (FT) support for grid applications is needed. This paper evaluates an extension of our prior work on Recovery Aware Components (RAC), a component-based FT approach. Our extension makes use of the grid application's architecture, categorized into a small number of architectural classes. In this paper, we evaluate the MapReduce architecture only and analyze the reliability improvement MapReduce applications would gain by adopting the RAC approach. Our analysis shows that significant increases in reliability are possible at moderate extra cost. The cost of FT depends on the failure rate of the managed system, i.e., the system to be protected from faults, and on the FT strategy chosen. Our work aims to give High Performance Computing (HPC) software architects the tools to control these factors for different grid application architectures.
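
To make the dependence of reliability and cost on failure rate and FT strategy concrete, the following is a minimal illustrative sketch. It is not the RAC model evaluated in the paper. It assumes, purely for illustration, that a MapReduce job consists of independent tasks with a constant (exponential) failure rate, that the job succeeds only if every task succeeds, and that the FT strategy is a simple restart of a failed task up to a fixed number of times. All function and parameter names are hypothetical.

```python
import math


def task_reliability(failure_rate, duration):
    """Probability that one task attempt finishes without failure,
    assuming a constant failure rate (exponential model)."""
    return math.exp(-failure_rate * duration)


def job_reliability(n_tasks, failure_rate, duration, max_retries=0):
    """Probability that all n_tasks independent tasks succeed when each
    task may be restarted up to max_retries times (assumed strategy)."""
    r = task_reliability(failure_rate, duration)
    # A task fails overall only if every one of its attempts fails.
    r_with_retries = 1.0 - (1.0 - r) ** (max_retries + 1)
    return r_with_retries ** n_tasks


def expected_attempts(failure_rate, duration, max_retries=0):
    """Expected number of attempts per task. Attempt i+1 is started only
    if the first i attempts failed. Counting every attempt as a full run
    overestimates the cost, since a failed attempt stops early."""
    q = 1.0 - task_reliability(failure_rate, duration)
    return sum(q ** i for i in range(max_retries + 1))


if __name__ == "__main__":
    # Hypothetical numbers: 100 tasks, 0.001 failures per hour, 1-hour tasks.
    for retries in (0, 1, 2):
        rel = job_reliability(100, 0.001, 1.0, max_retries=retries)
        cost = expected_attempts(0.001, 1.0, max_retries=retries)
        print(f"retries={retries} reliability={rel:.6f} attempts/task={cost:.6f}")
```

Under these assumptions, the expected number of attempts per task grows only slightly with the retry limit when the per-task failure probability is small, while job reliability rises sharply. This is the kind of trade-off between failure rate, FT strategy, and cost that the abstract describes.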