Analyzing scalability of parallel algorithms and architectures
Journal of Parallel and Distributed Computing - Special issue on scalability of parallel algorithms and architectures
The Unified Modeling Language user guide
The Unified Modeling Language user guide
On the Optimum Checkpoint Interval
Journal of the ACM (JACM)
Component Software: Beyond Object-Oriented Programming
Component Software: Beyond Object-Oriented Programming
Experimental Assessment of Workstation Failures and Their Impact on Checkpointing Systems
FTCS '98 Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
A Fault Detection Service for Wide Area Distributed Computations
HPDC '98 Proceedings of the 7th IEEE International Symposium on High Performance Distributed Computing
Trustworthy components-compositionality and prediction
Journal of Systems and Software - Special issue on: Component-based software engineering
Reliability prediction for component-based software architectures
Journal of Systems and Software - Special issue on: Software architecture - Engineering quality attributes
The Grid 2: Blueprint for a New Computing Infrastructure
The Grid 2: Blueprint for a New Computing Infrastructure
Fault-tolerant grid services using primary-backup: feasibility and performance
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Taverna: lessons in creating a workflow environment for the life sciences: Research Articles
Concurrency and Computation: Practice & Experience - Workflow in Grid Systems
MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008
Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids
IEEE Transactions on Parallel and Distributed Systems
Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids
CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Evaluating recovery aware components for grid reliability
Proceedings of the the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering
Proceedings of the 16th International ACM Sigsoft symposium on Component-based software engineering
Hi-index | 0.00 |
Failure in long running grid applications is arguably inevitable and costly. Therefore, fault tolerance (FT) support for grid applications is needed. This paper evaluates an extension of our prior work on Recovery Aware Components (RAC), a component based FT approach. Our extension utilizes the grid application architecture according to a small number of architectural classes. In this paper, we evaluate the MapReduce architecture only and analyze the reliability improvement MapReduce applications would gain by adopting the RAC approach. Our analysis shows that significant increases in reliability are possible at moderate extra cost. Obviously the cost of FT depends on the failure rate of the managed system, i.e., the system to be protected from faults, and the FT strategy chosen. Our work aims to give High Performance Computing (HPC) software architects the tools to control these factors for dierent grid application architectures.