Parameterised architectural patterns for providing cloud service fault tolerance with accurate costings

Authors:
Iman I. Yusuf;Heinz W. Schmidt
Affiliations:
RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia
Venue:
Proceedings of the 16th International ACM SIGSOFT Symposium on Component-Based Software Engineering
Year:
2013

Abstract

Cloud computing presents a unique opportunity for science and engineering, with benefits over traditional high-performance computing, especially for smaller compute jobs and users new to parallel computing. However, doubts remain about production high-performance computing in the cloud, the so-called science cloud, because predictable performance, reliability and therefore costs remain elusive for many applications. This paper uses parameterised architectural patterns to assist with fault tolerance and cost predictions for science clouds, in which a single job typically holds many virtual machines for a long time, communication can involve massive data movements, and buffered streams allow parallel processing to proceed while data transfers are still incomplete. We use predictive models, simulation and actual runs to estimate run times with acceptable accuracy for two of the most common architectural patterns for data-intensive scientific computing: MapReduce and Combinational Logic. Run times are fundamental to understanding the fee-for-service costs of clouds, which are typically charged by the hour and by the number of compute nodes or cores used. We evaluate our models using realistic cloud experiments from collaborative physics research projects and show that proactive and reactive fault tolerance are manageable, predictable and composable, in principle, especially at the architectural level.
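
The link the abstract draws between run time and fee-for-service cost can be illustrated with a minimal sketch. It is not the paper's cost model: the function name `estimate_cost`, the per-node hourly rate, and the rule that each started hour is billed in full are all assumptions made here for illustration.

```python
import math


def estimate_cost(run_time_hours: float, nodes: int, rate_per_node_hour: float) -> float:
    """Estimate the charge for one job that holds `nodes` machines for its whole run.

    Assumption (not from the paper): every started hour is billed in full,
    so the run time is rounded up to whole hours before multiplying.
    """
    if run_time_hours < 0 or nodes < 0 or rate_per_node_hour < 0:
        raise ValueError("inputs must be non-negative")
    billed_hours = math.ceil(run_time_hours)
    return billed_hours * nodes * rate_per_node_hour


if __name__ == "__main__":
    # Hypothetical job: 2.25 hours on 16 nodes at 0.5 currency units per node-hour.
    print(estimate_cost(2.25, 16, 0.5))
```

Under this rounding assumption, a small error in the predicted run time changes the bill only when it crosses an hour boundary, but it is then multiplied by the number of nodes held, which is one reason run-time prediction matters for long jobs on many machines.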