Cloud computing presents a unique opportunity for science and engineering, with benefits over traditional high-performance computing, especially for smaller compute jobs and for users new to parallel computing. However, doubts remain about production high-performance computing in the cloud, the so-called science cloud, because predictable performance, reliability and therefore costs remain elusive for many applications. This paper uses parameterised architectural patterns to assist with fault tolerance and cost predictions for science clouds, in which a single job typically holds many virtual machines for a long time, communication can involve massive data movements, and buffered streams allow parallel processing to proceed while data transfers are still incomplete. We use predictive models, simulation and actual runs to estimate run times with acceptable accuracy for two of the most common architectural patterns for data-intensive scientific computing: MapReduce and Combinational Logic. Run times are fundamental to understanding the fee-for-service costs of clouds, which are typically charged by the hour and by the number of compute nodes or cores used. We evaluate our models using realistic cloud experiments from collaborative physics research projects and show that proactive and reactive fault tolerance are manageable, predictable and composable, in principle, especially at the architectural level.
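To illustrate why run-time estimates drive cost estimates under hourly, per-node charging, the following is a minimal sketch and not the paper's model. The function name, the rate and the rule that partial hours are billed as full hours are all assumptions made for illustration; real providers may bill at a finer granularity.

```python
import math


def estimate_cost(run_time_hours, nodes, rate_per_node_hour, round_up_hours=True):
    """Estimate the fee for one job charged per node and per hour.

    All parameters are hypothetical illustration values:
    run_time_hours      predicted wall-clock run time of the job
    nodes               number of compute nodes (or cores) held by the job
    rate_per_node_hour  price of one node for one hour
    round_up_hours      assumption: partial hours are billed as full hours
    """
    if run_time_hours < 0 or nodes < 0 or rate_per_node_hour < 0:
        raise ValueError("inputs must be non-negative")
    billed_hours = math.ceil(run_time_hours) if round_up_hours else run_time_hours
    return billed_hours * nodes * rate_per_node_hour


if __name__ == "__main__":
    # A job predicted to run 2.5 hours on 10 nodes at a made-up rate of 0.5 per node-hour.
    print(estimate_cost(2.5, 10, 0.5))
    print(estimate_cost(2.5, 10, 0.5, round_up_hours=False))
```

Under the rounding assumption, a small error in the predicted run time can push a job across an hour boundary and raise the bill for every node it holds, which is one reason accurate run-time prediction matters for cost.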