Modern scientific collaborations have opened up the opportunity to solve complex problems that involve multi-disciplinary expertise and large-scale computational experiments. These experiments usually involve large amounts of data located in distributed data repositories that run various software systems and are managed by different organisations. A common strategy for making the experiments more manageable is to execute the processing steps as a workflow. In this paper, we look into implementing the fine-grained data-flow between computational elements in a scientific workflow as streams. We model the distributed computation as a directed acyclic graph whose nodes represent processing elements that incrementally implement specific subtasks. The processing elements are connected in a pipelined streaming manner, which allows task executions to overlap. We further optimise the execution by splitting pipelines across processes and by introducing extra parallel streams. We identify performance metrics and design a measurement tool to evaluate each enactment. We conducted experiments to evaluate our optimisation strategies on a real-world problem in the Life Sciences, EURExpress-II. The paper presents our distributed data-handling model, the optimisation and instrumentation strategies, and the evaluation experiments. We demonstrate linear speed-up and argue that this use of data streaming to enable both overlapped pipeline and parallelised enactment is a generally applicable optimisation strategy.
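The pipelined and parallel-stream ideas described above can be illustrated with a minimal Python sketch. It is not the system evaluated in the paper: the names `source`, `processing_element`, `pipeline`, and `run_parallel` are hypothetical, generators stand in for streams between processing elements, and a thread pool stands in for the separate processes across which pipelines are split.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, Iterator, List


def source(n: int) -> Iterator[int]:
    """Hypothetical data source: emits items one at a time."""
    for i in range(n):
        yield i


def processing_element(func: Callable[[Any], Any],
                       stream: Iterable[Any]) -> Iterator[Any]:
    """Consume the input stream incrementally and emit one result per item."""
    for item in stream:
        yield func(item)


def pipeline(stream: Iterable[Any],
             stages: List[Callable[[Any], Any]]) -> Iterator[Any]:
    """Chain processing elements; each item flows through every stage
    before the next item is pulled, so the stages overlap in time."""
    for stage in stages:
        stream = processing_element(stage, stream)
    return iter(stream)


def run_parallel(items: List[Any],
                 stages: List[Callable[[Any], Any]],
                 n_streams: int) -> List[Any]:
    """Split the input round-robin into n_streams partitions, run one
    pipeline per partition, and merge the results in the original order."""
    partitions = [items[i::n_streams] for i in range(n_streams)]
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        results = list(pool.map(lambda part: list(pipeline(part, stages)),
                                partitions))
    merged: List[Any] = [None] * len(items)
    for i, part in enumerate(results):
        merged[i::n_streams] = part  # undo the round-robin split
    return merged


# Two illustrative stages (assumptions, not subtasks from the paper).
stages = [lambda x: x + 1, lambda x: x * 2]
sequential = list(pipeline(source(5), stages))
parallel = run_parallel(list(range(7)), stages, n_streams=3)
```

In CPython, threads do not speed up CPU-bound stages; a process pool or separate hosts would be needed to observe real speed-up, so the sketch only shows the structure of the two optimisations.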