Prior work has shown that the design of scientific workflows can benefit from a collection-oriented modeling paradigm that views scientific workflows as pipelines of XML stream processors. In this paper, we present approaches for exploiting data parallelism in XML processing pipelines through novel strategies for compiling them to the MapReduce framework. Pipelines in our approach consist of sequences of processing steps that receive XML-structured data and produce, often through calls to "black-box" (scientific) functions, modified (i.e., updated) XML structures. Our main contributions are (i) the development of a set of strategies for compiling scientific workflows, modeled as XML processing pipelines, into parallel MapReduce networks, and (ii) a discussion of their advantages and trade-offs, based on a thorough experimental evaluation of the various translation strategies. Our evaluation uses the Hadoop MapReduce system as an implementation platform. Our results show that execution times of XML workflow pipelines can be significantly reduced using our compilation strategies. These efficiency gains, together with the benefits of MapReduce (e.g., fault tolerance), make our approach ideal for executing large-scale, compute-intensive XML-based scientific workflows.
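The general idea of data-parallel XML processing can be illustrated with a minimal Python sketch. It is not one of the compilation strategies of the paper and does not use Hadoop; it only assumes a simple scheme in which the children of the root element are split into independent records, a map step applies a black-box function to each record, and a reduce step reassembles the updated document in its original order. All names (`black_box`, `split`, `map_step`, `reduce_step`, `run_pipeline`) are hypothetical, and attributes on the root element are ignored for brevity.

```python
import xml.etree.ElementTree as ET


def black_box(value: str) -> str:
    """Hypothetical stand-in for an opaque scientific function."""
    return value.upper()


def split(doc: str):
    """Split a document into (position, fragment) records, one per child of the root."""
    root = ET.fromstring(doc)
    records = [
        (i, ET.tostring(child, encoding="unicode"))
        for i, child in enumerate(root)
    ]
    return root.tag, records


def map_step(record):
    """Map: parse one fragment, apply the black box, emit the updated fragment."""
    position, fragment = record
    elem = ET.fromstring(fragment)
    elem.text = black_box(elem.text or "")
    return position, ET.tostring(elem, encoding="unicode")


def reduce_step(root_tag, mapped):
    """Reduce: restore document order and reassemble the updated XML."""
    root = ET.Element(root_tag)
    for _, fragment in sorted(mapped):
        root.append(ET.fromstring(fragment))
    return ET.tostring(root, encoding="unicode")


def run_pipeline(doc: str) -> str:
    root_tag, records = split(doc)
    # Each record is processed independently, so this loop is the part
    # that a MapReduce runtime could distribute across workers.
    mapped = [map_step(r) for r in records]
    return reduce_step(root_tag, mapped)


if __name__ == "__main__":
    print(run_pipeline("<samples><s>acgt</s><s>ttga</s></samples>"))
```

Because each record carries its position as a key, the map outputs may arrive in any order and the reduce step can still rebuild the document; this is the property that makes the per-record work safe to parallelize in the sketch.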