Parallelizing XML data-streaming workflows via MapReduce

  • Authors:
  • Daniel Zinn (Department of Computer Science, University of California, Davis, United States)
  • Shawn Bowers (UC Davis Genome Center, University of California, Davis, United States; Department of Computer Science, Gonzaga University, United States)
  • Sven Köhler (UC Davis Genome Center, University of California, Davis, United States)
  • Bertram Ludäscher (Department of Computer Science, University of California, Davis, United States; UC Davis Genome Center, University of California, Davis, United States)

  • Venue:
  • Journal of Computer and System Sciences
  • Year:
  • 2010

Abstract

Prior work has shown that the design of scientific workflows can benefit from a collection-oriented modeling paradigm that views scientific workflows as pipelines of XML stream processors. In this paper, we present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the MapReduce framework. Pipelines in our approach consist of sequences of processing steps that receive XML-structured data and produce, often through calls to "black-box" (scientific) functions, modified (i.e., updated) XML structures. Our main contributions are (i) the development of a set of strategies for compiling scientific workflows, modeled as XML processing pipelines, into parallel MapReduce networks, and (ii) a discussion of their advantages and trade-offs, based on a thorough experimental evaluation of the various translation strategies. Our evaluation uses the Hadoop MapReduce system as an implementation platform. Our results show that execution times of XML workflow pipelines can be significantly reduced using our compilation strategies. These efficiency gains, together with the benefits of MapReduce (e.g., fault tolerance), make our approach ideal for executing large-scale, compute-intensive XML-based scientific workflows.
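To make the compilation target concrete, below is a minimal sketch of one pipeline step expressed as a Hadoop map task. This is not the authors' published code: the class name PipelineStepMapper and the blackBox helper are hypothetical stand-ins, and the sketch assumes each input record carries one XML fragment per line, with an opaque scientific function applied to every fragment in parallel.

    // Minimal sketch (assumed, not the paper's implementation): one XML
    // pipeline step compiled to a Hadoop map task.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PipelineStepMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {

        // Hypothetical black-box step: takes an XML fragment and returns an
        // updated fragment. A real step would invoke an external scientific
        // tool rather than this placeholder string rewrite.
        private String blackBox(String xmlFragment) {
            return xmlFragment.replace("<data>", "<data processed=\"true\">");
        }

        @Override
        protected void map(LongWritable offset, Text fragment, Context context)
                throws IOException, InterruptedException {
            // Re-emit the input byte offset as the key so a downstream step
            // (or an identity reducer) can restore the original document order
            // after the fragments are processed in parallel.
            context.write(offset, new Text(blackBox(fragment.toString())));
        }
    }

Chaining several such map (and reduce) stages, one per pipeline step, yields the kind of parallel MapReduce network the compilation strategies produce; the paper's contribution lies in choosing how to split, group, and route the XML fragments across those stages.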