HadoopXML: a suite for parallel processing of massive XML data with multiple twig pattern queries

Authors:
Hyebong Choi; Kyong-Ha Lee; Soo-Hyong Kim; Yoon-Joon Lee; Bongki Moon
Affiliations:
KAIST, Daejeon, South Korea; KAIST, Daejeon, South Korea; KAIST, Daejeon, South Korea; KAIST, Daejeon, South Korea; University of Arizona, Tucson, AZ, USA
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Abstract

The volume of XML data is tremendous in many areas, especially in data logging and scientific domains. XML data in these areas accumulate over time as new data are continuously collected. It is a challenge to process massive XML data with multiple twig pattern queries issued by multiple users in a timely manner. We showcase HadoopXML, a system that simultaneously processes many twig pattern queries over a massive volume of XML data with Hadoop. Specifically, HadoopXML provides an efficient way to process a single large XML file in parallel. It processes multiple twig pattern queries simultaneously with a shared input scan, so users do not need to run a separate MapReduce job for each query. HadoopXML also reduces I/O by enabling twig pattern queries to share their path solutions with each other. Moreover, HadoopXML provides a sophisticated runtime load balancing scheme that fairly assigns multiple twig pattern joins across nodes. Using synthetic and real-world XML datasets, we demonstrate how efficiently HadoopXML processes many twig pattern queries in a shared and balanced way.
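
The idea of a shared input scan can be illustrated with a minimal single-machine sketch. It is not HadoopXML's implementation: it assumes queries are plain root-to-leaf paths rather than full twig patterns, it uses Python's standard streaming XML parser instead of Hadoop, and the function name `shared_scan` and the sample document are hypothetical. What it shows is that one pass over the input can feed every registered query, instead of one pass per query.

```python
import io
import xml.etree.ElementTree as ET


def shared_scan(xml_source, path_queries):
    """Answer several root-to-leaf path queries in ONE pass over the input.

    Assumption: each query is a simple absolute path such as "/log/entry/host".
    Real twig patterns (branches, descendant axes) are not handled here.
    """
    results = {q: [] for q in path_queries}
    # Pre-split each query into its sequence of tag names.
    targets = {q: q.strip("/").split("/") for q in path_queries}
    stack = []  # tags of the currently open elements, root first

    for event, elem in ET.iterparse(xml_source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            # The element is complete: test it against every query at once.
            for query, steps in targets.items():
                if stack == steps:
                    results[query].append((elem.text or "").strip())
            stack.pop()
            elem.clear()  # release the element's content to bound memory use
    return results


# Hypothetical sample input and queries, for illustration only.
doc = (
    b"<log>"
    b"<entry><host>a</host><code>200</code></entry>"
    b"<entry><host>b</host><code>500</code></entry>"
    b"</log>"
)
queries = ["/log/entry/host", "/log/entry/code"]
answers = shared_scan(io.BytesIO(doc), queries)
print(answers)
```

Both queries are answered from the same scan, so adding a third query adds only one more comparison per element and no additional read of the document.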