Parallel labeling of massive XML data with MapReduce

Authors:
Hyebong Choi; Kyong-Ha Lee; Yoon-Joon Lee
Affiliations:
Department of Computer Science, KAIST, Yuseong-gu, Daejeon, Republic of Korea 305-701; Intelligent Convergence Media Research Department, Broadcasting & Telecommunications Media Research Laboratory, ETRI, Yuseong-gu, Daejeon, Republic of Korea 305-700; Department of Computer Science, KAIST, Yuseong-gu, Daejeon, Republic of Korea 305-701
Venue:
The Journal of Supercomputing
Year:
2014

Abstract

The volume of XML data is already enormous and continues to grow quickly, as XML's simplicity and extensibility have led much data to be encoded in it. Tree labeling plays a crucial role in XML query processing, but conventional labeling algorithms are all sequential and therefore cannot label large volumes of XML data in a timely manner. To address this issue, we devise parallel tree labeling algorithms for massive XML data. Specifically, we focus on how to efficiently label a single large XML file in parallel. We first propose MapReduce-based parallel versions of two prominent tree labeling schemes. We then present techniques for runtime workload balancing and data repartitioning that address the performance problems caused by data skew and MapReduce's inherent limitations. Through extensive experiments with synthetic and real-world datasets on 15 nodes, we show that our parallel labeling algorithms are up to 17 times faster than conventional algorithms and remain robust against data skew.
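
The abstract does not name the two labeling schemes or describe the algorithms, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a containment-style label (start, end, level) and a two-pass approach: each chunk of the tag stream first reports its net depth change, prefix sums then give every chunk its starting position and depth, and finally events are grouped by level and paired into labels. All function names are hypothetical, the map and reduce phases are simulated sequentially in one process, and the tokenizer assumes plain start and end tags only (no self-closing tags, comments, or CDATA).

```python
import re
from collections import defaultdict
from itertools import accumulate

TAG = re.compile(r"<(/?)(\w+)[^>]*>")

def tokenize(xml):
    # Simplifying assumption: only plain start/end tags appear in the input.
    return [("close" if slash else "open", name) for slash, name in TAG.findall(xml)]

def split(tokens, n_chunks):
    # Cut the token stream into roughly equal chunks, ignoring element boundaries.
    size = max(1, -(-len(tokens) // n_chunks))  # ceiling division
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def depth_delta(chunk):
    # Pass 1 (per chunk): net change in nesting depth across the chunk.
    return sum(1 if kind == "open" else -1 for kind, _ in chunk)

def map_chunk(chunk, first_pos, first_depth):
    # Pass 2 (per chunk): emit (level, global position, kind, name) for each tag.
    depth, out = first_depth, []
    for offset, (kind, name) in enumerate(chunk):
        if kind == "open":
            depth += 1
            out.append((depth, first_pos + offset, kind, name))
        else:
            out.append((depth, first_pos + offset, kind, name))
            depth -= 1
    return out

def reduce_level(level, events):
    # Within one level, open and close events alternate in document order,
    # so consecutive pairs belong to the same element.
    events = sorted(events)
    return [(name, start, end, level)
            for (start, _, name), (end, _, _) in zip(events[0::2], events[1::2])]

def label(xml, n_chunks=3):
    chunks = split(tokenize(xml), n_chunks)
    # Prefix sums give each chunk its global starting position and depth.
    first_pos = [0] + list(accumulate(len(c) for c in chunks))[:-1]
    first_depth = [0] + list(accumulate(depth_delta(c) for c in chunks))[:-1]
    by_level = defaultdict(list)  # stands in for the shuffle, keyed by level
    for chunk, pos, depth in zip(chunks, first_pos, first_depth):
        for level, position, kind, name in map_chunk(chunk, pos, depth):
            by_level[level].append((position, kind, name))
    labels = []
    for level, events in by_level.items():
        labels.extend(reduce_level(level, events))
    return sorted(labels, key=lambda item: item[1])

if __name__ == "__main__":
    for entry in label("<a><b><c></c></b><d></d></a>"):
        print(entry)
```

With labels of this form, element x is an ancestor of element y exactly when x.start < y.start and y.end < x.end, which is why such labels are useful for structural queries. Keying the reduce step by level is a choice made here for brevity; it would itself be prone to skew on shallow, wide documents, which is the kind of problem the abstract's workload balancing techniques are aimed at.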