ChuQL: processing XML with XQuery using Hadoop

Authors:
Shahan Khatchadourian;Mariano Consens;Jérôme Siméon
Affiliations:
University of Toronto;University of Toronto;IBM Watson Research
Venue:
Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research
Year:
2011

Citing 7

Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007

MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008

Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Nephele/PACTs: a programming model and execution framework for web-scale analytical processing
Proceedings of the 1st ACM symposium on Cloud computing

DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

Dremel: interactive analysis of web-scale datasets
Proceedings of the VLDB Endowment

ASTERIX: towards a scalable, semistructured data platform for evolving-world models
Distributed and Parallel Databases

Cited 1

Entity matching for semistructured data in the Cloud
Proceedings of the 27th Annual ACM Symposium on Applied Computing

Abstract

Hadoop provides an economical tool for processing large amounts of data; its success has been fueled in part by features such as fault tolerance and a simple processing model. The amount of XML used in scientific, government, and enterprise data has grown substantially, and several high-level languages developed for Hadoop can process semi-structured data such as XML. ChuQL is a recently proposed extension to XQuery for processing XML natively using Hadoop. The current implementation of ChuQL leverages an existing main-memory XQuery processor and faces two challenges: intermediate XML values that grow larger than memory, and huge quantities of output files. We describe two ChuQL constructs to overcome these limitations: using an iterator to process XML value sequences, and partitioning the job output. We give experimental evidence to help evaluate the tradeoffs when using these advanced ChuQL features.
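
The two ideas named in the abstract, iterating over a value sequence instead of materializing it and partitioning job output into a bounded number of groups, can be illustrated outside ChuQL. The following is a minimal Python sketch of the general techniques only; it is not ChuQL syntax and not the authors' implementation. The function names `reduce_streaming` and `partition_output`, the sample data, and the use of CRC32 as the partitioning hash are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
import zlib


def reduce_streaming(sorted_pairs, combine):
    """Yield (key, result) per key, consuming values one at a time.

    Hypothetical helper: `sorted_pairs` must already be sorted by key,
    as it would be after a MapReduce shuffle phase.
    """
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        # `group` is a lazy iterator, so the values for a key are never
        # collected into an in-memory list before being combined.
        values = (value for _, value in group)
        yield key, combine(values)


def partition_output(results, num_partitions):
    """Assign each (key, result) to one of `num_partitions` buckets.

    Hypothetical helper: bounding the number of buckets bounds the number
    of output files, instead of producing one file per key.
    """
    buckets = defaultdict(list)
    for key, result in results:
        # CRC32 is an assumed choice; it is stable across runs, unlike
        # Python's built-in hash() on strings.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        buckets[index].append((key, result))
    return dict(buckets)


# Illustrative data: (key, value) pairs as a map phase might emit them.
pairs = sorted(
    [("b", 2), ("a", 1), ("a", 3), ("c", 5), ("b", 4)],
    key=itemgetter(0),
)
results = list(reduce_streaming(pairs, sum))
buckets = partition_output(results, num_partitions=2)
```

In this sketch the buckets are held in memory for simplicity; in a real job each bucket would correspond to one output file written incrementally. The tradeoff is the usual one: fewer partitions mean fewer, larger files, while streaming restricts the reducer to a single pass over its values.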