ChuQL: processing XML with XQuery using Hadoop

  • Authors: Shahan Khatchadourian; Mariano Consens; Jérôme Siméon
  • Affiliations: University of Toronto; University of Toronto; IBM Watson Research
  • Venue: Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research
  • Year: 2011

Abstract

Hadoop provides an economical tool for processing large amounts of data; its success has been fueled in part by features such as fault tolerance and a simple processing model. The amount of XML used in scientific, government, and enterprise data has grown substantially, and several high-level languages developed for Hadoop can process semi-structured data such as XML. ChuQL is a recently proposed extension to XQuery for processing XML natively using Hadoop. The current implementation of ChuQL leverages an existing main-memory XQuery processor and faces two challenges: intermediate XML values that grow larger than memory, and very large numbers of output files. We describe two ChuQL constructs that overcome these limitations: using an iterator to process XML value sequences, and partitioning the job output. We give experimental evidence to help evaluate the tradeoffs of using these advanced ChuQL features.