Hadoop provides an economical tool for processing large amounts of data; its success has been fueled in part by features such as fault tolerance and a simple processing model. The amount of XML used in scientific, government, and enterprise data has grown substantially, and several high-level languages developed for Hadoop can process semi-structured data such as XML. ChuQL is a recently proposed extension to XQuery for processing XML natively on Hadoop. The current implementation of ChuQL leverages an existing main-memory XQuery processor and faces two challenges: intermediate XML values that grow larger than memory, and jobs that produce huge numbers of output files. We describe two ChuQL constructs that overcome these limitations: an iterator for processing sequences of XML values, and partitioning of the job output. We present experimental evidence to help evaluate the tradeoffs of using these advanced ChuQL features.
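The two constructs address different resource limits: the iterator avoids materializing an entire XML sequence in memory, and output partitioning caps the number of files a job writes. As a hedged illustration (not ChuQL syntax, and not the paper's implementation), the same two ideas can be sketched in Python using only the standard library: a streaming parse that yields one element at a time, and a deterministic hash that routes each record to one of a fixed number of output partitions.

```python
# Illustrative sketch only: a Python analogue of the two ChuQL ideas,
# not the actual ChuQL constructs. Tag name "item" and attribute "key"
# are hypothetical.
import io
import zlib
import xml.etree.ElementTree as ET

def iterate_items(xml_stream):
    """Stream <item> elements one at a time instead of building the
    whole sequence in memory (the iterator construct's idea)."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "item":
            yield elem.get("key"), elem.text
            elem.clear()  # release the subtree once it has been consumed

def partition(key, num_partitions=4):
    """Map a record key to one of a fixed number of output partitions
    (the output-partitioning idea): many records, bounded file count."""
    return zlib.crc32(key.encode()) % num_partitions
```

Because `iterate_items` is a generator over a streaming parser, peak memory is proportional to one element rather than the whole document; `partition` trades per-record output files for a small, fixed set of partition files.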