SciHadoop: array-based query processing in Hadoop

Authors:
Joe B. Buck;Noah Watkins;Jeff LeFevre;Kleoni Ioannidou;Carlos Maltzahn;Neoklis Polyzotis;Scott Brandt
Affiliations:
UC Santa Cruz;UC Santa Cruz;UC Santa Cruz;UC Santa Cruz;UC Santa Cruz;UC Santa Cruz;UC Santa Cruz
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Abstract

Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte-stream data model causes inefficiencies when used to process scientific data, which is commonly stored in highly structured, array-based binary file formats, and this limits the scalability of Hadoop applications in science. We introduce SciHadoop, a Hadoop plugin that allows scientists to specify logical queries over array-based data models. SciHadoop executes queries as map/reduce programs defined over the logical data model. We describe the implementation of a SciHadoop prototype for NetCDF data sets and quantify the performance of five separate optimizations that address the following goals for several representative aggregate queries: reduce total data transfers, reduce remote reads, and reduce unnecessary reads. Two optimizations allow holistic aggregate queries to be evaluated opportunistically during the map phase; two additional optimizations intelligently partition input data to increase read locality; and one optimization avoids block scans by examining the data dependencies of an executing query to prune input partitions. Experiments involving a holistic function show run-time improvements of up to 8x, with drastic reductions in I/O, both locally and over the network.
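
The partition-pruning idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the SciHadoop implementation (which is a Hadoop plugin); the `Slab` type, the `prune_partitions` function, and the array shapes are all hypothetical and chosen only for illustration. The sketch assumes that both the query and each input partition are described as hyper-rectangles in logical array coordinates, so that a partition can be skipped when it does not overlap the query region.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Hypothetical type: a hyper-rectangle in logical array coordinates,
# described by its starting corner and its extent in each dimension.
@dataclass(frozen=True)
class Slab:
    corner: Tuple[int, ...]
    shape: Tuple[int, ...]

    def intersects(self, other: "Slab") -> bool:
        # Two slabs overlap only if their extents overlap in every dimension.
        return all(
            c1 < c2 + s2 and c2 < c1 + s1
            for c1, s1, c2, s2 in zip(
                self.corner, self.shape, other.corner, other.shape
            )
        )


def prune_partitions(partitions: List[Slab], query: Slab) -> List[Slab]:
    """Keep only the input partitions that the query region touches."""
    return [p for p in partitions if p.intersects(query)]


# Assumed layout: a (time, lat, lon) variable split along time
# into four partitions of ten time steps each.
partitions = [Slab((t, 0, 0), (10, 180, 360)) for t in range(0, 40, 10)]

# Assumed query: time steps 12..26 over a 90 x 90 spatial window.
query = Slab((12, 0, 0), (15, 90, 90))

selected = prune_partitions(partitions, query)
print([p.corner[0] for p in selected])
```

Under these assumptions only the partitions starting at time steps 10 and 20 would be read; the other two are never scanned, which is the kind of unnecessary read the abstract describes avoiding.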