Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte-stream data model causes inefficiencies when processing scientific data, which is commonly stored in highly structured, array-based binary file formats, and this limits the scalability of Hadoop applications in science. We introduce Sci-Hadoop, a Hadoop plugin that allows scientists to specify logical queries over array-based data models. Sci-Hadoop executes queries as map/reduce programs defined over the logical data model. We describe the implementation of a Sci-Hadoop prototype for NetCDF data sets and quantify the performance of five separate optimizations for several representative aggregate queries. The optimizations address three goals: reducing total data transfers, reducing remote reads, and reducing unnecessary reads. Two optimizations allow holistic aggregate queries to be evaluated opportunistically during the map phase; two additional optimizations intelligently partition input data to increase read locality; and one optimization avoids block scans by examining the data dependencies of an executing query to prune input partitions. Experiments involving a holistic function show run-time improvements of up to 8x, with drastic reductions in I/O, both locally and over the network.
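The partition-pruning idea can be illustrated with a minimal sketch. It is not the Sci-Hadoop implementation: the names `Box`, `overlaps`, and `prune_partitions` are hypothetical, and the sketch assumes that both the query and each input partition can be described as a hyper-rectangle of half-open index ranges over the logical array. Under that assumption, a partition whose ranges do not intersect the query in every dimension holds no data the query depends on, so it never needs to be read.

```python
from typing import List, Tuple

# Hypothetical representation: one half-open (start, stop) index range
# per array dimension.
Box = List[Tuple[int, int]]


def overlaps(a: Box, b: Box) -> bool:
    """Return True when two boxes share at least one cell in every dimension."""
    return all(a_lo < b_hi and b_lo < a_hi
               for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b))


def prune_partitions(query: Box, partitions: List[Box]) -> List[Box]:
    """Keep only the partitions that hold cells the query depends on."""
    return [p for p in partitions if overlaps(query, p)]


# Illustrative data: a 100 x 100 array split into four 50 x 50 partitions.
partitions = [
    [(0, 50), (0, 50)],
    [(0, 50), (50, 100)],
    [(50, 100), (0, 50)],
    [(50, 100), (50, 100)],
]

# A query over rows 10..39 and columns 20..59 touches only the top two partitions.
query = [(10, 40), (20, 60)]
kept = prune_partitions(query, partitions)
```

In this illustration the two partitions covering rows 50 to 99 are dropped before any map task is scheduled for them, which is the kind of unnecessary read the abstract refers to.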