Query optimization using column statistics in hive

Authors:
Anja Gruenheid;Edward Omiecinski;Leo Mark
Affiliations:
Georgia Institute of Technology;Georgia Institute of Technology;Georgia Institute of Technology
Venue:
Proceedings of the 15th Symposium on International Database Engineering & Applications
Year:
2011



Abstract

Hive is a data warehousing solution built on top of the Hadoop MapReduce framework. It is designed to handle large amounts of data and to store them in tables, like a relational database management system or a conventional data warehouse, while using the parallelization and batch processing capabilities of Hadoop MapReduce to speed up query execution. Data inserted into Hive is stored in the Hadoop Distributed File System (HDFS), which is part of the Hadoop framework. To make the data accessible to the user, Hive provides a SQL-like query language called HiveQL. When a query is issued in HiveQL, a parser translates it into a query execution plan, which is optimized and then turned into a series of map and reduce iterations. These iterations are executed on the data stored in HDFS, and the output is written to a file. The goal of this work is to develop an approach for improving the performance of HiveQL queries executed in the Hive framework. For that purpose, we introduce an extension to the Hive MetaStore that stores metadata extracted at the column level of the user database. These column-level statistics are then used, for example, in combination with join ordering algorithms adapted to the specific needs of the Hadoop MapReduce environment, to improve the overall performance of HiveQL query execution.
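
The abstract does not say which statistics are collected or which join ordering algorithm is used, so the following is only a minimal sketch of the general idea, not the authors' method and not a Hive API. It assumes that the per-column statistics are a row count and a number of distinct values, that all tables join on one shared key, and that join sizes are estimated with the textbook formula |R| * |S| / max(ndv(R), ndv(S)). The names `ColumnStats`, `estimate_join_size`, and `greedy_join_order` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical column-level statistics for one table's join column."""
    row_count: int        # number of rows in the table
    distinct_values: int  # number of distinct values in the join column


def estimate_join_size(left_rows: float, left_ndv: int, right: ColumnStats) -> float:
    """Textbook equi-join estimate: |R| * |S| / max(ndv(R.a), ndv(S.a))."""
    return left_rows * right.row_count / max(left_ndv, right.distinct_values, 1)


def greedy_join_order(tables: Dict[str, ColumnStats]) -> Tuple[List[str], float]:
    """Greedily order tables that all join on one shared key.

    Returns the join order and the sum of estimated intermediate result
    sizes, used here as a stand-in for the cost of the extra MapReduce
    iterations (an assumption of this sketch).
    """
    remaining = dict(tables)
    # Start from the table with the fewest rows.
    first = min(remaining, key=lambda name: remaining[name].row_count)
    first_stats = remaining.pop(first)
    order = [first]
    current_rows = float(first_stats.row_count)
    current_ndv = first_stats.distinct_values
    total_cost = 0.0
    while remaining:
        # Pick the table whose join gives the smallest estimated result.
        best = min(
            remaining,
            key=lambda name: estimate_join_size(
                current_rows, current_ndv, remaining[name]
            ),
        )
        best_stats = remaining.pop(best)
        current_rows = estimate_join_size(current_rows, current_ndv, best_stats)
        # After an equi-join the key has at most as many distinct values
        # as the smaller of the two inputs.
        current_ndv = min(current_ndv, best_stats.distinct_values)
        total_cost += current_rows
        order.append(best)
    return order, total_cost
```

With made-up statistics for three tables, `greedy_join_order({"orders": ColumnStats(1_000_000, 50_000), "customers": ColumnStats(50_000, 50_000), "vip": ColumnStats(100, 100)})` joins the small `vip` table first, so the large `orders` table is only joined against an already reduced intermediate result. Without distinct-value counts, the optimizer would have no basis for telling these orders apart.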