Query optimization using column statistics in Hive

  • Authors:
  • Anja Gruenheid;Edward Omiecinski;Leo Mark

  • Affiliations:
  • Georgia Institute of Technology;Georgia Institute of Technology;Georgia Institute of Technology

  • Venue:
  • Proceedings of the 15th Symposium on International Database Engineering & Applications
  • Year:
  • 2011

Abstract

Hive is a data warehousing solution built on top of the Hadoop MapReduce framework. It is designed to handle large amounts of data and to store them in tables, like a relational database management system or a conventional data warehouse, while using the parallelization and batch processing capabilities of Hadoop MapReduce to speed up query execution. Data inserted into Hive is stored in the Hadoop Distributed File System (HDFS), which is part of the Hadoop MapReduce framework. To make the data accessible to the user, Hive provides an SQL-like query language called HiveQL. When a query is issued in HiveQL, a parser translates it into a query execution plan, which is optimized and then turned into a series of map and reduce iterations. These iterations are executed on the data stored in HDFS, and the output is written to a file. The goal of this work is to develop an approach for improving the performance of HiveQL queries executed in the Hive framework. For that purpose, we introduce an extension to the Hive MetaStore that stores metadata extracted at the column level of the user database. These column-level statistics are then used, for example, in combination with join ordering algorithms adapted to the specific needs of the Hadoop MapReduce environment, to improve the overall performance of HiveQL query execution.
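
To illustrate how column-level statistics can feed a join ordering decision, the following is a minimal sketch in Python. It is not the algorithm from the paper and does not use any Hive API. The TableStats class, the textbook equi-join size estimate, the assumption that all tables join on one shared key, and the exhaustive search over left-deep plans are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class TableStats:
    """Hypothetical statistics for one table and its join column."""
    name: str
    rows: int       # number of rows in the table
    distinct: int   # number of distinct values in the join column


def join_size(left_rows, left_distinct, right):
    """Textbook equi-join estimate: |L| * |R| / max(V(L), V(R))."""
    return left_rows * right.rows / max(left_distinct, right.distinct)


def plan_cost(order):
    """Sum of the estimated intermediate result sizes of a left-deep plan."""
    rows, distinct = order[0].rows, order[0].distinct
    cost = 0.0
    for table in order[1:]:
        rows = join_size(rows, distinct, table)
        # Assumption: the join keeps at most the smaller set of key values.
        distinct = min(distinct, table.distinct)
        cost += rows
    return cost


def best_order(tables):
    """Try every left-deep order and return the cheapest one.

    Exhaustive search is only feasible for a handful of tables.
    """
    return min(permutations(tables), key=plan_cost)


if __name__ == "__main__":
    tables = [
        TableStats("orders", 1_000_000, 10_000),
        TableStats("customers", 10_000, 10_000),
        TableStats("vip", 100, 100),
    ]
    order = best_order(tables)
    print([t.name for t in order], plan_cost(order))
```

In this sketch, the cost is the total estimated size of all intermediate results, which serves as a rough proxy for the data written between map and reduce iterations. A cost model suited to MapReduce would likely weigh other factors as well, which this example does not attempt to capture.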