An optimization framework for map-reduce queries

  • Authors:
  • Leonidas Fegaras;Chengkai Li;Upa Gupta

  • Affiliations:
  • University of Texas at Arlington, CSE Arlington, TX;University of Texas at Arlington, CSE Arlington, TX;University of Texas at Arlington, CSE Arlington, TX

  • Venue:
  • Proceedings of the 15th International Conference on Extending Database Technology
  • Year:
  • 2012

Abstract

We present an effective optimization framework for general SQL-like map-reduce queries, which is based on a novel query algebra and uses a small number of higher-order physical operators that are directly implementable on existing map-reduce systems, such as Hadoop. Although our framework is applicable to any SQL-like map-reduce query language, we focus on a powerful query language called MRQL. Current map-reduce query languages, such as HiveQL and PigLatin, enable users to plug custom map-reduce scripts into queries for those jobs that cannot be declaratively coded in the query language, which may result in suboptimal, error-prone, and hard-to-maintain code. In contrast to these languages, MRQL is expressive enough to capture most of these computations in declarative form and, at the same time, is amenable to optimization. We describe an optimization framework that maps the algebraic forms derived from MRQL queries to efficient workflows of map-reduce operations that consist of our physical plan operators. We also describe many algebraic optimizations, such as fusing cascading map-reduce jobs into one job and synthesizing a combine function from the reduce function of a map-reduce job. Finally, we report on a prototype system implementation and show some performance results of evaluating MRQL queries on a small cluster of computers.
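
To give a feel for one of the optimizations named in the abstract, the sketch below illustrates why a combine function derived from the reduce function can help. It is a toy in-memory simulation written for this illustration, not MRQL's algebra or its synthesis procedure: the names `map_reduce`, `word_mapper`, and `sum_reducer` are hypothetical, and the example assumes the reduce function is an associative and commutative aggregation (a sum), in which case the reducer itself can serve as the combiner.

```python
from collections import defaultdict


def map_reduce(partitions, mapper, reducer, combiner=None):
    """Toy in-memory map-reduce (hypothetical helper, not a Hadoop API).

    Returns the reduced result and the number of (key, value) pairs
    that would be shuffled from the map side to the reduce side.
    """
    shuffled = defaultdict(list)
    shuffled_count = 0
    for part in partitions:
        # Map phase: group the mapper output locally, per partition.
        local = defaultdict(list)
        for record in part:
            for key, value in mapper(record):
                local[key].append(value)
        # Optional combine phase: pre-aggregate before the shuffle.
        for key, values in local.items():
            out = [combiner(key, values)] if combiner else values
            shuffled[key].extend(out)
            shuffled_count += len(out)
    # Reduce phase: aggregate all values that arrived for each key.
    result = {key: reducer(key, values) for key, values in shuffled.items()}
    return result, shuffled_count


def word_mapper(line):
    # Emit (word, 1) for every word in the input line.
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    # Associative and commutative, so it is safe to reuse as a combiner.
    return sum(values)


# Two partitions stand in for two map tasks.
partitions = [["a b a", "b a"], ["a c", "c c a"]]

plain, plain_shuffled = map_reduce(partitions, word_mapper, sum_reducer)
combined, combined_shuffled = map_reduce(
    partitions, word_mapper, sum_reducer, combiner=sum_reducer
)

print(plain == combined)                  # same answer either way
print(plain_shuffled, combined_shuffled)  # fewer pairs cross the shuffle
```

In this toy run both plans produce the same counts, while the plan with the combiner shuffles 4 pairs instead of 10. A reducer that is not decomposable in this way (for example, a median) could not be reused as a combiner unchanged.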