HadoopToSQL: A MapReduce Query Optimizer

  • Authors: Ming-Yee Iu; Willy Zwaenepoel
  • Affiliations: EPFL, Lausanne, Switzerland (both authors)
  • Venue: Proceedings of the 5th European Conference on Computer Systems
  • Year: 2010

Abstract

MapReduce is a cost-effective way to achieve scalable performance for many log-processing workloads. These workloads typically process their entire dataset. MapReduce can be inefficient, however, when handling business-oriented workloads, especially when these workloads access only a subset of the data. HadoopToSQL seeks to improve MapReduce performance for the latter class of workloads by transforming MapReduce queries to use the indexing, aggregation and grouping features provided by SQL databases. It statically analyzes the computation performed by the MapReduce queries, using symbolic execution to derive preconditions and postconditions for the map and reduce functions. It then uses this information either to generate input restrictions, which avoid scanning the entire dataset, or to generate equivalent SQL queries, which take advantage of SQL grouping and aggregation features. We evaluate MapReduce queries optimized by HadoopToSQL in both single-node and cluster experiments. In these experiments, HadoopToSQL always improves performance over MapReduce and approximates that of hand-written SQL.
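
The idea of replacing a MapReduce job with an equivalent SQL query can be illustrated with a small, hand-written sketch. It is not HadoopToSQL's analysis or its generated output: the `orders` table, the sample rows, and the function names `map_fn`, `reduce_fn`, `run_mapreduce`, and `run_sql` are all hypothetical, and SQLite stands in for the SQL database. The map function keeps only records that satisfy a condition on the input (the kind of precondition an input restriction would capture), and the reduce function sums values per key, which corresponds to a `WHERE` clause plus `GROUP BY` with `SUM` in SQL.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer, region, amount). Not taken from the paper.
ORDERS = [
    ("alice", "EU", 120),
    ("bob", "US", 80),
    ("alice", "EU", 30),
    ("carol", "EU", 200),
    ("bob", "EU", 50),
]


def map_fn(record):
    """Emit (customer, amount) for EU orders only; other records emit nothing."""
    customer, region, amount = record
    if region == "EU":  # condition on the input: region = 'EU'
        yield customer, amount


def reduce_fn(key, values):
    """Sum all amounts emitted for one customer."""
    return key, sum(values)


def run_mapreduce(records):
    """Naive single-process MapReduce: every record is scanned by map_fn."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


def run_sql(records):
    """Hand-written SQL equivalent: the filter becomes WHERE, the reduce becomes GROUP BY + SUM."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount INTEGER)")
    # An index on the filtered column lets the database avoid a full scan.
    conn.execute("CREATE INDEX idx_orders_region ON orders (region)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "WHERE region = 'EU' GROUP BY customer"
    ).fetchall()
    conn.close()
    return dict(rows)


mr_result = run_mapreduce(ORDERS)
sql_result = run_sql(ORDERS)
```

Both functions compute the same per-customer totals. The difference is that the MapReduce version must inspect every record, while the SQL version can use the index on `region` to touch only the matching subset, which is the class of workload the abstract targets.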