Split query processing in Polybase

Authors:
David J. DeWitt;Alan Halverson;Rimma Nehme;Srinath Shankar;Josep Aguilar-Saborit;Artin Avanes;Miro Flasza;Jim Gramling
Affiliations:
Microsoft Corporation, Madison, WI, USA;Microsoft Corporation, Madison, WI, USA;Microsoft Corporation, Madison, WI, USA;Microsoft Corporation, Madison, WI, USA;Microsoft Corporation, Aliso Viejo, CA, USA;Microsoft Corporation, Aliso Viejo, CA, USA;Microsoft Corporation, Aliso Viejo, CA, USA;Microsoft Corporation, Aliso Viejo, CA, USA
Venue:
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Year:
2013

Abstract

This paper presents Polybase, a feature of SQL Server PDW V2 that allows users to manage and query data stored in a Hadoop cluster using the standard SQL query language. Unlike other database systems that provide only a relational view over HDFS-resident data through the use of an external table mechanism, Polybase employs a split query processing paradigm in which SQL operators on HDFS-resident data are translated into MapReduce jobs by the PDW query optimizer and then executed on the Hadoop cluster. The paper describes the design and implementation of Polybase along with a thorough performance evaluation that explores the benefits of employing a split query processing paradigm for executing queries that involve both structured data in a relational DBMS and unstructured data in Hadoop. Our results demonstrate that while the use of a split-based query execution paradigm can improve the performance of some queries by as much as 10X, one must employ a cost-based query optimizer that considers a broad set of factors when deciding whether or not it is advantageous to push a SQL operator to Hadoop. These factors include the selectivity factor of the predicate, the relative sizes of the two clusters, and whether or not their nodes are co-located. In addition, differences in the semantics of the Java and SQL languages must be carefully considered in order to avoid altering the expected results of a query.
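
The push-down decision described in the abstract can be illustrated with a toy cost comparison. The following is a minimal sketch, not the PDW optimizer's actual cost model: the class `PushdownInputs`, the functions `cost_pushdown`, `cost_import`, and `should_push_down`, and every throughput constant are hypothetical values chosen only to show how predicate selectivity, relative cluster sizes, and co-location could each tip the decision.

```python
from dataclasses import dataclass

# All constants below are illustrative assumptions, not measurements from the paper.
HADOOP_SCAN_BYTES_PER_SEC_PER_NODE = 100e6   # assumed MapReduce scan/filter rate
PDW_SCAN_BYTES_PER_SEC_PER_NODE = 200e6      # assumed relational scan/filter rate
REMOTE_IMPORT_BYTES_PER_SEC_PER_NODE = 50e6  # assumed import rate, separate clusters
LOCAL_IMPORT_BYTES_PER_SEC_PER_NODE = 400e6  # assumed import rate, co-located nodes
MAPREDUCE_STARTUP_SEC = 30.0                 # assumed fixed job start-up overhead


@dataclass
class PushdownInputs:
    hdfs_bytes: float    # size of the HDFS-resident table
    selectivity: float   # fraction of the data that survives the predicate (0..1)
    hadoop_nodes: int    # size of the Hadoop cluster
    pdw_nodes: int       # size of the relational (PDW) cluster
    colocated: bool      # whether the two clusters share nodes


def import_rate(inp: PushdownInputs) -> float:
    """Aggregate rate at which HDFS data can be moved into the relational side."""
    per_node = (LOCAL_IMPORT_BYTES_PER_SEC_PER_NODE if inp.colocated
                else REMOTE_IMPORT_BYTES_PER_SEC_PER_NODE)
    return per_node * inp.pdw_nodes


def cost_pushdown(inp: PushdownInputs) -> float:
    """Estimated seconds to filter in Hadoop, then import only the surviving data."""
    scan = inp.hdfs_bytes / (HADOOP_SCAN_BYTES_PER_SEC_PER_NODE * inp.hadoop_nodes)
    move = inp.hdfs_bytes * inp.selectivity / import_rate(inp)
    return MAPREDUCE_STARTUP_SEC + scan + move


def cost_import(inp: PushdownInputs) -> float:
    """Estimated seconds to import everything, then filter on the relational side."""
    move = inp.hdfs_bytes / import_rate(inp)
    scan = inp.hdfs_bytes / (PDW_SCAN_BYTES_PER_SEC_PER_NODE * inp.pdw_nodes)
    return move + scan


def should_push_down(inp: PushdownInputs) -> bool:
    """Push the operator to Hadoop only if the estimate says it is cheaper."""
    return cost_pushdown(inp) < cost_import(inp)


if __name__ == "__main__":
    scenarios = {
        "selective predicate, equal clusters": PushdownInputs(1e12, 0.01, 16, 16, False),
        "unselective predicate, small Hadoop cluster": PushdownInputs(1e12, 0.9, 4, 16, False),
        "selective predicate, co-located nodes": PushdownInputs(1e12, 0.01, 16, 16, True),
    }
    for name, inp in scenarios.items():
        print(name, "-> push down" if should_push_down(inp) else "-> import first")
```

Under these assumed constants, the same predicate can be worth pushing down in one configuration and not in another, which is the reason a fixed rule is not enough and a cost-based choice is needed. The sketch ignores the semantic differences between Java and SQL that the abstract also warns about.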