SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions

Authors: Eric Friedman, Peter Pawlowski, John Cieslewicz
Affiliation: Aster Data Systems
Venue: Proceedings of the VLDB Endowment
Year: 2009

Abstract

A user-defined function (UDF) is a powerful database feature that allows users to customize database functionality. Though useful, present UDFs have numerous limitations, including install-time specification of input and output schema and poor ability to parallelize execution. We present a new approach to implementing a UDF, which we call SQL/MapReduce (SQL/MR), that overcomes many of these limitations. We leverage ideas from the MapReduce programming paradigm to provide users with a straightforward API through which they can implement a UDF in the language of their choice. Moreover, our approach allows maximum flexibility as the output schema of the UDF is specified by the function itself at query plan-time. This means that a SQL/MR function is polymorphic. It can process arbitrary input because its behavior as well as output schema are dynamically determined by information available at query plan-time, such as the function's input schema and arbitrary user-provided parameters. This also increases reusability as the same SQL/MR function can be used on inputs with many different schemas or with different user-specified parameters. In this paper we describe the motivation for this new approach to UDFs as well as the implementation within Aster Data Systems' nCluster database. We demonstrate that in the context of massively parallel, shared-nothing database systems, this model of computation facilitates highly scalable computation within the database. We also include examples of new applications that take advantage of this novel UDF framework.
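The polymorphism described in the abstract can be illustrated with a minimal sketch. It is not the nCluster API: the class `TopValuePerKey`, the method `operate_on_partition`, the driver `run`, and the parameters `partition_by` and `order_by` are all hypothetical names chosen for illustration. The sketch assumes a function object that is constructed at "plan time" from the input schema and user parameters, computes its own output schema there, and is later invoked once per partition. Partitions are independent of each other, so a real system could hand them to different workers; here they are processed sequentially.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Schema = List[Tuple[str, str]]  # (column name, type name) pairs
Row = Tuple


class TopValuePerKey:
    """Hypothetical polymorphic function: for each partition, emit the row
    with the largest value in a user-chosen column, plus the partition size."""

    def __init__(self, input_schema: Schema, params: Dict[str, str]):
        # "Plan time": inspect the input schema and the user parameters.
        names = [name for name, _ in input_schema]
        self.order_idx = names.index(params["order_by"])
        # The output schema is derived here rather than fixed at install time.
        self.output_schema: Schema = list(input_schema) + [("partition_rows", "int")]

    def operate_on_partition(self, rows: Iterable[Row]) -> Iterator[Row]:
        # "Run time": called once per partition, similar to a reduce step.
        rows = list(rows)
        best = max(rows, key=lambda r: r[self.order_idx])
        yield best + (len(rows),)


def run(function_cls, input_schema: Schema, rows: List[Row],
        partition_by: str, **params: str):
    """Toy driver: plan the function, partition the input, apply per partition."""
    fn = function_cls(input_schema, params)
    key_idx = [name for name, _ in input_schema].index(partition_by)
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_idx]].append(row)
    out: List[Row] = []
    for key in sorted(partitions):  # each partition could run on its own worker
        out.extend(fn.operate_on_partition(partitions[key]))
    return fn.output_schema, out


# The same function applied to two inputs with different schemas.
clicks_schema = [("user_id", "int"), ("ts", "int"), ("page", "str")]
clicks = [(1, 10, "a"), (1, 30, "b"), (2, 5, "c")]
clicks_out_schema, clicks_out = run(
    TopValuePerKey, clicks_schema, clicks, partition_by="user_id", order_by="ts")

sales_schema = [("region", "str"), ("amount", "float")]
sales = [("east", 5.0), ("west", 7.5), ("east", 9.0)]
sales_out_schema, sales_out = run(
    TopValuePerKey, sales_schema, sales, partition_by="region", order_by="amount")
```

In this sketch the function never hard-codes column names or counts. The three-column click data and the two-column sales data both pass through the same class, and each call reports an output schema derived from its own input, which is the reusability property the abstract describes.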