Efficiently support MapReduce-like computation models inside parallel DBMS

Authors:
Qiming Chen;Andy Therber;Meichun Hsu;Hans Zeller;Bin Zhang;Ren Wu
Affiliations:
HP Labs, Palo Alto, California;HP TSG SW NED, Cupertino, California;HP Labs, Palo Alto, California;HP TSG SW NED, Cupertino, California;HP Labs, Palo Alto, California;HP Labs, Palo Alto, California
Venue:
IDEAS '09 Proceedings of the 2009 International Database Engineering & Applications Symposium
Year:
2009

Citing 18
Cited 8

Nested relation based database knowledge representation

SIGMOD '91 Proceedings of the 1991 ACM SIGMOD international conference on Management of data
A Teradata content-based multimedia object manager for massively parallel architectures

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Volcano— An Extensible and Parallel Query Evaluation System

IEEE Transactions on Knowledge and Data Engineering
Block Oriented Processing of Relational Database Operations in Modern Computer Architectures

Proceedings of the 17th International Conference on Data Engineering
Plan-Per-Tuple Optimization Solution - Parallel Execution of Expensive User-Defined Functions

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
User-Defined Table Operators: Enhancing Extensibility for ORDBMS

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
A Transactional Model for Long-Running Activities

VLDB '91 Proceedings of the 17th International Conference on Very Large Data Bases
Inter-Enterprise Collaborative Business Process Management

Proceedings of the 17th International Conference on Data Engineering
Buffering databse operations for enhanced instruction cache performance

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Scientific data management in the coming decade

ACM SIGMOD Record
Experiences with MapReduce, an abstraction for large-scale computation

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Vector and matrix operations programmed with UDFs in a relational DBMS

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Map-reduce-merge: simplified relational data processing on large clusters

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Clustera: an integrated computation and data management system

Proceedings of the VLDB Endowment
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
PNUTS: Yahoo!'s hosted data serving platform

Proceedings of the VLDB Endowment

Generalized UDF for analytics inside database engine

WAIM'10 Proceedings of the 11th international conference on Web-age information management
Scale out parallel and distributed CDR stream analytics

Globe'10 Proceedings of the Third international conference on Data management in grid and peer-to-peer systems
Integrating MapReduce and RDBMSs

Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research
Data stream analytics as cloud service for mobile applications

OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems: Part II
Continuous mapreduce for In-DB stream analytics

OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems
Experience in Continuous analytics as a Service (CaaaS)

Proceedings of the 14th International Conference on Extending Database Technology
Column-oriented storage techniques for MapReduce

Proceedings of the VLDB Endowment
Extend core UDF framework for GPU-enabled analytical query evaluation

Proceedings of the 15th Symposium on International Database Engineering & Applications

Quantified Score

Hi-index	0.00

Visualization

Abstract

While parallel DBMSs do support large scale parallel query processing on partitioned data, the reach of more general applications relies on User Defined Functions (UDFs). However, the existent UDF technology is insufficient both conceptually and practically. A UDF is not a relation-in, relation-out operator, which restricts its ability to model complex applications defined on a set of tuples rather than on a single one, and to be composed with other relational operators in a query. Further, to interact with the query execution efficiently, a UDF must be coded with complex interactions with DBMS internal data structures and system calls which is often beyond the expertise of an analytics application developer. To solve these problems, we start with wrapping general applications with Relation Valued Functions (RVFs); then based on the notion of invocation patterns, we provide focused system support for efficiently integrating RVF execution into the query processing pipeline. We further distinguish the system responsibility and the user responsibility in RVF development, by separating an RVF into the RVF-Shell for dealing with system interaction, and the user-function for pure application logic, such that the RVF-Shell can be constructed in terms of high-level APIs. These mechanisms enable us to solve the essential problems in supporting MapReduce and other analytics computation models inside a parallel database engine: modeling complex applications, integrating them into query processing, and shielding analytics developers from DBMS internal details. Prototyped on a commercial and proprietary parallel database engine, our experience reveals the practical value of the proposed approaches.