Efficiently support MapReduce-like computation models inside parallel DBMS

  • Authors: Qiming Chen; Andy Therber; Meichun Hsu; Hans Zeller; Bin Zhang; Ren Wu

  • Affiliations: HP Labs, Palo Alto, California; HP TSG SW NED, Cupertino, California; HP Labs, Palo Alto, California; HP TSG SW NED, Cupertino, California; HP Labs, Palo Alto, California; HP Labs, Palo Alto, California

  • Venue: IDEAS '09 Proceedings of the 2009 International Database Engineering & Applications Symposium
  • Year: 2009

Abstract

While parallel DBMSs support large-scale parallel query processing on partitioned data, extending them to more general applications relies on User Defined Functions (UDFs). However, existing UDF technology is insufficient both conceptually and practically. A UDF is not a relation-in, relation-out operator, which restricts its ability to model complex applications defined on a set of tuples rather than on a single tuple, and to be composed with other relational operators in a query. Further, to interact with query execution efficiently, a UDF must be coded with complex interactions with DBMS internal data structures and system calls, which is often beyond the expertise of an analytics application developer. To solve these problems, we start by wrapping general applications with Relation Valued Functions (RVFs); then, based on the notion of invocation patterns, we provide focused system support for efficiently integrating RVF execution into the query processing pipeline. We further distinguish the system responsibility from the user responsibility in RVF development by separating an RVF into the RVF-Shell, which deals with system interaction, and the user-function, which contains pure application logic, such that the RVF-Shell can be constructed in terms of high-level APIs. These mechanisms enable us to solve the essential problems in supporting MapReduce and other analytics computation models inside a parallel database engine: modeling complex applications, integrating them into query processing, and shielding analytics developers from DBMS internal details. We prototyped these mechanisms on a commercial, proprietary parallel database engine, and our experience demonstrates the practical value of the proposed approaches.
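
The separation between the RVF-Shell and the user-function can be illustrated with a minimal Python sketch. It is not the authors' implementation or API: the names `rvf_shell`, `map_words`, and `reduce_counts` are hypothetical, a relation is modeled as a plain list of dictionaries, and the "system interaction" is reduced to buffering input tuples and checking the output shape. The sketch only shows how relation-in, relation-out functions compose into a MapReduce-like pipeline while the application logic stays free of engine details.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

# Assumption: a relation is a list of rows, and a row is a dict of column -> value.
Row = Dict[str, object]
Relation = List[Row]


def rvf_shell(user_fn: Callable[[Relation], Relation]) -> Callable[[Iterable[Row]], Relation]:
    """Hypothetical shell: handles the 'system' side so user_fn sees only relations."""
    def rvf(rows: Iterable[Row]) -> Relation:
        relation = list(rows)          # stand-in for buffering tuples from the query pipeline
        result = user_fn(relation)     # pure application logic, no engine internals
        if not all(isinstance(r, dict) for r in result):
            raise TypeError("user function must return a relation (list of rows)")
        return result
    return rvf


def map_words(rel: Relation) -> Relation:
    """User-function (map side): emit one (word, 1) row per word in each input row."""
    return [{"word": w, "n": 1} for row in rel for w in str(row["text"]).split()]


def reduce_counts(rel: Relation) -> Relation:
    """User-function (reduce side): sum the counts per word."""
    totals: Dict[str, int] = defaultdict(int)
    for row in rel:
        totals[str(row["word"])] += int(row["n"])
    return [{"word": w, "n": n} for w, n in sorted(totals.items())]


# Both functions are relation-in, relation-out, so they compose like query operators.
map_rvf = rvf_shell(map_words)
reduce_rvf = rvf_shell(reduce_counts)

docs = [{"text": "a b a"}, {"text": "b c"}]
result = reduce_rvf(map_rvf(docs))
print(result)
```

In a real parallel engine the shell would also deal with partitioning, memory management, and per-tuple versus per-set invocation; the sketch omits all of that and keeps only the division of responsibility described in the abstract.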