Vector and matrix operations programmed with UDFs in a relational DBMS

Authors: Carlos Ordonez; Javier García-García
Affiliations: University of Houston, Houston, TX; UNAM University, Mexico City, Mexico
Venue: CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Year: 2006

Abstract

In general, a relational DBMS provides limited capabilities to perform multidimensional statistical analysis, which requires manipulating vectors and matrices. In this work, we study how to extend a DBMS with basic vector and matrix operators by programming User-Defined Functions (UDFs). We carefully analyze UDF features and limitations to implement vector and matrix operations commonly used in statistics, machine learning and data mining, paying attention to DBMS, operating system and computer architecture constraints. UDFs provide a C programming interface for defining scalar and aggregate functions that can be called from SQL. They have both advantages and limitations. A UDF allows fast evaluation of arithmetic expressions, memory manipulation, the use of multidimensional arrays and the full set of C control statements. Nevertheless, a UDF cannot perform disk I/O, the amount of heap and stack memory it can allocate is small, and its code must take into account specific architecture characteristics of the DBMS. We experimentally compare UDFs and SQL with respect to performance, ease of use, flexibility and scalability. We profile UDFs based on call overhead, memory management and interleaved disk access. We show UDFs are faster than standard SQL aggregations and as fast as SQL arithmetic expressions.
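
To make the scalar versus aggregate distinction concrete, the following is a minimal sketch in plain C of the kind of logic such UDFs might contain. It is not the authors' code and does not use any real DBMS UDF API: the names `vec_dot`, `agg_state`, `agg_init`, `agg_step` and the cap `MAX_D` are hypothetical, and the choice of accumulating a vector sum and a sum of outer products is only an illustrative assumption. The fixed-size state with no heap allocation and no I/O reflects the memory and disk constraints described in the abstract.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cap on dimensionality: UDF memory is small, so the
   aggregate state below is a fixed-size struct with no heap use. */
#define MAX_D 8

/* Scalar-UDF style logic: one call per row, returns one value.
   Computes the dot product of two d-dimensional vectors. */
double vec_dot(const double *x, const double *y, size_t d) {
    double s = 0.0;
    for (size_t i = 0; i < d; i++) {
        s += x[i] * y[i];
    }
    return s;
}

/* Aggregate-UDF style state (assumed layout, for illustration only). */
typedef struct {
    size_t d;                 /* dimensionality of each input vector */
    double n;                 /* number of rows seen so far */
    double L[MAX_D];          /* running sum of the input vectors */
    double Q[MAX_D][MAX_D];   /* running sum of outer products x * x^T */
} agg_state;

/* Called once before the first row: zero the state. */
void agg_init(agg_state *s, size_t d) {
    assert(d <= MAX_D);
    *s = (agg_state){0};
    s->d = d;
}

/* Called once per row: fold one vector into the running sums. */
void agg_step(agg_state *s, const double *x) {
    s->n += 1.0;
    for (size_t i = 0; i < s->d; i++) {
        s->L[i] += x[i];
        for (size_t j = 0; j < s->d; j++) {
            s->Q[i][j] += x[i] * x[j];
        }
    }
}
```

In a real DBMS these functions would be wrapped in whatever registration and argument-passing conventions that system's UDF interface requires; those details vary by product and are deliberately left out here.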