Statistical Model Computation with UDFs

Authors: Carlos Ordonez
Affiliations: University of Houston, Houston
Venue: IEEE Transactions on Knowledge and Data Engineering
Year: 2010

Abstract

Statistical models are generally computed outside a DBMS due to their mathematical complexity. We introduce techniques to efficiently compute fundamental statistical models inside a DBMS exploiting User-Defined Functions (UDFs). Specifically, we study the computation of linear regression, PCA, clustering, and Naive Bayes. Two summary matrices on the data set are mathematically shown to be essential for all models: the linear sum of points and the quadratic sum of cross products of points. We consider two layouts for the input data set: horizontal and vertical. We first introduce efficient SQL queries to compute summary matrices and score the data set. Based on the SQL framework, we introduce UDFs that work in a single table scan: aggregate UDFs to compute summary matrices for all models and a set of primitive scalar UDFs to score data sets. Experiments compare UDFs and SQL queries (running inside the DBMS) with C++ (analyzing exported files). In general, UDFs are faster than SQL queries and not much slower than C++. Considering export times, C++ is slower than UDFs and SQL queries. Statistical models based on precomputed summary matrices are computed in a few seconds. UDFs scale linearly and only require one table scan, highlighting their efficiency.
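
To make the role of the two summary matrices concrete, here is a minimal Python sketch. It is not the paper's UDF code. It assumes a horizontal layout (one point per row) and uses hypothetical names (`summarize`, `mean_and_covariance`). It accumulates the row count n, the linear sum L (the sum of the points), and the quadratic sum Q (the sum of the cross products of each point with itself) in one pass, then derives the mean and the population covariance from those summaries alone, without rescanning the data.

```python
from typing import Iterable, List, Sequence, Tuple

Matrix = List[List[float]]


def summarize(rows: Iterable[Sequence[float]]) -> Tuple[int, List[float], Matrix]:
    """Single pass over the rows: return n, L = sum of x, Q = sum of x * x^T."""
    n = 0
    d = 0
    L: List[float] = []
    Q: Matrix = []
    for x in rows:
        if n == 0:
            # Size the accumulators from the first row (assumed dimensionality d).
            d = len(x)
            L = [0.0] * d
            Q = [[0.0] * d for _ in range(d)]
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q


def mean_and_covariance(n: int, L: List[float], Q: Matrix) -> Tuple[List[float], Matrix]:
    """Derive the mean and population covariance from n, L and Q only."""
    if n == 0:
        raise ValueError("no rows were summarized")
    d = len(L)
    mu = [value / n for value in L]
    # cov[a][b] = Q[a][b] / n - mu[a] * mu[b]
    cov = [[Q[a][b] / n - mu[a] * mu[b] for b in range(d)] for a in range(d)]
    return mu, cov


if __name__ == "__main__":
    # Illustrative data only: two dimensions, three points.
    points = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
    n, L, Q = summarize(points)
    mu, cov = mean_and_covariance(n, L, Q)
    # Simple one-predictor regression of the second dimension on the first,
    # computed from the summaries rather than from the points.
    slope = cov[0][1] / cov[0][0]
    intercept = mu[1] - slope * mu[0]
    print(n, L, Q)
    print(mu, cov)
    print(slope, intercept)
```

Because n, L and Q are plain sums, partial results from disjoint subsets of rows can be added together. That property is what makes a one-scan aggregate a natural fit for this kind of computation.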