A data mining system based on SQL queries and UDFs for relational databases

Authors:
Carlos Ordonez; Carlos Garcia-Alvarado
Affiliations:
University of Houston, Houston, TX, USA; University of Houston, Houston, TX, USA
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

Most research on data mining has proposed algorithms and optimizations that work on flat files outside a DBMS, mainly for three reasons. First, it is easier to develop efficient algorithms in a traditional programming language. Second, integrating data mining algorithms into a DBMS is difficult given its relational model foundation and system architecture. Third, SQL may be slow and cumbersome for numerical analysis computations. As a result, data mining users commonly export data sets from the DBMS for processing, which creates a performance bottleneck and gives up important data management capabilities such as query processing and security, as well as concurrency control and fault tolerance. With that motivation, we developed a novel system based on SQL queries and User-Defined Functions (UDFs) that directly analyzes relational tables to compute statistical models and stores those models as relational tables as well. Most algorithms have been optimized to reduce the number of passes over the data set. Our system can analyze large, high-dimensional data sets faster than external data mining tools.
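
To make the idea of in-DBMS model computation concrete, the following is a minimal sketch, not the authors' system. It assumes SQLite (through Python's standard sqlite3 module) in place of a production DBMS, plain SQL aggregates in place of UDFs, and hypothetical names (the table points, the columns x1 and x2, the output table model, and the function fit_gaussian_in_dbms). It gathers the count, the per-column sums, and the per-column sums of squares in a single scan, derives each column's mean and population variance from them, and stores the result as a relational table.

```python
import sqlite3


def fit_gaussian_in_dbms(conn, table, columns):
    """Compute per-column mean and population variance in one table scan.

    Hypothetical helper for illustration; assumes the columns hold no NULLs
    and that table and column names come from trusted code, not user input.
    """
    # A single SELECT collects n, sum(x) and sum(x*x) for every column,
    # so the data set is read only once.
    aggregates = ", ".join(
        f"SUM({c}) AS l_{c}, SUM({c} * {c}) AS q_{c}" for c in columns
    )
    row = conn.execute(
        f"SELECT COUNT(*) AS n, {aggregates} FROM {table}"
    ).fetchone()
    n = row[0]

    # The model is stored as a relational table, next to the data.
    conn.execute("DROP TABLE IF EXISTS model")
    conn.execute(
        "CREATE TABLE model (col TEXT PRIMARY KEY, mean REAL, variance REAL)"
    )
    for i, c in enumerate(columns):
        linear_sum, quadratic_sum = row[1 + 2 * i], row[2 + 2 * i]
        mean = linear_sum / n
        variance = quadratic_sum / n - mean * mean  # population variance
        conn.execute("INSERT INTO model VALUES (?, ?, ?)", (c, mean, variance))
    conn.commit()
    return n


# Small in-memory example with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
conn.executemany(
    "INSERT INTO points (x1, x2) VALUES (?, ?)",
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
)
n = fit_gaussian_in_dbms(conn, "points", ["x1", "x2"])
model = conn.execute("SELECT col, mean, variance FROM model ORDER BY col").fetchall()
```

Because the summaries are additive, the data never leaves the database and one pass suffices regardless of table size; the model table can then be queried or joined like any other relation.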