Most research on data mining has proposed algorithms and optimizations that work on flat files, outside a DBMS, for three main reasons. First, it is easier to develop efficient algorithms in a traditional programming language. Second, integrating data mining algorithms into a DBMS is difficult given its relational model foundation and system architecture. Third, SQL may be slow and cumbersome for numerical analysis computations. As a result, data mining users commonly export data sets from the DBMS for processing, which creates a performance bottleneck and forfeits important data management capabilities such as query processing, security, concurrency control, and fault tolerance. With that motivation, we developed a novel system based on SQL queries and User-Defined Functions (UDFs) that analyzes relational tables directly to compute statistical models and stores those models as relational tables as well. Most algorithms have been optimized to reduce the number of passes over the data set. Our system can analyze large, high-dimensional data sets faster than external data mining tools.
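To make the idea of computing a model inside the database in a single pass more concrete, the following is a minimal sketch and not the system described above. It assumes Python's built-in `sqlite3` module as a stand-in DBMS, and the table names `X` and `MODEL`, the column names, and the function `fit_gaussian_in_sql` are all hypothetical. One aggregate query gathers the count, the linear sums, and the quadratic sums, from which a mean and covariance are derived and written back as a relational table.

```python
import sqlite3


def fit_gaussian_in_sql(conn):
    """Compute the mean and covariance of X(x1, x2) with one table scan.

    Illustrative only: X and MODEL are hypothetical table names.
    """
    # A single aggregate query collects n, the linear sums, and the
    # quadratic sums, so the data set is read once.
    n, l1, l2, q11, q12, q22 = conn.execute(
        "SELECT COUNT(*), SUM(x1), SUM(x2), "
        "SUM(x1*x1), SUM(x1*x2), SUM(x2*x2) FROM X"
    ).fetchone()

    mu1, mu2 = l1 / n, l2 / n
    # Population covariance derived from the summaries alone.
    model = [
        ("mean_x1", mu1),
        ("mean_x2", mu2),
        ("var_x1", q11 / n - mu1 * mu1),
        ("cov_x1_x2", q12 / n - mu1 * mu2),
        ("var_x2", q22 / n - mu2 * mu2),
    ]

    # Store the model as a relational table next to the data.
    conn.execute("DROP TABLE IF EXISTS MODEL")
    conn.execute("CREATE TABLE MODEL (param TEXT PRIMARY KEY, value REAL)")
    conn.executemany("INSERT INTO MODEL VALUES (?, ?)", model)
    return dict(model)


# Tiny in-memory data set used only for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
conn.executemany(
    "INSERT INTO X VALUES (?, ?, ?)",
    [(1, 1.0, 2.0), (2, 2.0, 4.0), (3, 3.0, 6.0)],
)
model = fit_gaussian_in_sql(conn)
stored = dict(conn.execute("SELECT param, value FROM MODEL").fetchall())
```

Because the summaries are plain sums, the same query shape extends to more columns without additional scans. A production DBMS would more likely compute them inside an aggregate UDF, which this sketch does not attempt.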