Most data mining processing is currently performed on flat files outside the DBMS. We propose novel techniques to process such data mining computations inside the DBMS, focusing on the popular Naive Bayes classification algorithm. In contrast to most approaches, our techniques work entirely inside the DBMS, exploiting DBMS programmability mechanisms that give the user full access to the data without exposing the DBMS internals. Specifically, SQL queries and User-Defined Functions (UDFs) are used to program the Naive Bayes algorithm. We compare these mechanisms with MapReduce, a popular alternative for large-scale data mining. We study two phases of the classifier: building the model, and scoring another data set using the model as input. With SQL queries, both the building and scoring phases involve a single table scan, whereas scoring with UDFs involves two additional scans on large temporary tables. Experiments with large data sets demonstrate that SQL queries perform extremely well for building the model, while UDFs are better for scoring. In both cases the DBMS performs better than MapReduce. Moreover, the DBMS loads data significantly more efficiently than the file system supporting MapReduce.
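The two phases described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Gaussian Naive Bayes model over two numeric attributes, uses Python's built-in SQLite as a stand-in DBMS, and the table names (`train`, `test`), column names (`x1`, `x2`, `g`), and the UDF name `nb_score` are all hypothetical. The build phase gathers per-class sufficient statistics with one aggregate query, and the scoring phase calls a scalar UDF from SQL.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical training table: two numeric attributes and a class label g.
conn.execute("CREATE TABLE train (x1 REAL, x2 REAL, g TEXT)")
conn.executemany(
    "INSERT INTO train VALUES (?, ?, ?)",
    [
        (1.0, 2.0, "a"), (1.2, 1.8, "a"), (0.8, 2.2, "a"),
        (4.0, 5.0, "b"), (4.2, 5.1, "b"), (3.8, 4.9, "b"),
    ],
)

# Build phase: a single aggregate query over the training table collects,
# per class, the row count, the sums, and the sums of squares.
rows = conn.execute(
    """
    SELECT g, COUNT(*), SUM(x1), SUM(x1 * x1), SUM(x2), SUM(x2 * x2)
    FROM train
    GROUP BY g
    """
).fetchall()

total = sum(r[1] for r in rows)
model = {}
for g, n, s1, q1, s2, q2 in rows:
    means = (s1 / n, s2 / n)
    # Population variance from sums and sums of squares; the small floor
    # (an assumption of this sketch) avoids division by zero.
    variances = (
        max(q1 / n - means[0] ** 2, 1e-9),
        max(q2 / n - means[1] ** 2, 1e-9),
    )
    model[g] = (n / total, means, variances)


def nb_score(x1, x2):
    """Return the class with the highest log-posterior for one row."""
    best_class, best_lp = None, -math.inf
    for g, (prior, means, variances) in model.items():
        lp = math.log(prior)
        for x, m, v in zip((x1, x2), means, variances):
            lp += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if lp > best_lp:
            best_class, best_lp = g, lp
    return best_class


# Scoring phase: register the function as a scalar UDF and call it from SQL.
conn.create_function("nb_score", 2, nb_score)

conn.execute("CREATE TABLE test (x1 REAL, x2 REAL)")
conn.executemany("INSERT INTO test VALUES (?, ?)", [(1.1, 2.0), (4.1, 5.0)])

predictions = [
    row[0]
    for row in conn.execute("SELECT nb_score(x1, x2) FROM test ORDER BY rowid")
]
print(predictions)
```

In this sketch the model is held in a Python dictionary for brevity; a DBMS-resident variant would store it in a table and join it against the data set being scored.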