Comparing SQL and MapReduce to compute Naive Bayes in a single table scan

Authors:
Sasi K. Pitchaimalai;Carlos Ordonez;Carlos Garcia-Alvarado
Affiliations:
University of Houston, Houston, TX, USA;University of Houston, Houston, TX, USA;University of Houston, Houston, TX, USA
Venue:
CloudDB '10 Proceedings of the second international workshop on Cloud data management
Year:
2010

Abstract

Most data mining processing is currently performed on flat files outside the DBMS. We propose novel techniques to process such data mining computations inside the DBMS, focusing on the popular Naive Bayes classification algorithm. In contrast to most approaches, our techniques work entirely inside the DBMS, exploiting DBMS programmability mechanisms that give the user full access to the data while hiding the DBMS internals. Specifically, SQL queries and User-Defined Functions (UDFs) are used to program the Naive Bayes algorithm. We compare these mechanisms with MapReduce, a popular alternative for large-scale data mining. We study two phases of the classifier: building the model, and scoring another data set using the model as input. With SQL queries, both the building and scoring phases involve a single table scan, whereas scoring with UDFs involves two additional scans on large temporary tables. Experiments with large data sets show that SQL queries perform extremely well for building the model, while UDFs are better for scoring. In both cases the DBMS performs better than MapReduce. Moreover, the DBMS loads data significantly more efficiently than the file system supporting MapReduce.
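
The idea of building the model in a single table scan can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the queries, so the table `X`, its columns `x1`, `x2`, `g`, the helper names, and the Gaussian treatment of numeric attributes are all assumptions made for illustration. One GROUP BY query gathers per-class counts, sums, and sums of squares in one pass, and the model is derived from those statistics. For brevity, scoring is done here in Python rather than in SQL or a UDF.

```python
import math
import sqlite3

def build_model(conn):
    """One GROUP BY query (one scan of X) yields per-class priors, means, variances."""
    rows = conn.execute(
        "SELECT g, COUNT(*), SUM(x1), SUM(x1*x1), SUM(x2), SUM(x2*x2) "
        "FROM X GROUP BY g"
    ).fetchall()
    total = sum(r[1] for r in rows)
    model = {}
    for g, n, l1, q1, l2, q2 in rows:
        means = (l1 / n, l2 / n)
        # Variance from the sums: Q/n - (L/n)^2, floored to avoid division by zero.
        variances = tuple(
            max(q / n - m * m, 1e-9) for q, m in ((q1, means[0]), (q2, means[1]))
        )
        model[g] = (n / total, means, variances)
    return model

def score(model, point):
    """Return the class with the highest Gaussian Naive Bayes log-posterior."""
    def log_posterior(g):
        prior, means, variances = model[g]
        lp = math.log(prior)
        for x, m, v in zip(point, means, variances):
            lp += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return lp
    return max(model, key=log_posterior)

# Hypothetical toy data: two numeric attributes and a class label g.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (x1 REAL, x2 REAL, g INTEGER)")
conn.executemany("INSERT INTO X VALUES (?, ?, ?)", [
    (1.0, 1.0, 0), (1.2, 0.8, 0), (0.8, 1.2, 0),
    (5.0, 5.0, 1), (5.2, 4.8, 1), (4.8, 5.2, 1),
])
model = build_model(conn)
```

Calling `score(model, (1.1, 0.9))` assigns the point to the class whose attribute means lie closest under the fitted variances. Because the model depends only on the aggregated counts and sums, the input table does not need to be read again after the one aggregation query.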