Comparing SQL and MapReduce to compute Naive Bayes in a single table scan

  • Authors: Sasi K. Pitchaimalai; Carlos Ordonez; Carlos Garcia-Alvarado

  • Affiliations: University of Houston, Houston, TX, USA (all three authors)

  • Venue: CloudDB '10: Proceedings of the Second International Workshop on Cloud Data Management
  • Year: 2010

Abstract

Most data mining processing is currently performed on flat files outside the DBMS. We propose novel techniques to process such data mining computations inside the DBMS. We focus on the popular Naive Bayes classification algorithm. In contrast to most approaches, our techniques work completely inside the DBMS, exploiting DBMS programmability mechanisms that give the user full access to the data while keeping the DBMS internals hidden. Specifically, SQL queries and User-Defined Functions (UDFs) are used to program the Naive Bayes algorithm. We compare these mechanisms with MapReduce, a popular alternative used for large-scale data mining. We study two phases for the classifier: building the model, and scoring another data set using the model as input. Both the building and scoring phases with SQL queries involve a single table scan, whereas scoring with UDFs involves two additional scans on large temporary tables. Experiments with large data sets demonstrate that SQL queries perform extremely well for building the model, while UDFs are better for scoring. In both cases the DBMS performs better than MapReduce. Moreover, the DBMS loads data significantly more efficiently than the file system supporting MapReduce.
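
To make the two phases concrete, the following is a minimal sketch of how a Naive Bayes model might be built from a single aggregate SQL query and then used for scoring. It is not the authors' implementation. It assumes a Gaussian model for numeric features, uses SQLite (via Python's standard sqlite3 module) as a stand-in DBMS, and the table name `points`, the columns `x1`, `x2`, `g`, and the functions `build_model` and `score` are all hypothetical names chosen for illustration.

```python
import math
import sqlite3


def build_model(conn):
    """Build per-class Gaussian Naive Bayes parameters from one GROUP BY query.

    The query gathers, per class g, the count, the sum and the sum of squares
    of each feature. These sufficient statistics are enough to derive the
    class priors, means and variances without reading the table again.
    """
    rows = conn.execute(
        "SELECT g, COUNT(*), SUM(x1), SUM(x1*x1), SUM(x2), SUM(x2*x2) "
        "FROM points GROUP BY g"
    ).fetchall()
    total = sum(r[1] for r in rows)
    model = {}
    for g, n, s1, q1, s2, q2 in rows:
        stats = []
        for s, q in ((s1, q1), (s2, q2)):
            mean = s / n
            # Population variance from sums; the floor avoids division by zero.
            var = max(q / n - mean * mean, 1e-9)
            stats.append((mean, var))
        model[g] = (n / total, stats)  # (prior, [(mean, var) per feature])
    return model


def score(model, x):
    """Return the class with the highest log-posterior for feature vector x."""
    best, best_lp = None, -math.inf
    for g, (prior, stats) in model.items():
        lp = math.log(prior)
        for xi, (mean, var) in zip(x, stats):
            lp += -0.5 * math.log(2 * math.pi * var) - (xi - mean) ** 2 / (2 * var)
        if lp > best_lp:
            best, best_lp = g, lp
    return best


# Tiny illustrative data set (made up for this sketch): two features, two classes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x1 REAL, x2 REAL, g INTEGER)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [(1.0, 2.0, 0), (1.2, 1.8, 0), (0.8, 2.2, 0),
     (5.0, 6.0, 1), (5.5, 5.5, 1), (4.5, 6.5, 1)],
)
model = build_model(conn)
```

Calling `score(model, (1.0, 2.0))` then picks the class whose Gaussian parameters best explain the point. In this sketch the scoring loop runs in Python for readability; an in-DBMS variant would express the same arithmetic in SQL or in a UDF, which is the trade-off the abstract compares.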