Bayesian Classifiers Programmed in SQL

Authors:
Carlos Ordonez; Sasi K. Pitchaimalai
Affiliations:
University of Houston, Houston; University of Houston, Houston
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2010

Abstract

The Bayesian classifier is a fundamental classification technique. In this work, we focus on programming Bayesian classifiers in SQL. We introduce two classifiers: Naive Bayes and a classifier based on class decomposition using K-means clustering. We consider two complementary tasks: model computation and scoring a data set. We study several layouts for tables and several indexing alternatives. We analyze how to transform equations into efficient SQL queries and introduce several query optimizations. We conduct experiments with real and synthetic data sets to evaluate classification accuracy, query optimizations, and scalability. Our Bayesian classifier is more accurate than Naive Bayes and decision trees. Distance computation is significantly accelerated with horizontal layout for tables, denormalization, and pivoting. We also compare Naive Bayes implementations in SQL and C++: SQL is about four times slower. Our Bayesian classifier in SQL achieves high classification accuracy, can efficiently analyze large data sets, and has linear scalability.
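
To make the two tasks named in the abstract concrete, the following is a minimal sketch of Naive Bayes model computation and scoring expressed as SQL queries, driven from Python's standard sqlite3 module. It is not the authors' code: the table names X and NB, the column names, the toy data, the Gaussian assumption for each dimension, and the helper function py_ln are all assumptions made for illustration. Model computation uses a single GROUP BY pass that collects the count, sum, and sum of squares per class; scoring cross joins the data table with the small model table, which loosely mirrors the horizontal, denormalized layout the abstract mentions.

```python
import math
import sqlite3

# Hypothetical toy data: (point id, x1, x2, class label g).
ROWS = [
    (1, 1.0, 2.0, "a"), (2, 1.2, 1.8, "a"), (3, 0.8, 2.2, "a"),
    (4, 4.0, 5.0, "b"), (5, 4.2, 5.2, "b"), (6, 3.8, 4.8, "b"),
]

def make_db():
    conn = sqlite3.connect(":memory:")
    # Register a natural-log helper, since not every SQLite build ships math functions.
    conn.create_function("py_ln", 1, math.log)
    conn.execute("CREATE TABLE X (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL, g TEXT)")
    conn.executemany("INSERT INTO X VALUES (?, ?, ?, ?)", ROWS)
    return conn

def train_naive_bayes(conn):
    # Model computation: one scan of X yields prior, mean, and variance per class.
    # Variance is computed as E[x^2] - E[x]^2 from the aggregated sums.
    conn.execute("""
        CREATE TABLE NB AS
        SELECT g,
               COUNT(*) * 1.0 / (SELECT COUNT(*) FROM X) AS prior,
               SUM(x1) / COUNT(*) AS mu1,
               SUM(x2) / COUNT(*) AS mu2,
               SUM(x1 * x1) / COUNT(*)
                   - (SUM(x1) / COUNT(*)) * (SUM(x1) / COUNT(*)) AS var1,
               SUM(x2 * x2) / COUNT(*)
                   - (SUM(x2) / COUNT(*)) * (SUM(x2) / COUNT(*)) AS var2
        FROM X
        GROUP BY g
    """)

def score(conn):
    # Scoring: log-posterior of every point under every class (up to a constant
    # shared by all classes); the best class per point is picked in Python to
    # keep the SQL portable.
    cur = conn.execute("""
        SELECT X.i, NB.g,
               py_ln(NB.prior)
               - 0.5 * py_ln(NB.var1)
               - (X.x1 - NB.mu1) * (X.x1 - NB.mu1) / (2 * NB.var1)
               - 0.5 * py_ln(NB.var2)
               - (X.x2 - NB.mu2) * (X.x2 - NB.mu2) / (2 * NB.var2) AS score
        FROM X CROSS JOIN NB
    """)
    best = {}
    for i, g, s in cur:
        if i not in best or s > best[i][1]:
            best[i] = (g, s)
    return {i: g for i, (g, _) in best.items()}

conn = make_db()
train_naive_bayes(conn)
predictions = score(conn)
```

In this sketch the model table NB holds one row per class, so the cross join in the scoring query stays small relative to X. A class whose dimension has zero variance would break the log and division terms; a real implementation would need to guard against that case.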