Integrating K-Means Clustering with a Relational DBMS Using SQL

Authors:
Carlos Ordonez
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2006

Abstract

Integrating data mining algorithms with a relational DBMS is an important problem for database programmers. We introduce three SQL implementations of the popular K-means clustering algorithm to integrate it with a relational DBMS: 1) a straightforward translation of K-means computations into SQL, 2) an optimized version based on improved data organization, efficient indexing, sufficient statistics, and rewritten queries, and 3) an incremental version that uses the optimized version as a building block with fast convergence and automated reseeding. We experimentally show that the proposed K-means implementations work correctly and can cluster large data sets. We identify which K-means computations are most critical for performance. The optimized and incremental K-means implementations exhibit linear scalability. We compare K-means implementations in SQL and C++ with respect to speed and scalability, and we also study the time to export data sets outside of the DBMS. Experiments show that SQL overhead is significant for small data sets but relatively low for large data sets, whereas export times become a bottleneck for C++.
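
To make the idea of expressing K-means in SQL concrete, the following is a minimal sketch and not the queries from the paper. It assumes two-dimensional points, uses SQLite through Python's standard sqlite3 module as a stand-in DBMS, and the table names Y, C, YD, YNN, and NLQ, as well as the function kmeans_sql, are hypothetical names chosen for illustration. Each iteration computes point-to-centroid distances with a cross join, assigns each point to its nearest centroid, and then updates the centroids from per-cluster sufficient statistics (count N, linear sum L, quadratic sum Q).

```python
import sqlite3


def kmeans_sql(points, seeds, iterations=5):
    """Run K-means on 2-D points where every computation is a SQL statement.

    Assumptions for this sketch: in-memory SQLite, a fixed number of
    iterations instead of a convergence test, and illustrative table names.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Y (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
    con.execute("CREATE TABLE C (j INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
    con.executemany("INSERT INTO Y VALUES (?, ?, ?)",
                    [(i, p[0], p[1]) for i, p in enumerate(points, start=1)])
    con.executemany("INSERT INTO C VALUES (?, ?, ?)",
                    [(j, s[0], s[1]) for j, s in enumerate(seeds, start=1)])

    for _ in range(iterations):
        # Squared Euclidean distance from every point to every centroid.
        con.execute("DROP TABLE IF EXISTS YD")
        con.execute("""
            CREATE TABLE YD AS
            SELECT Y.i AS i, C.j AS j,
                   (Y.x1 - C.x1) * (Y.x1 - C.x1)
                 + (Y.x2 - C.x2) * (Y.x2 - C.x2) AS d
            FROM Y CROSS JOIN C""")

        # Nearest centroid per point; ties go to the smallest cluster id.
        con.execute("DROP TABLE IF EXISTS YNN")
        con.execute("""
            CREATE TABLE YNN AS
            SELECT YD.i AS i, MIN(YD.j) AS j
            FROM YD
            JOIN (SELECT i, MIN(d) AS dmin FROM YD GROUP BY i) AS M
              ON YD.i = M.i AND YD.d = M.dmin
            GROUP BY YD.i""")

        # Sufficient statistics per cluster: N, linear sums, quadratic sums.
        con.execute("DROP TABLE IF EXISTS NLQ")
        con.execute("""
            CREATE TABLE NLQ AS
            SELECT YNN.j AS j, COUNT(*) AS N,
                   SUM(Y.x1) AS L1, SUM(Y.x2) AS L2,
                   SUM(Y.x1 * Y.x1) AS Q1, SUM(Y.x2 * Y.x2) AS Q2
            FROM Y JOIN YNN ON Y.i = YNN.i
            GROUP BY YNN.j""")

        # New centroid = L / N; clusters that received no points keep theirs.
        con.execute("""
            UPDATE C
            SET x1 = (SELECT CAST(L1 AS REAL) / N FROM NLQ WHERE NLQ.j = C.j),
                x2 = (SELECT CAST(L2 AS REAL) / N FROM NLQ WHERE NLQ.j = C.j)
            WHERE j IN (SELECT j FROM NLQ)""")

    centroids = con.execute("SELECT j, x1, x2 FROM C ORDER BY j").fetchall()
    con.close()
    return centroids
```

Calling kmeans_sql with a list of (x1, x2) tuples and a list of seed centroids returns one (j, x1, x2) row per cluster. Keeping N, L, and Q in a small table means the centroid update never rescans the points, and per-cluster variances could be derived from the same table as Q/N - (L/N)^2; the quadratic sums are computed here only to illustrate that.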