Integrating data mining algorithms with a relational DBMS is an important problem for database programmers. We introduce three SQL implementations of the popular K-means clustering algorithm to integrate it with a relational DBMS: 1) a straightforward translation of K-means computations into SQL, 2) an optimized version based on improved data organization, efficient indexing, sufficient statistics, and rewritten queries, and 3) an incremental version that uses the optimized version as a building block with fast convergence and automated reseeding. We experimentally show the proposed K-means implementations work correctly and can cluster large data sets. We identify which K-means computations are more critical for performance. The optimized and incremental K-means implementations exhibit linear scalability. We compare K-means implementations in SQL and C++ with respect to speed and scalability and we also study the time to export data sets outside of the DBMS. Experiments show that SQL overhead is significant for small data sets, but relatively low for large data sets, whereas export times become a bottleneck for C++.
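To make the idea of running K-means through SQL with sufficient statistics concrete, the following is a minimal sketch and not the queries from the paper. It assumes a two-dimensional data set stored one point per row, uses SQLite (through Python's `sqlite3` module) as a stand-in for a relational DBMS, and uses hypothetical table names (`Y` for points, `C` for centroids, `YD` for distances, `YNN` for nearest-centroid assignments, `NL` for the per-cluster count and linear sum). Each iteration computes all point-to-centroid distances with a cross join, picks the nearest centroid per point, aggregates the count and sum per cluster, and recomputes each centroid as sum divided by count.

```python
import sqlite3


def kmeans_sql(points, seeds, iterations=10):
    """Run a fixed number of K-means iterations using SQL statements only.

    points: list of (x1, x2) tuples; seeds: list of initial (x1, x2) centroids.
    All table and column names here are illustrative assumptions.
    """
    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.execute("CREATE TABLE Y (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
    cur.execute("CREATE TABLE C (j INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
    cur.executemany("INSERT INTO Y VALUES (?, ?, ?)",
                    [(i, p[0], p[1]) for i, p in enumerate(points)])
    cur.executemany("INSERT INTO C VALUES (?, ?, ?)",
                    [(j, s[0], s[1]) for j, s in enumerate(seeds)])

    for _ in range(iterations):
        # Squared Euclidean distance from every point to every centroid.
        cur.execute("DROP TABLE IF EXISTS YD")
        cur.execute("""
            CREATE TABLE YD AS
            SELECT Y.i AS i, C.j AS j,
                   (Y.x1 - C.x1) * (Y.x1 - C.x1)
                 + (Y.x2 - C.x2) * (Y.x2 - C.x2) AS dist
            FROM Y CROSS JOIN C""")

        # Nearest centroid per point; ties go to the smallest cluster id.
        cur.execute("DROP TABLE IF EXISTS YNN")
        cur.execute("""
            CREATE TABLE YNN AS
            SELECT YD.i AS i, MIN(YD.j) AS j
            FROM YD
            JOIN (SELECT i, MIN(dist) AS mind FROM YD GROUP BY i) M
              ON YD.i = M.i AND YD.dist = M.mind
            GROUP BY YD.i""")

        # Sufficient statistics per cluster: count n and linear sums l1, l2.
        cur.execute("DROP TABLE IF EXISTS NL")
        cur.execute("""
            CREATE TABLE NL AS
            SELECT YNN.j AS j, COUNT(*) AS n,
                   SUM(Y.x1) AS l1, SUM(Y.x2) AS l2
            FROM Y JOIN YNN ON Y.i = YNN.i
            GROUP BY YNN.j""")

        # New centroid = linear sum / count; empty clusters keep their old value.
        cur.execute("""
            UPDATE C SET
                x1 = (SELECT l1 * 1.0 / n FROM NL WHERE NL.j = C.j),
                x2 = (SELECT l2 * 1.0 / n FROM NL WHERE NL.j = C.j)
            WHERE j IN (SELECT j FROM NL)""")

    centroids = cur.execute("SELECT x1, x2 FROM C ORDER BY j").fetchall()
    labels = [row[0] for row in cur.execute("SELECT j FROM YNN ORDER BY i")]
    con.close()
    return centroids, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans_sql(data, seeds=[(0, 0), (10, 10)], iterations=5))
```

This sketch corresponds most closely to the straightforward translation: it rebuilds the full distance table on every iteration. It omits the indexing and query-rewriting optimizations and the incremental variant with reseeding that the abstract mentions.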