BIRCH: an efficient data clustering method for very large databases
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Integrating association rule mining with relational database systems: alternatives and implications
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
NonStop SQL/MX primitives for knowledge discovery
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Finding generalized projected clusters in high dimensional spaces
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
SQLEM: fast clustering in SQL using the EM algorithm
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
SQL database primitives for decision tree classifiers
Proceedings of the tenth international conference on Information and knowledge management
FREM: fast and robust EM clustering for large data sets
Proceedings of the eleventh international conference on Information and knowledge management
Ad Hoc Association Rule Mining as SQL3 Queries
ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
Clustering binary data streams with K-means
DMKD '03 Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Clustering gene expression data in SQL using locally adaptive metrics
DMKD '03 Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Efficient Disk-Based K-Means Clustering for Relational Databases
IEEE Transactions on Knowledge and Data Engineering
Integrating K-Means Clustering with a Relational DBMS Using SQL
IEEE Transactions on Knowledge and Data Engineering
Vector and matrix operations programmed with UDFs in a relational DBMS
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
I/O scalable Bregman co-clustering
PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
Data mining using relational database management systems
PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Hi-index | 0.00 |
Using SQL has not been considered an efficient and feasible way to implement data mining algorithms. Although this is true for many data mining, machine learning and statistical algorithms, this work shows it is feasible to get an efficient SQL implementation of the well-known K-means clustering algorithm that can work on top of a relational DBMS. The article emphasizes both correctness and performance. From a correctness point of view the article explains how to compute Euclidean distance, nearest-cluster queries and updating clustering results in SQL. From a performance point of view it is explained how to cluster large data sets defining and indexing tables to store and retrieve intermediate and final results, optimizing and avoiding joins, optimizing and simplifying clustering aggregations, and taking advantage of sufficient statistics. Experiments evaluate scalability with synthetic data sets varying size and dimensionality. The proposed K-means implementation can cluster large data sets and exhibits linear scalability.