Scalable Model-based Clustering by Working on Data Summaries

Authors:
Huidong Jin;Man-Leung Wong;Kwong-Sak Leung
Venue:
ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Year:
2003

Abstract

The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources. In this paper, we present a two-phase scalable model-based clustering framework: first, a large data set is summarized into sub-clusters; then, clusters are generated directly from the summary statistics of the sub-clusters by a specifically designed Expectation-Maximization (EM) algorithm. Taking Gaussian mixture models as an example, we establish a provably convergent EM algorithm, EMADS, which explicitly embodies the cardinality, mean, and covariance information of each sub-cluster. Combined with different data summarization procedures, EMADS is used to construct two clustering systems: gEMADS and bEMADS. The experimental results demonstrate that they run several orders of magnitude faster than the classic EM algorithm with little loss of accuracy. They generate significantly better results than other model-based clustering systems using similar computational resources.
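
The two-phase idea can be illustrated with a minimal one-dimensional sketch. It is not the EMADS algorithm itself, whose update rules are not given in this abstract. It only assumes that each sub-cluster is reduced to its cardinality, mean, and variance, and that a standard Gaussian-mixture EM is run on those summaries. The fixed-width binning in `summarize`, the function names, and the parameters `width`, `mus`, and `iters` are all hypothetical choices made for illustration.

```python
import math


def summarize(points, width):
    """Phase 1 (assumed scheme): bin 1-D points into cells of size `width`
    and keep only (count, mean, variance) for each non-empty cell."""
    cells = {}
    for x in points:
        cells.setdefault(math.floor(x / width), []).append(x)
    summaries = []
    for xs in cells.values():
        n = len(xs)
        m = sum(xs) / n
        v = sum((x - m) ** 2 for x in xs) / n
        summaries.append((n, m, v))
    return summaries


def em_on_summaries(summaries, mus, iters=100):
    """Phase 2 (illustrative): fit a 1-D Gaussian mixture using only the
    summaries. `mus` holds the initial component means."""
    k = len(mus)
    mus = list(mus)
    total = sum(n for n, _, _ in summaries)
    # Start every component at the overall variance, recovered from summaries.
    grand = sum(n * m for n, m, _ in summaries) / total
    var0 = sum(n * (v + (m - grand) ** 2) for n, m, v in summaries) / total
    variances = [max(var0, 1e-9)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: one responsibility vector per sub-cluster, based on the
        # average log-density of its members (mean offset plus own variance).
        resp = []
        for _, m, v in summaries:
            logs = [
                math.log(weights[j])
                - 0.5 * math.log(2 * math.pi * variances[j])
                - 0.5 * ((m - mus[j]) ** 2 + v) / variances[j]
                for j in range(k)
            ]
            top = max(logs)
            exps = [math.exp(val - top) for val in logs]
            z = sum(exps)
            resp.append([e / z for e in exps])
        # M-step: count-weighted updates; within-cell variance is added back.
        for j in range(k):
            nj = max(sum(n * r[j] for (n, _, _), r in zip(summaries, resp)), 1e-12)
            mus[j] = sum(n * r[j] * m for (n, m, _), r in zip(summaries, resp)) / nj
            variances[j] = max(
                sum(n * r[j] * (v + (m - mus[j]) ** 2)
                    for (n, m, v), r in zip(summaries, resp)) / nj,
                1e-9,
            )
            weights[j] = max(nj / total, 1e-12)
    return weights, mus, variances
```

Because the loop in `em_on_summaries` runs over sub-clusters rather than raw records, each iteration costs time proportional to the number of summaries, which is the source of the speed-up the abstract describes. Carrying the within-cell variance into both steps is what lets the fitted variances reflect the spread of the original data rather than only the spread of the cell means.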