Scalable Model-Based Clustering for Large Databases Based on Data Summarization

Authors:
Huidong Jin; Man-Leung Wong; K.-S. Leung
Affiliations:
IEEE;IEEE;IEEE
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
2005

Abstract

The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources such as memory and computation time. In this paper, two scalable clustering algorithms, bEMADS and gEMADS, are presented based on the Gaussian mixture model. Both summarize data into subclusters and then generate Gaussian mixtures from their data summaries. Their core algorithm, EMADS, is defined on data summaries and approximates the aggregate behavior of each subcluster of data under the Gaussian mixture model. EMADS is provably convergent. Experimental results substantiate that both algorithms can run several orders of magnitude faster than expectation-maximization with little loss of accuracy.
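The idea of working on data summaries rather than raw records can be illustrated with a minimal sketch. It is not the paper's bEMADS, gEMADS, or EMADS procedure: the fixed-width grid, the function names `summarize` and `pooled_moments`, and the sample points are all assumptions made for illustration. Each subcluster is reduced to a count, a mean vector, and a covariance matrix, and the sketch then checks that the overall mean and covariance of the data can be recovered from those summaries alone, which is the kind of aggregate information a mixture-fitting step defined on summaries could use.

```python
from collections import defaultdict


def summarize(points, cell_width):
    """Bucket points into axis-aligned grid cells (an assumed, simplistic
    summarization scheme) and return one (count, mean, covariance) triple
    per non-empty cell."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(x // cell_width) for x in p)
        cells[key].append(p)

    summaries = []
    for members in cells.values():
        n = len(members)
        d = len(members[0])
        mean = [sum(p[i] for p in members) / n for i in range(d)]
        # Within-cell covariance, using the divide-by-n convention.
        cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in members) / n
                for j in range(d)] for i in range(d)]
        summaries.append((n, mean, cov))
    return summaries


def pooled_moments(summaries):
    """Recover the overall count, mean, and covariance from summaries only,
    without revisiting the raw points."""
    total = sum(n for n, _, _ in summaries)
    d = len(summaries[0][1])
    mean = [sum(n * m[i] for n, m, _ in summaries) / total for i in range(d)]
    # Overall covariance = weighted within-cell covariance
    #                    + weighted spread of the cell means.
    cov = [[sum(n * (c[i][j] + (m[i] - mean[i]) * (m[j] - mean[j]))
                for n, m, c in summaries) / total
            for j in range(d)] for i in range(d)]
    return total, mean, cov


# Hypothetical sample data: two dense regions plus one isolated point.
points = [(0.1, 0.2), (0.4, 0.3), (0.2, 0.9),
          (5.1, 5.0), (5.3, 4.8), (9.7, 0.2)]
summaries = summarize(points, cell_width=1.0)
total, mean, cov = pooled_moments(summaries)
```

Only `summaries` needs to stay in memory after the single pass over `points`, so the memory cost depends on the number of subclusters rather than on the number of records. The grid width controls that trade-off: wider cells give fewer summaries but a coarser picture of the data.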