An expectation-maximization algorithm working on data summary

Authors:
Huidong Jin; Kwong-Sak Leung; Man-Leung Wong
Affiliations:
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong; Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong; Department of Information Systems, Lingnan College, Tuen Mun, Hong Kong
Venue:
Second International Workshop on Intelligent Systems Design and Application
Year:
2002

Abstract

Scalable cluster analysis addresses the problem of processing large data sets with limited resources, such as memory and computation time. Most scalable algorithms rely on a data summarization or sampling procedure, which forms a compact representation of the data on which traditional clustering algorithms can then run efficiently. However, there is little work on how to perform cluster analysis effectively on the data summaries themselves. In this paper, starting from the principle of the general expectation-maximization algorithm, we propose a model-based clustering algorithm that makes better use of these data summaries. The proposed algorithm, EMACF (Expectation-Maximization Algorithm on Clustering Features), explicitly employs the summary features weight, mean, and variance. We prove that EMACF converges to a local maximum of the likelihood. Its computation time is linear in the number of data summaries rather than in the number of data items, so it can be integrated with any efficient data summarization procedure to construct a scalable clustering algorithm.
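To make the idea of running EM on summaries rather than on data items more concrete, here is a minimal Python sketch. It is not the EMACF update rules from the paper, which the abstract does not spell out; it is an illustration under stated assumptions: a Gaussian mixture with diagonal covariances, each summary given as a (weight, mean, per-dimension variance) triple, and responsibilities computed from the average log-density of the items in a summary. The names `summarize`, `em_on_summaries`, and `mu0` are hypothetical.

```python
import numpy as np


def summarize(points):
    """Collapse a block of data items into (weight, mean, variance)."""
    points = np.asarray(points, dtype=float)
    return len(points), points.mean(axis=0), points.var(axis=0)


def em_on_summaries(n, m, v, mu0, iters=50):
    """EM for a diagonal Gaussian mixture, run on data summaries.

    n:   (S,)   number of items in each summary (weight)
    m:   (S, D) mean of each summary
    v:   (S, D) per-dimension variance inside each summary
    mu0: (K, D) initial component means (assumed to be supplied by the caller)
    """
    n = np.asarray(n, dtype=float)
    m = np.asarray(m, dtype=float)
    v = np.asarray(v, dtype=float)
    mu = np.asarray(mu0, dtype=float).copy()
    k = mu.shape[0]

    pi = np.full(k, 1.0 / k)
    # Start every component from the overall variance of the data set,
    # recovered from the summaries alone.
    gmean = np.average(m, axis=0, weights=n)
    gvar = np.average(v + (m - gmean) ** 2, axis=0, weights=n) + 1e-6
    var = np.tile(gvar, (k, 1))

    for _ in range(iters):
        # E-step (assumption): score each summary by the average Gaussian
        # log-density of its items, which depends only on its mean and variance.
        diff2 = (m[:, None, :] - mu[None, :, :]) ** 2            # (S, K, D)
        avg_ll = -0.5 * np.sum(
            np.log(2.0 * np.pi * var)[None, :, :]
            + (v[:, None, :] + diff2) / var[None, :, :],
            axis=2,
        )                                                        # (S, K)
        log_r = np.log(pi)[None, :] + avg_ll
        log_r -= log_r.max(axis=1, keepdims=True)                # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: every summary counts n[i] times.
        w = n[:, None] * r                                       # (S, K)
        wk = w.sum(axis=0) + 1e-12
        pi = wk / wk.sum()
        mu = (w.T @ m) / wk[:, None]
        diff2 = (m[:, None, :] - mu[None, :, :]) ** 2
        var = np.einsum("sk,skd->kd", w, v[:, None, :] + diff2) / wk[:, None] + 1e-6

    return pi, mu, var


if __name__ == "__main__":
    # Four hypothetical summaries of one-dimensional data.
    n = [50, 50, 40, 60]
    m = [[0.0], [0.4], [10.0], [10.2]]
    v = [[1.0], [1.0], [0.5], [0.5]]
    pi, mu, var = em_on_summaries(n, m, v, mu0=[[1.0], [9.0]])
    print(pi, mu.ravel(), var.ravel())
```

Each iteration touches only the S summaries and K components, so the cost per iteration in this sketch is proportional to S rather than to the number of original data items, which mirrors the linear-time claim in the abstract.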