Classification of large data sets with mixture models via sufficient EM

Authors:
P. M. Steiner; M. Hudec
Affiliations:
Institute for Advanced Studies, Stumpergasse 56, A-1060 Vienna, Austria; Institute for Scientific Computing, University of Vienna, A-1010 Vienna, Austria
Venue:
Computational Statistics & Data Analysis
Year:
2007

Abstract

For the classification of very large data sets with a mixture model approach, a two-step strategy for the estimation of the mixture is proposed. In the first step, data are scaled down using compression techniques. Data compression consists of clustering the single observations into a medium number of groups and representing each group by a prototype, i.e. a triple of sufficient statistics (mean vector, covariance matrix, number of observations compressed). In the second step, the mixture is estimated by applying an adapted EM algorithm (called sufficient EM) to the sufficient statistics of the compressed data. The estimated mixture allows the classification of observations according to their maximum posterior probability of component membership. The performance of sufficient EM in clustering a real data set from a web-usage mining application is compared to standard EM and the TwoStep clustering algorithm as implemented in SPSS. It turns out that the algorithmic efficiency of the sufficient EM algorithm is much higher than that of standard EM. While the TwoStep algorithm is even faster, its results show a lack of stability.
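
The two-step strategy described in the abstract can be illustrated with a short Python sketch. It is not the authors' implementation: the abstract does not give the update equations, so the sketch assumes a Gaussian mixture, assumes that all observations compressed into one group belong to the same component, and weights each group by its observation count in the M-step. The function names (`compress`, `group_loglik`, `sufficient_em`), the `init_mu` and `reg` parameters, and the small ridge term added to the covariances are all hypothetical choices made for illustration.

```python
import numpy as np

def compress(X, labels):
    """Step 1: reduce each group to the triple (mean, covariance, count)."""
    means, covs, counts = [], [], []
    for g in np.unique(labels):
        Xg = X[labels == g]
        m = Xg.mean(axis=0)
        C = Xg - m
        means.append(m)
        covs.append(C.T @ C / len(Xg))  # ML covariance; all zeros for a singleton
        counts.append(len(Xg))
    return np.array(means), np.array(covs), np.array(counts, dtype=float)

def group_loglik(means, covs, counts, mu, sigma):
    """Log-likelihood of each whole group under a single Gaussian N(mu, sigma)."""
    d = means.shape[1]
    inv = np.linalg.inv(sigma)
    _, logdet = np.linalg.slogdet(sigma)
    diff = means - mu
    maha = np.einsum("gi,ij,gj->g", diff, inv, diff)
    trace = np.einsum("ij,gji->g", inv, covs)  # tr(inv @ S_g) for every group
    return -0.5 * counts * (d * np.log(2 * np.pi) + logdet + trace + maha)

def sufficient_em(means, covs, counts, init_mu, n_iter=50, reg=1e-6):
    """Step 2: EM that only touches the sufficient statistics (assumed variant)."""
    d = means.shape[1]
    K = len(init_mu)
    mu = np.array(init_mu, dtype=float)
    total = counts.sum()
    # Start every component from the pooled covariance of the compressed data.
    grand = (counts[:, None] * means).sum(axis=0) / total
    dev = means - grand
    scatter = covs + dev[:, :, None] * dev[:, None, :]
    pooled = (counts[:, None, None] * scatter).sum(axis=0) / total
    sigma = np.array([pooled + reg * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior component membership of each group (in log space).
        log_r = np.column_stack([
            np.log(pi[k] + 1e-300)
            + group_loglik(means, covs, counts, mu[k], sigma[k])
            for k in range(K)
        ])
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: every group contributes with weight count * responsibility.
        w = counts[:, None] * r
        nk = np.maximum(w.sum(axis=0), 1e-12)
        pi = nk / total
        for k in range(K):
            mu[k] = (w[:, k, None] * means).sum(axis=0) / nk[k]
            dev = means - mu[k]
            scatter = covs + dev[:, :, None] * dev[:, None, :]
            sigma[k] = (w[:, k, None, None] * scatter).sum(axis=0) / nk[k]
            sigma[k] += reg * np.eye(d)  # ridge keeps sigma invertible
    return pi, mu, sigma, r
```

Under these assumptions, each iteration costs time proportional to the number of groups rather than the number of observations, which is the source of the efficiency gain the abstract reports. The returned `r` holds group-level posterior probabilities; `r.argmax(axis=1)` assigns each group to its most probable component, and single observations could be classified in the same way by evaluating the fitted component densities on them.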