For the classification of very large data sets with a mixture model approach, a two-step strategy for estimating the mixture is proposed. In the first step, the data are scaled down using compression techniques: the single observations are clustered into a medium number of groups, and each group is represented by a prototype, i.e. a triple of sufficient statistics (mean vector, covariance matrix, number of observations compressed). In the second step, the mixture is estimated by applying an adapted EM algorithm (called sufficient EM) to the sufficient statistics of the compressed data. The estimated mixture allows observations to be classified according to their maximum posterior probability of component membership. The performance of sufficient EM in clustering a real data set from a web-usage mining application is compared with that of standard EM and of the TwoStep clustering algorithm as implemented in SPSS. The sufficient EM algorithm turns out to be much more efficient than standard EM. The TwoStep algorithm is even faster, but its results show a lack of stability.
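The two-step strategy can be illustrated with a minimal sketch. It is a univariate simplification that uses only the Python standard library, so each prototype is a triple (mean, variance, count) rather than (mean vector, covariance matrix, count). The function names `compress`, `sufficient_em` and `classify`, the equal-size binning used for compression, the synthetic data, and the exact form of the E-step are assumptions made for illustration; they are not taken from the paper and may differ from the authors' sufficient EM.

```python
import math
import random


def compress(data, n_groups):
    """Step 1 (assumed scheme): sort 1-D data, cut it into equal-size bins,
    and keep only the sufficient statistics (mean, variance, count) per bin."""
    xs = sorted(data)
    size = math.ceil(len(xs) / n_groups)
    groups = []
    for i in range(0, len(xs), size):
        chunk = xs[i:i + size]
        n = len(chunk)
        m = sum(chunk) / n
        v = sum((x - m) ** 2 for x in chunk) / n
        groups.append((m, v, n))
    return groups


def log_score(m, v, weight, mean, var):
    """Log weight plus the average Gaussian log-density of a group with
    mean m and within-group variance v (v = 0 for a single observation)."""
    return (math.log(weight) - 0.5 * math.log(2 * math.pi * var)
            - 0.5 * ((m - mean) ** 2 + v) / var)


def sufficient_em(groups, init_means, n_iter=50):
    """Step 2: EM for a univariate Gaussian mixture, run on the
    (mean, variance, count) triples instead of the raw observations."""
    k = len(init_means)
    total = sum(n for _, _, n in groups)
    means = list(init_means)
    weights = [1.0 / k] * k
    # Start every component at the overall variance, recovered from the triples.
    grand = sum(n * m for m, _, n in groups) / total
    var0 = sum(n * (v + (m - grand) ** 2) for m, v, n in groups) / total
    variances = [var0] * k
    for _ in range(n_iter):
        # E-step: posterior membership probabilities, one row per group.
        resp = []
        for m, v, _ in groups:
            logp = [log_score(m, v, weights[j], means[j], variances[j])
                    for j in range(k)]
            top = max(logp)
            p = [math.exp(lp - top) for lp in logp]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: count-weighted updates; the within-group variance v
        # restores the spread that was lost when the group was compressed.
        for j in range(k):
            nj = max(sum(r[j] * n for r, (_, _, n) in zip(resp, groups)), 1e-12)
            means[j] = sum(r[j] * n * m for r, (m, _, n) in zip(resp, groups)) / nj
            variances[j] = max(
                sum(r[j] * n * (v + (m - means[j]) ** 2)
                    for r, (m, v, n) in zip(resp, groups)) / nj,
                1e-9,
            )
            weights[j] = max(nj / total, 1e-12)
    return weights, means, variances


def classify(x, weights, means, variances):
    """Assign x to the component with maximum posterior probability."""
    scores = [log_score(x, 0.0, w, mu, var)
              for w, mu, var in zip(weights, means, variances)]
    return scores.index(max(scores))


# Usage on synthetic data: two well-separated Gaussian clusters.
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(500)]
        + [random.gauss(10.0, 1.0) for _ in range(500)])
groups = compress(data, 20)  # 1000 observations become 20 triples
weights, means, variances = sufficient_em(groups, [groups[0][0], groups[-1][0]])
label = classify(9.5, weights, means, variances)
```

In this sketch each EM iteration loops over the 20 prototypes instead of the 1000 observations, which is where the efficiency gain described in the abstract would come from. In the multivariate case the scalar variance would be replaced by a covariance matrix and the squared deviation by an outer product.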