Classification of large data sets with mixture models via sufficient EM

  • Authors: P. M. Steiner; M. Hudec
  • Affiliations: Institute for Advanced Studies, Stumpergasse 56, A-1060 Vienna, Austria; Institute for Scientific Computing, University of Vienna, A-1010 Vienna, Austria
  • Venue: Computational Statistics & Data Analysis
  • Year: 2007

Abstract

For the classification of very large data sets with a mixture model approach, a two-step strategy for estimating the mixture is proposed. In the first step, the data are scaled down using compression techniques. Data compression consists of clustering the single observations into a medium number of groups and representing each group by a prototype, i.e. a triple of sufficient statistics (mean vector, covariance matrix, number of observations compressed). In the second step, the mixture is estimated by applying an adapted EM algorithm (called sufficient EM) to the sufficient statistics of the compressed data. The estimated mixture allows observations to be classified according to their maximum posterior probability of component membership. The performance of sufficient EM in clustering a real data set from a web-usage mining application is compared with that of standard EM and the TwoStep clustering algorithm as implemented in SPSS. The algorithmic efficiency of sufficient EM turns out to be much higher than that of standard EM. Although the TwoStep algorithm is even faster, its results show a lack of stability.
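The two-step idea can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several simplifying assumptions: the data are univariate, so each prototype is a (mean, variance, count) triple rather than a mean vector and covariance matrix; the group labels from the compression step are taken as given; and each group's responsibility is computed from its average per-observation Gaussian log density, which is one plausible choice and may differ from the formulation in the paper. All function names (`compress`, `sufficient_em`, `classify`) are hypothetical.

```python
import math

def compress(values, labels):
    """Step 1: collapse observations into (mean, variance, count) triples, one per group."""
    groups = {}
    for x, g in zip(values, labels):
        groups.setdefault(g, []).append(x)
    protos = []
    for g in sorted(groups):
        xs = groups[g]
        n = len(xs)
        m = sum(xs) / n
        s = sum((x - m) ** 2 for x in xs) / n  # ML (biased) within-group variance
        protos.append((m, s, n))
    return protos

def avg_log_density(m, s, mu, var):
    """Average per-observation Gaussian log density of a group with mean m, variance s."""
    return -0.5 * (math.log(2 * math.pi * var) + (s + (m - mu) ** 2) / var)

def responsibilities(protos, weights, mus, variances):
    """E-step: one posterior vector per prototype (assumed shared by its observations)."""
    resp = []
    for m, s, _ in protos:
        logs = [math.log(w) + avg_log_density(m, s, mu, v)
                for w, mu, v in zip(weights, mus, variances)]
        top = max(logs)  # subtract the maximum for numerical stability
        ex = [math.exp(l - top) for l in logs]
        total = sum(ex)
        resp.append([e / total for e in ex])
    return resp

def sufficient_em(protos, weights, mus, variances, n_iter=50, floor=1e-12):
    """Step 2: EM that touches only the prototypes, never the raw observations."""
    n_total = sum(n for _, _, n in protos)
    for _ in range(n_iter):
        resp = responsibilities(protos, weights, mus, variances)
        new_w, new_mu, new_var = [], [], []
        for k in range(len(mus)):
            # Effective number of observations assigned to component k.
            nk = max(sum(r[k] * n for r, (_, _, n) in zip(resp, protos)), floor)
            mu = sum(r[k] * n * m for r, (m, _, n) in zip(resp, protos)) / nk
            # Within-group variance plus between-group spread around the new mean.
            var = sum(r[k] * n * (s + (m - mu) ** 2)
                      for r, (m, s, n) in zip(resp, protos)) / nk
            new_w.append(max(nk / n_total, floor))
            new_mu.append(mu)
            new_var.append(max(var, floor))
        weights, mus, variances = new_w, new_mu, new_var
    return weights, mus, variances

def classify(x, weights, mus, variances):
    """Assign a single observation to the component with maximum posterior probability."""
    scores = [math.log(w) + avg_log_density(x, 0.0, mu, v)
              for w, mu, v in zip(weights, mus, variances)]
    return scores.index(max(scores))
```

In this sketch the cost of each EM iteration depends on the number of prototypes rather than the number of original observations, which is the source of the efficiency gain described in the abstract. The M-step adds each group's within-group variance back in, so that spread lost by replacing observations with their group mean is still accounted for.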