Accelerated EM-based clustering of large data sets

Authors:
Jakob J. Verbeek; Jan R. Nunnink; Nikos Vlassis
Affiliations:
INRIA Rhone-Alpes, Montbonnot Saint-Martin, France 38330; Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands 1098 SJ; Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands 1098 SJ
Venue:
Data Mining and Knowledge Discovery
Year:
2006

Cited by (7):
A fast algorithm for robust mixtures in the presence of measurement errors. IEEE Transactions on Neural Networks
A fast implementation of the EM algorithm for mixture of multinomials. ADMA'06 Proceedings of the Second international conference on Advanced Data Mining and Applications
State of the art in photon density estimation. ACM SIGGRAPH 2012 Courses
Progressive expectation-maximization for hierarchical volumetric photon mapping. EGSR'11 Proceedings of the Twenty-second Eurographics conference on Rendering
A novel split-and-merge algorithm for hierarchical clustering of Gaussian mixture models. Applied Intelligence
Approximate Gaussian mixtures for large scale vocabularies. ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part III
A comparative study of novel robust clustering algorithms. Intelligent Data Analysis

Abstract

Motivated by the poor performance (linear complexity) of the EM algorithm in clustering large data sets, and inspired by the successful accelerated versions of related algorithms like k-means, we derive an accelerated variant of the EM algorithm for Gaussian mixtures that: (1) offers speedups that are at least linear in the number of data points, (2) ensures convergence by strictly increasing a lower bound on the data log-likelihood in each learning step, and (3) allows ample freedom in the design of other accelerated variants. We also derive a similar accelerated algorithm for greedy mixture learning, where very satisfactory results are obtained. The core idea is to define a lower bound on the data log-likelihood based on a grouping of data points. The bound is maximized by computing in turn (i) optimal assignments of groups of data points to the mixture components, and (ii) optimal re-estimation of the model parameters based on average sufficient statistics computed over groups of data points. The proposed method naturally generalizes to mixtures of other members of the exponential family. Experimental results show the potential of the proposed method over other state-of-the-art acceleration techniques.
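
The two alternating steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a fixed partition of the data into cells (here a simple grid, chosen only for illustration), full-covariance Gaussian components, and one shared responsibility vector per cell. All function and variable names (`group_stats`, `avg_log_gauss`, `grouped_em_step`, `cell_ids`) are hypothetical. Each cell A is summarized by its size, its mean <x>_A, and its mean outer product <x x^T>_A, so an iteration costs time proportional to the number of cells rather than the number of points.

```python
import numpy as np

def group_stats(X, cell_ids):
    """Per-cell size, mean <x>_A and mean outer product <x x^T>_A."""
    counts, mean_x, mean_xx = [], [], []
    for c in np.unique(cell_ids):
        pts = X[cell_ids == c]
        counts.append(float(len(pts)))
        mean_x.append(pts.mean(axis=0))
        mean_xx.append(pts.T @ pts / len(pts))
    return np.array(counts), np.stack(mean_x), np.stack(mean_xx)

def avg_log_gauss(mean_x, mean_xx, mu, cov):
    """Cell average of log N(x; mu, cov), using only the cell statistics."""
    d = mu.shape[0]
    prec = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    # <(x-mu)^T P (x-mu)>_A = tr(P <xx^T>_A) - 2 mu^T P <x>_A + mu^T P mu
    quad = (np.einsum("ij,aji->a", prec, mean_xx)
            - 2.0 * (mean_x @ prec @ mu)
            + mu @ prec @ mu)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)

def grouped_em_step(counts, mean_x, mean_xx, weights, mus, covs):
    """One iteration; returns new parameters and the bound at the old ones."""
    k = len(weights)
    # (i) Shared responsibilities: q_A(s) proportional to
    #     weight_s * exp(<log p(x|s)>_A).
    log_q = np.stack(
        [np.log(weights[s]) + avg_log_gauss(mean_x, mean_xx, mus[s], covs[s])
         for s in range(k)], axis=1)                      # shape (cells, k)
    log_norm = np.logaddexp.reduce(log_q, axis=1)
    q = np.exp(log_q - log_norm[:, None])
    # With the optimal q, the bound is sum_A |A| * log of the normalizer.
    bound = float(counts @ log_norm)
    # (ii) Re-estimate parameters from size-weighted cell statistics.
    w = counts[:, None] * q
    mass = w.sum(axis=0)
    new_weights = mass / counts.sum()
    new_mus = (w.T @ mean_x) / mass[:, None]
    second = np.einsum("ak,aij->kij", w, mean_xx) / mass[:, None, None]
    new_covs = second - np.einsum("ki,kj->kij", new_mus, new_mus)
    return new_weights, new_mus, new_covs, bound

# Demo on synthetic data (assumption: grid cells of side 1.0 as the partition).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-3.0, 1.0, (200, 2)),
                    rng.normal(3.0, 1.0, (200, 2))])
_, cell_ids = np.unique(np.floor(X / 1.0), axis=0, return_inverse=True)
cell_ids = cell_ids.reshape(-1)
counts, mean_x, mean_xx = group_stats(X, cell_ids)

weights = np.array([0.5, 0.5])
mus = np.stack([X[0], X[-1]])
covs = np.stack([np.eye(2), np.eye(2)])
bounds = []
for _ in range(20):
    weights, mus, covs, b = grouped_em_step(
        counts, mean_x, mean_xx, weights, mus, covs)
    bounds.append(b)
```

In this sketch the recorded `bounds` should never decrease from one iteration to the next, and placing every point in its own cell recovers ordinary EM, where the bound coincides with the data log-likelihood. Coarser partitions trade tightness of the bound for speed.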