We propose several simple techniques that dramatically reduce both the memory demand and the computational effort of building multinomial mixture models with the EM algorithm. The reason for the dramatic performance improvement is that the techniques exploit two properties of the data: the data is sparse, and it contains many repeating records. We claim that particular sources of data consistently satisfy these properties; clickstream and retail data are excellent examples, being very sparse and containing many repetitions. Using these simple techniques, we observe huge speed-ups and compression rates on real-life clickstream data sets compared to the standard implementation of EM.
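To make the two ideas concrete, here is a minimal sketch (not the authors' actual implementation) of how sparsity and repeated records can be exploited in EM for a multinomial mixture. Records are assumed to be sparse tuples of `(item, count)` pairs; the hypothetical `compress` helper collapses duplicate records into `(record, multiplicity)` pairs, so each distinct record is processed once per iteration, and the E-step only touches a record's nonzero items:

```python
import math
import random
from collections import Counter

def compress(records):
    """Collapse duplicate records into (record, multiplicity) pairs.

    Each record is a tuple of (item, count) pairs -- a sparse row.
    """
    return list(Counter(records).items())

def em_sparse(records, K, iters=20, seed=0):
    """EM for a K-component multinomial mixture over sparse records."""
    rng = random.Random(seed)
    data = compress(records)                       # repeated records -> weights
    items = sorted({v for rec, _ in data for v, _ in rec})
    idx = {v: j for j, v in enumerate(items)}
    V = len(items)
    N = sum(w for _, w in data)                    # total number of records

    # Random initialisation of mixing weights and multinomial parameters.
    pi = [1.0 / K] * K
    theta = [[rng.random() + 0.1 for _ in range(V)] for _ in range(K)]
    for k in range(K):
        s = sum(theta[k])
        theta[k] = [t / s for t in theta[k]]

    for _ in range(iters):
        new_pi = [0.0] * K
        new_theta = [[1e-10] * V for _ in range(K)]  # tiny smoothing term
        for rec, w in data:
            # E-step: log-responsibilities sum only over nonzero items.
            logr = [math.log(pi[k])
                    + sum(c * math.log(theta[k][idx[v]]) for v, c in rec)
                    for k in range(K)]
            m = max(logr)
            r = [math.exp(l - m) for l in logr]
            s = sum(r)
            r = [x / s for x in r]
            # M-step accumulation, weighted by the record's multiplicity w.
            for k in range(K):
                new_pi[k] += w * r[k]
                for v, c in rec:
                    new_theta[k][idx[v]] += w * r[k] * c
        pi = [p / N for p in new_pi]
        for k in range(K):
            s = sum(new_theta[k])
            theta[k] = [t / s for t in new_theta[k]]
    return pi, theta
```

With many repetitions, `compress` shrinks the per-iteration work from the number of records to the number of *distinct* records, and the sparse inner sum replaces a pass over the full vocabulary; together these give the kind of speed-up and compression the abstract reports, without changing the fixed point of EM.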