Accelerating EM for Large Databases

  • Authors:
  • Bo Thiesson; Christopher Meek; David Heckerman

  • Affiliations:
  • Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA. thiesson@microsoft.com; meek@microsoft.com; heckerma@microsoft.com

  • Venue:
  • Machine Learning

  • Year:
  • 2001

Abstract

The EM algorithm is a popular method for parameter estimation in a variety of problems involving missing data. However, the EM algorithm often requires significant computational resources and has been dismissed as impractical for large databases. We present two approaches that significantly reduce the computational cost of applying the EM algorithm to databases with a large number of cases, including databases with large dimensionality. Both approaches are based on partial E-steps, for which we can use the results of Neal and Hinton (1998; in Jordan, M. (Ed.), Learning in Graphical Models, pp. 355–371. The Netherlands: Kluwer Academic Publishers) to obtain the standard convergence guarantees of EM. The first approach is a version of the incremental EM algorithm, described in Neal and Hinton (1998), which cycles through data cases in blocks. The number of cases in each block dramatically affects the efficiency of the algorithm, and we provide a method for selecting a near-optimal block size. The second approach, which we call lazy EM, evaluates the significance of each data case at scheduled iterations and then, for several subsequent iterations, actively uses only the significant cases. We demonstrate that both methods can significantly reduce computational costs through their application to high-dimensional real-world and synthetic mixture modeling problems for large databases.
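To make the two approaches concrete, below is a minimal Python sketch of block-based incremental EM and lazy EM for a diagonal-covariance Gaussian mixture, matching the mixture-modeling setting named in the abstract. Everything here beyond the abstract's description is an illustrative assumption: the function names, the parameters `block_size`, `n_passes`, `n_outer`, `n_lazy`, and `thresh`, the cached-statistics bookkeeping, and in particular the responsibility-threshold significance test in `lazy_em`. The paper's actual significance criterion and its block-size selection method are not specified in the abstract.

```python
import numpy as np

def gauss_logpdf(X, mu, var):
    """Log-density of each case under each diagonal-covariance Gaussian."""
    d = X.shape[1]
    diff = X[:, None, :] - mu[None, :, :]                   # (n, k, d)
    return -0.5 * (np.sum(diff ** 2 / var, axis=2)
                   + np.sum(np.log(var), axis=1)
                   + d * np.log(2.0 * np.pi))

def responsibilities(X, pi, mu, var):
    """E-step: posterior component memberships, computed stably in log space."""
    log_r = np.log(pi) + gauss_logpdf(X, mu, var)           # (n, k)
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

def m_step(S0, S1, S2):
    """Parameters from expected sufficient statistics (counts, sums, sums of squares)."""
    denom = S0[:, None] + 1e-12
    pi = S0 / S0.sum()
    mu = S1 / denom
    var = np.maximum(S2 / denom - mu ** 2, 1e-6)
    return pi, mu, var

def incremental_em(X, k, block_size, n_passes=10, seed=0):
    """Incremental EM: a partial E-step over one block, then a full M-step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)].copy()
    var = np.tile(X.var(axis=0), (k, 1))
    pi = np.full(k, 1.0 / k)
    blocks = [slice(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    S0, S1, S2 = np.zeros(k), np.zeros((k, d)), np.zeros((k, d))
    cached = [None] * len(blocks)                           # per-block statistics
    for _ in range(n_passes):
        for b, sl in enumerate(blocks):
            r = responsibilities(X[sl], pi, mu, var)        # partial E-step
            new = (r.sum(axis=0), r.T @ X[sl], r.T @ X[sl] ** 2)
            if cached[b] is not None:                       # swap this block's old
                S0 -= cached[b][0]; S1 -= cached[b][1]; S2 -= cached[b][2]
            cached[b] = new                                 # contribution for the new one
            S0 += new[0]; S1 += new[1]; S2 += new[2]
            pi, mu, var = m_step(S0, S1, S2)                # M-step after every block
    return pi, mu, var

def lazy_em(X, k, n_outer=5, n_lazy=4, thresh=0.99, seed=0):
    """Lazy EM: at scheduled iterations, freeze near-deterministic cases."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)].copy()
    var = np.tile(X.var(axis=0), (k, 1))
    pi = np.full(k, 1.0 / k)
    for _ in range(n_outer):
        r = responsibilities(X, pi, mu, var)                # scheduled full E-step
        active = r.max(axis=1) < thresh                     # ASSUMED significance test
        F0 = r[~active].sum(axis=0)                         # frozen statistics for
        F1 = r[~active].T @ X[~active]                      # the insignificant cases
        F2 = r[~active].T @ X[~active] ** 2
        for _ in range(n_lazy):                             # iterate on active cases only
            ra = responsibilities(X[active], pi, mu, var)
            pi, mu, var = m_step(F0 + ra.sum(axis=0),
                                 F1 + ra.T @ X[active],
                                 F2 + ra.T @ X[active] ** 2)
    return pi, mu, var
```

A typical call would be something like `pi, mu, var = incremental_em(X, k=10, block_size=512)`. The point the sketch tries to illustrate is why both variants remain within Neal and Hinton's partial E-step framework: in `incremental_em`, a partial E-step over one block replaces only that block's contribution to the global expected sufficient statistics, so each M-step still uses (possibly stale) information from the entire database; in `lazy_em`, the insignificant cases keep their frozen contribution for several iterations while only the significant cases are re-evaluated. The `block_size` argument is the quantity whose near-optimal value the paper provides a method for selecting.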