Fully distributed EM for very large datasets

  • Authors:
  • Jason Wolfe, Aria Haghighi, Dan Klein

  • Affiliations:
  • University of California, Berkeley, CA (all authors)

  • Venue:
  • Proceedings of the 25th International Conference on Machine Learning (ICML)
  • Year:
  • 2008

Abstract

In EM and related algorithms, E-step computations distribute easily because data items are independent given the parameters. For very large datasets, however, even storing all of the parameters on a single node for the M-step can be impractical. We present a framework that fully distributes the entire EM procedure: each node interacts only with the parameters relevant to its data, sending messages to other nodes along a junction-tree topology. We demonstrate improvements over a MapReduce topology on two tasks: word alignment and topic modeling.
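The distributed E-step pattern the abstract opens with (and that the MapReduce baseline embodies) can be sketched in a few lines. Below is a minimal, hedged simulation in Python/NumPy for a two-component coin-mixture model: shards stand in for nodes, each computes its expected sufficient statistics from the shared parameters alone, and the summed counts drive the M-step. All identifiers here (e_step_local, run_em, the shard layout) are illustrative assumptions, not the paper's code; the paper's actual contribution goes further by also partitioning the parameters themselves and routing count messages along a junction tree, so no node ever holds the full model.

  import numpy as np

  rng = np.random.default_rng(0)

  def e_step_local(shard, theta, pi):
      """Expected sufficient statistics for one shard (one node's data).

      Each entry of `shard` is the number of heads in a 10-flip trial.
      This step needs only (theta, pi), so it runs independently per node.
      """
      heads = shard
      tails = 10 - shard
      # log-responsibility of each component for each trial
      log_lik = (np.log(theta)[None, :] * heads[:, None]
                 + np.log(1.0 - theta)[None, :] * tails[:, None]
                 + np.log(pi)[None, :])
      resp = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
      resp /= resp.sum(axis=1, keepdims=True)
      # local expected counts: component mass, expected heads, expected flips
      return resp.sum(axis=0), resp.T @ heads, resp.sum(axis=0) * 10

  def run_em(shards, n_iter=50):
      theta = np.array([0.4, 0.6])   # per-component head probabilities
      pi = np.array([0.5, 0.5])      # mixture weights
      for _ in range(n_iter):
          # E-step distributes over shards; in a real deployment each call
          # runs on its own machine and the sums below are an all-reduce.
          stats = [e_step_local(s, theta, pi) for s in shards]
          n_k = sum(s[0] for s in stats)
          heads_k = sum(s[1] for s in stats)
          flips_k = sum(s[2] for s in stats)
          # M-step: normalize the combined expected counts.
          theta = heads_k / flips_k
          pi = n_k / n_k.sum()
      return theta, pi

  # Synthetic data: 3 shards of 10-flip trials from two biased coins.
  true_theta = np.array([0.2, 0.8])
  z = rng.integers(0, 2, size=600)
  data = rng.binomial(10, true_theta[z])
  print(run_em(np.array_split(data, 3)))

Note the bottleneck this sketch leaves in place: every node reads the complete (theta, pi) and the combined counts are reduced to one place, which is exactly what becomes impractical at very large parameter counts and what the junction-tree message-passing scheme is designed to avoid.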