Fully distributed EM for very large datasets

  • Authors:
  • Jason Wolfe, Aria Haghighi, Dan Klein

  • Affiliations:
  • University of California, Berkeley, CA (all authors)

  • Venue:
  • Proceedings of the 25th International Conference on Machine Learning (ICML)
  • Year:
  • 2008

Abstract

In EM and related algorithms, E-step computations distribute easily because data items are independent given the parameters. For very large datasets, however, even storing all of the parameters on a single node for the M-step can be impractical. We present a framework that fully distributes the entire EM procedure: each node interacts only with the parameters relevant to its data, sending messages to other nodes along a junction-tree topology. We demonstrate improvements over a MapReduce topology on two tasks: word alignment and topic modeling.
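The distributed E-step pattern the abstract opens with (and that the MapReduce baseline embodies) can be sketched in a few lines. Below is a minimal, hedged simulation in Python/NumPy for a two-component coin-mixture model: shards stand in for nodes, each computes its expected sufficient statistics from the shared parameters alone, and the summed counts drive the M-step. All identifiers here (e_step_local, run_em, the shard layout) are illustrative assumptions, not the paper's code; the paper's actual contribution goes further by also partitioning the parameters themselves and routing count messages along a junction tree, so no node ever holds the full model.

  import numpy as np

  rng = np.random.default_rng(0)

  def e_step_local(shard, theta, pi):
      """Expected sufficient statistics for one shard (one node's data).

      Each entry of `shard` is the number of heads in a 10-flip trial.
      This step needs only (theta, pi), so it runs independently per node.
      """
      heads = shard
      tails = 10 - shard
      # log-responsibility of each component for each trial
      log_lik = (np.log(theta)[None, :] * heads[:, None]
                 + np.log(1.0 - theta)[None, :] * tails[:, None]
                 + np.log(pi)[None, :])
      resp = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
      resp /= resp.sum(axis=1, keepdims=True)
      # local expected counts: component mass, expected heads, expected flips
      return resp.sum(axis=0), resp.T @ heads, resp.sum(axis=0) * 10

  def run_em(shards, n_iter=50):
      theta = np.array([0.4, 0.6])   # per-component head probabilities
      pi = np.array([0.5, 0.5])      # mixture weights
      for _ in range(n_iter):
          # E-step distributes over shards; in a real deployment each call
          # runs on its own machine and the sums below are an all-reduce.
          stats = [e_step_local(s, theta, pi) for s in shards]
          n_k = sum(s[0] for s in stats)
          heads_k = sum(s[1] for s in stats)
          flips_k = sum(s[2] for s in stats)
          # M-step: normalize the combined expected counts.
          theta = heads_k / flips_k
          pi = n_k / n_k.sum()
      return theta, pi

  # Synthetic data: 3 shards of 10-flip trials from two biased coins.
  true_theta = np.array([0.2, 0.8])
  z = rng.integers(0, 2, size=600)
  data = rng.binomial(10, true_theta[z])
  print(run_em(np.array_split(data, 3)))

Note the bottleneck this sketch leaves in place: every node reads the complete (theta, pi) and the combined counts are reduced to one place, which is exactly what becomes impractical at very large parameter counts and what the junction-tree message-passing scheme is designed to avoid.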