Scalable inference in latent variable models

  • Authors:
  • Amr Ahmed; Mohamed Aly; Joseph Gonzalez; Shravan Narayanamurthy; Alexander J. Smola

  • Affiliations:
  • Yahoo! Research, Santa Clara, CA, USA; Yahoo! Research, Santa Clara, CA, USA; Carnegie Mellon University, Santa Clara, CA, USA; Yahoo! Research, Santa Clara, CA, USA; Yahoo! Research, Santa Clara, CA, USA

  • Venue:
  • Proceedings of the fifth ACM international conference on Web search and data mining (WSDM '12)
  • Year:
  • 2012

Abstract

Latent variable techniques are pivotal in tasks ranging from predicting user click patterns and targeting ads to organizing the news and managing user-generated content. Techniques such as topic modeling, clustering, and subspace estimation provide substantial insight into the latent structure of complex data with little or no external guidance, making them ideal for reasoning about large-scale, rapidly evolving datasets. Unfortunately, due to the data dependencies and global state introduced by latent variables and the iterative nature of latent variable inference, these techniques are often prohibitively expensive to apply to large-scale, streaming datasets. In this paper we present a scalable parallel framework for efficient inference in latent variable models over streaming web-scale data. Our framework addresses three key challenges: 1) synchronizing the global state, which includes global latent variables (e.g., cluster centers and dictionaries); 2) efficiently storing and retrieving the large local state, which includes the data points and their corresponding latent variables (e.g., cluster membership); and 3) sequentially incorporating streaming data (e.g., the news). We address these challenges by introducing: 1) a novel delta-based aggregation system with a bandwidth-efficient communication protocol; 2) schedule-aware out-of-core storage; and 3) approximate forward sampling to rapidly incorporate new data. We demonstrate state-of-the-art performance of our framework by easily tackling datasets two orders of magnitude larger than those addressed by the current state of the art. Furthermore, we provide an optimized and easily customizable open-source implementation of the framework.
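
The delta-based aggregation idea mentioned in the abstract can be illustrated with a minimal sketch: a worker keeps a local cache of a global count table (e.g., topic-word counts), accumulates the changes it makes, and at synchronization time transmits only the non-zero deltas before refreshing its cache. The class name and the single-process server dictionary below are hypothetical stand-ins, not the paper's actual distributed implementation or wire protocol.

```python
from collections import defaultdict

class DeltaSyncWorker:
    """Sketch of delta-based synchronization of a shared count table.

    `server` is a plain dict standing in for the distributed global state;
    the real system shards this state and uses a bandwidth-efficient protocol.
    """

    def __init__(self, server):
        self.server = server                 # stand-in for the global state store
        self.local = defaultdict(int)        # worker's cached copy of global counts
        self.delta = defaultdict(int)        # changes accumulated since last sync

    def update(self, key, change):
        """Apply a local update (e.g., a count increment or decrement)."""
        self.local[key] += change
        self.delta[key] += change

    def synchronize(self):
        """Push only the non-zero deltas, then pull the merged global values."""
        for key, change in self.delta.items():
            if change != 0:
                self.server[key] = self.server.get(key, 0) + change
        self.delta.clear()
        # Refresh the local cache with the (possibly newer) global values.
        for key in list(self.local):
            self.local[key] = self.server.get(key, 0)
```

Sending only deltas rather than the full table is what keeps communication proportional to the amount of local change, which matters when the global state is large and updates between synchronizations are sparse.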
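The approximate forward sampling component can likewise be sketched for an LDA-style topic model: each word of a newly arrived document gets its topic drawn once from the predictive distribution implied by the current global counts, so the document is incorporated in a single pass without revisiting older data. The function name, hyperparameters, and count-array layout below are illustrative assumptions, not the framework's API or exact sampler.

```python
import numpy as np

def forward_sample_document(doc, n_topic_word, n_topic, alpha=0.1, beta=0.01):
    """Assign a topic to each word id in `doc` with one forward pass.

    n_topic_word: (K, V) array of global topic-word counts.
    n_topic:      (K,) array of global per-topic totals.
    """
    K, V = n_topic_word.shape
    n_doc_topic = np.zeros(K)                       # local state for this document
    assignments = []
    for w in doc:
        # Predictive probability of each topic given counts seen so far.
        p = (n_doc_topic + alpha) * (n_topic_word[:, w] + beta) / (n_topic + V * beta)
        z = np.random.choice(K, p=p / p.sum())
        assignments.append(z)
        n_doc_topic[z] += 1
        n_topic_word[z, w] += 1                      # these increments become deltas
        n_topic[z] += 1                              # pushed to the global state
    return assignments, n_doc_topic
```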