The generalized dirichlet distribution in enhanced topic detection

  • Authors:
  • Karla L. Caballero;Joel Barajas;Ram Akella

  • Affiliations:
  • University of California Santa Cruz, Santa Cruz, CA, USA;University of California Santa Cruz, Santa Cruz, CA, USA;University of California Santa Cruz, Santa Cruz, CA, USA

  • Venue:
  • Proceedings of the 21st ACM international conference on Information and knowledge management
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

We present a new, robust and computationally efficient Hierarchical Bayesian model for effective topic correlation modeling. We model the prior distribution of topics by a Generalized Dirichlet distribution (GD) rather than a Dirichlet distribution as in Latent Dirichlet Allocation (LDA). We define this model as GD-LDA. This framework captures correlations between topics, as in the Correlated Topic Model (CTM) and Pachinko Allocation Model (PAM), and is faster to infer than CTM and PAM. GD-LDA is effective to avoid over-fitting as the number of topics is increased. As a tree model, it accommodates the most important set of topics in the upper part of the tree based on their probability mass. Thus, GD-LDA provides the ability to choose significant topics effectively. To discover topic relationships, we perform hyper-parameter estimation based on Monte Carlo EM Estimation. We provide results using Empirical Likelihood(EL) in 4 public datasets from TREC and NIPS. Then, we present the performance of GD-LDA in ad hoc information retrieval (IR) based on MAP, P@10, and Discounted Gain. We discuss an empirical comparison of the fitting time. We demonstrate significant improvement over CTM, LDA, and PAM for EL estimation. For all the IR measures, GD-LDA shows higher performance than LDA, the dominant topic model in IR. All these improvements with a small increase in fitting time than LDA, as opposed to CTM and PAM.