Smoothing LDA model for text categorization

  • Authors:
  • Wenbo Li;Le Sun;Yuanyong Feng;Dakun Zhang

  • Affiliations:
  • Institute of Software, Chinese Academy of Sciences, Beijing, China and Graduate University of Chinese Academy of Sciences, Beijing, China;Institute of Software, Chinese Academy of Sciences, Beijing, China;Institute of Software, Chinese Academy of Sciences, Beijing, China and Graduate University of Chinese Academy of Sciences, Beijing, China;Institute of Software, Chinese Academy of Sciences, Beijing, China and Graduate University of Chinese Academy of Sciences, Beijing, China

  • Venue:
  • AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

Latent Dirichlet Allocation (LDA) is a document level language model. In general, LDA employ the symmetry Dirichlet distribution as prior of the topic-words' distributions to implement model smoothing. In this paper, we propose a data-driven smoothing strategy in which probability mass is allocated from smoothing-data to latent variables by the intrinsic inference procedure of LDA. In such a way, the arbitrariness of choosing latent variables' priors for the multi-level graphical model is overcome. Following this data-driven strategy, two concrete methods, Laplacian smoothing and Jelinek-Mercer smoothing, are employed to LDA model. Evaluations on different text categorization collections show data-driven smoothing can significantly improve the performance in balanced and unbalanced corpora.