Topic-based language models using Dirichlet Mixtures

  • Authors:
  • Kugatsu Sadamitsu; Takuya Mishina; Mikio Yamamoto

  • Affiliations:
  • Kugatsu Sadamitsu and Mikio Yamamoto: Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, 305-8573 Japan; Takuya Mishina: IBM Research, Tokyo Research Laboratory, Yamato, 242-8502 Japan

  • Venue:
  • Systems and Computers in Japan
  • Year:
  • 2007

Abstract

We propose a generative text model that uses Dirichlet Mixtures as the prior distribution over the parameters of a multinomial distribution; the corresponding compound distribution is a Polya Mixture. We show that the model performs well when applied to statistical language modeling. In this paper, we discuss methods for estimating the parameters of Dirichlet Mixtures and for computing the posterior expectations needed for adaptation, and we compare the model with two previous text models: the Mixture of Unigrams, which is often used to incorporate topics into statistical language models, and LDA (Latent Dirichlet Allocation), a typical generative text model. In experiments measuring document probability and dynamic adaptation of n-gram models on newspaper articles, the proposed model achieves lower perplexity than both previous models with a small number of mixture components. © 2007 Wiley Periodicals, Inc. Syst Comp Jpn, 38(12): 76–85, 2007; Published online in Wiley InterScience (). DOI 10.1002/scj.20629
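
To make the abstract's two core computations concrete, here is a minimal sketch of (a) the document probability under a Polya Mixture, i.e. a mixture of Dirichlet compound multinomial components, and (b) the posterior-expected unigram distribution used for adaptation. This is an illustration under stated assumptions, not the authors' implementation: the parameter names (`weights`, `alphas`), the NumPy/SciPy dependency, and the toy values in the usage example are all hypothetical.

```python
import numpy as np
from scipy.special import gammaln


def log_polya(counts, alpha):
    """Log-probability of a word-count vector under one Polya
    (Dirichlet compound multinomial) component with parameter alpha:
    log Gamma(a0) - log Gamma(a0 + n) + sum_v [log Gamma(a_v + c_v) - log Gamma(a_v)]."""
    n = counts.sum()
    a0 = alpha.sum()
    return (gammaln(a0) - gammaln(a0 + n)
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))


def document_log_prob(counts, weights, alphas):
    """Log-probability of a document under a Polya Mixture:
    p(d) = sum_k lambda_k * Polya(d | alpha_k), via log-sum-exp."""
    comp = np.array([np.log(w) + log_polya(counts, a)
                     for w, a in zip(weights, alphas)])
    m = comp.max()
    return m + np.log(np.exp(comp - m).sum())


def adapted_unigram(counts, weights, alphas):
    """Posterior-expected unigram distribution given observed history counts:
    E[theta | d] = sum_k p(k | d) * (alpha_k + counts) / (alpha_k0 + n)."""
    comp = np.array([np.log(w) + log_polya(counts, a)
                     for w, a in zip(weights, alphas)])
    post = np.exp(comp - comp.max())
    post /= post.sum()  # component responsibilities p(k | d)
    means = np.stack([(a + counts) / (a.sum() + counts.sum())
                      for a in alphas])  # per-component Dirichlet posterior means
    return post @ means


# Usage with a hypothetical 2-component mixture over a 5-word vocabulary.
weights = np.array([0.6, 0.4])
alphas = [np.array([2.0, 1.0, 0.5, 0.5, 0.1]),
          np.array([0.1, 0.5, 0.5, 1.0, 2.0])]
counts = np.array([3, 1, 0, 0, 1])
print(document_log_prob(counts, weights, alphas))
print(adapted_unigram(counts, weights, alphas))
```

The adapted unigram distribution is the quantity a dynamic n-gram adaptation scheme would interpolate with a baseline model: it shifts probability mass toward the topic components that best explain the observed history.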