Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval

  • Authors:
  • Xuerui Wang; Andrew McCallum; Xing Wei

  • Venue:
  • ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
  • Year:
  • 2007

Abstract

Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. However, word order and phrases are often critical to capturing the meaning of text in many text mining tasks. This paper presents topical n-grams, a topic model that discovers topics as well as topical phrases. The probabilistic model generates words in their textual order: for each word, it first samples a topic, then samples the word's status as a unigram or bigram, and then samples the word from a topic-specific unigram or bigram distribution. Thus our model can treat "white house" as a phrase with a special meaning in the 'politics' topic, but not in the 'real estate' topic. Successive bigrams form longer phrases. We present experiments showing meaningful phrases and more interpretable topics from the NIPS data, and improved information retrieval performance on a TREC collection.
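
The three sampling steps described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the vocabulary, the probability tables, the names `generate_document`, `BIGRAM_STATUS`, and `DEFAULT_STATUS`, and the choice to condition the bigram status on the current topic and the previous word are all assumptions made here for illustration. The paper's exact conditioning and its Dirichlet priors are not reproduced.

```python
import random

# Toy parameters invented for illustration only (not taken from the paper).
TOPICS = ["politics", "real_estate"]
VOCAB = ["white", "house", "senate", "mortgage"]

# Topic-specific unigram distributions over VOCAB.
UNIGRAM = {
    "politics": [0.3, 0.2, 0.4, 0.1],
    "real_estate": [0.2, 0.4, 0.05, 0.35],
}

# Topic-specific bigram distributions over VOCAB, keyed by (topic, previous word).
# Pairs without an entry fall back to the topic's unigram distribution.
BIGRAM = {
    ("politics", "white"): [0.02, 0.9, 0.06, 0.02],
}

# Assumed probability that the current word forms a bigram with the previous
# word, keyed by (topic, previous word). Hypothetical values.
BIGRAM_STATUS = {
    ("politics", "white"): 0.9,
    ("real_estate", "white"): 0.1,
}
DEFAULT_STATUS = 0.05  # hypothetical fallback for unlisted pairs


def generate_document(n_words, topic_weights, rng):
    """Generate words in textual order: topic, then status, then word."""
    words, topics, statuses = [], [], []
    prev = None
    for _ in range(n_words):
        # Step 1: sample a topic for this position.
        topic = rng.choices(TOPICS, weights=topic_weights)[0]

        # Step 2: sample the unigram/bigram status (the first word has no predecessor).
        if prev is None:
            is_bigram = False
        else:
            p = BIGRAM_STATUS.get((topic, prev), DEFAULT_STATUS)
            is_bigram = rng.random() < p

        # Step 3: sample the word from the topic-specific unigram or bigram distribution.
        if is_bigram:
            weights = BIGRAM.get((topic, prev), UNIGRAM[topic])
        else:
            weights = UNIGRAM[topic]
        word = rng.choices(VOCAB, weights=weights)[0]

        words.append(word)
        topics.append(topic)
        statuses.append(is_bigram)
        prev = word
    return words, topics, statuses


if __name__ == "__main__":
    doc, doc_topics, doc_statuses = generate_document(10, [0.5, 0.5], random.Random(0))
    for w, t, s in zip(doc, doc_topics, doc_statuses):
        print(w, t, "bigram" if s else "unigram")
```

With these toy tables, "house" following "white" is far more likely to be flagged as a bigram under the 'politics' topic than under 'real estate', which mirrors the "white house" example. Runs of consecutive positions flagged as bigrams would be read as longer phrases.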