Modeling Chinese documents with topical word-character models

  • Authors:
  • Wei Hu; Nobuyuki Shimizu; Hiroshi Nakagawa; Huanye Sheng

  • Affiliations:
  • Shanghai Jiao Tong University, Shanghai, China; The University of Tokyo, Tokyo, Japan; The University of Tokyo, Tokyo, Japan; Shanghai Jiao Tong University, Shanghai, China

  • Venue:
  • COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
  • Year:
  • 2008

Abstract

As Chinese text is written without word boundaries, effectively recognizing Chinese words is like recognizing collocations in English, substituting characters for words and words for collocations. However, existing topical models that involve collocations share a common limitation: instead of directly assigning a topic to a collocation, they take the topic of a word within the collocation as the topic of the whole collocation. This is unsatisfactory for topical modeling of Chinese documents. Thus, we propose a topical word-character model (TWC), which allows two distinct types of topics: word topics and character topics. We evaluated TWC both qualitatively and quantitatively to show that it is a powerful and promising topic model.