The Journal of Machine Learning Research
Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval
ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
On collocations and topic models
ACM Transactions on Speech and Language Processing (TSLP) - Special issue on multiword expressions: From theory to practice and use, part 2
Hi-index | 0.00 |
As Chinese text is written without word boundaries, effectively recognizing Chinese words is like recognizing collocations in English, substituting characters for words and words for collocations. However, existing topical models that involve collocations have a common limitation. Instead of directly assigning a topic to a collocation, they take the topic of a word within the collocation as the topic of the whole collocation. This is unsatisfactory for topical modeling of Chinese documents. Thus, we propose a topical word-character model (TWC), which allows two distinct types of topics: word topic and character topic. We evaluated TWC both qualitatively and quantitatively to show that it is a powerful and a promising topic model.