Topic detection by topic model induced distance using biased initiation

  • Authors:
  • Yonghui Wu;Yuxin Ding;Xiaolong Wang;Jun Xu

  • Affiliations:
  • Harbin Institute of Technology, Harbin, People's Republic of China and Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen Graduate School, Shenzhe ...;Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen Graduate School, Shenzhen, People's Republic of China;Harbin Institute of Technology, Harbin, People's Republic of China and Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen Graduate School, Shenzhe ...;Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen Graduate School, Shenzhen, People's Republic of China

  • Venue:
  • AST/UCMA/ISA/ACN'10 Proceedings of the 2010 international conference on Advances in computer science and information technology
  • Year:
  • 2010

Abstract

Clustering is widely used in topic detection tasks. However, vector space (VS) model based distances, such as the cosine-like distance, yield low precision and recall when the corpus contains many related topics. In this paper, we propose a new distance measure: the Topic Model (TM) induced distance. Assuming that the word distribution differs from topic to topic, the documents can be treated as a sample from a mixture of k topic models, which can be estimated using expectation maximization (EM). A biased initiation method is proposed for topic decomposition using EM, which produces a converged matrix from which the TM induced distance is derived. Collections of web news are clustered into classes using this TM distance. A series of experiments is described on a corpus containing 5033 web news articles from 30 topics. K-means clustering is performed on test sets with different numbers of topics. A comparison of the clustering results using the TM induced distance and the traditional cosine-like distance is given. The experimental results show that the proposed topic decomposition method using biased initiation is more effective than topic decomposition using random values. The TM induced distance generates more topical groups than the VS model based cosine-like distance. In web news collections containing related topics, the TM induced distance achieves better precision and recall.
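
The abstract does not give the formulas, so the following is only a rough sketch of the general idea, not the authors' method. It assumes a mixture of unigram topic models fitted by EM, takes the "converged matrix" to be the document-by-topic posterior matrix, and takes the TM induced distance to be the Euclidean distance between rows of that matrix. The biased start is imitated by seeding each topic with one hand-picked document instead of random values. The function names, the `seed_docs` parameter, the smoothing constant `alpha`, and the toy counts are all hypothetical.

```python
import numpy as np

def em_mixture_of_unigrams(X, seed_docs, n_iter=50, alpha=0.01):
    """Fit a mixture of k unigram topic models with EM.

    X         : (n_docs, n_words) matrix of word counts.
    seed_docs : one document index per topic (assumed stand-in for a biased start).
    Returns the (n_docs, k) posterior matrix P(topic | document).
    """
    X = np.asarray(X, dtype=float)
    k = len(seed_docs)
    # Biased start (assumption): each topic's word distribution is a
    # smoothed copy of its seed document rather than random values.
    phi = X[seed_docs] + alpha
    phi /= phi.sum(axis=1, keepdims=True)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step in log space: log P(topic) + sum_w count * log P(word | topic)
        log_r = np.log(pi) + X @ np.log(phi).T
        log_r -= log_r.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate topic priors and smoothed word distributions
        pi = np.clip(r.mean(axis=0), 1e-12, None)
        pi /= pi.sum()
        phi = r.T @ X + alpha
        phi /= phi.sum(axis=1, keepdims=True)
    return r

def tm_distance(posterior):
    """Pairwise Euclidean distance between topic posterior rows (assumed definition)."""
    diff = posterior[:, None, :] - posterior[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy corpus (made up): documents 0-1 share one vocabulary, documents 2-3 another.
X = np.array([[5, 4, 0, 0],
              [4, 6, 0, 1],
              [0, 0, 5, 5],
              [1, 0, 6, 4]])
post = em_mixture_of_unigrams(X, seed_docs=[0, 2])
D = tm_distance(post)
```

Under these assumptions, the rows of `post` (or the matrix `D`) would be what a K-means step consumes in place of cosine-like distances between raw term vectors.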