Topic detection by topic model induced distance using biased initiation

Authors:
Yonghui Wu;Yuxin Ding;Xiaolong Wang;Jun Xu
Affiliations:
Harbin Institute of Technology, Harbin, People's Republic of China and Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen Graduate School, Shenzhe ...;Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen Graduate School, Shenzhen, People's Republic of China;Harbin Institute of Technology, Harbin, People's Republic of China and Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen Graduate School, Shenzhe ...;Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen Graduate School, Shenzhen, People's Republic of China
Venue:
AST/UCMA/ISA/ACN'10 Proceedings of the 2010 international conference on Advances in computer science and information technology
Year:
2010

Abstract

Clustering is widely used in topic detection tasks. However, distances based on the vector space (VS) model, such as cosine-like distances, yield low precision and recall when the corpus contains many related topics. In this paper, we propose a new distance measure: the Topic Model (TM) induced distance. Assuming that the word distribution differs from topic to topic, the documents can be treated as a sample from a mixture of k topic models, which can be estimated using expectation maximization (EM). We propose a biased initiation method for topic decomposition using EM, which generates a converged matrix from which the TM induced distance is derived. Collections of web news are clustered into classes using this TM distance. A series of experiments is described on a corpus of 5033 web news articles from 30 topics. K-means clustering is run on test sets with different numbers of topics, and the clustering results obtained with the TM induced distance are compared with those obtained with the traditional cosine-like distance. The experimental results show that the proposed topic decomposition method using biased initiation is more effective than topic decomposition using random values. The TM induced distance generates more topical groups than the VS model based cosine-like distance. In web news collections containing related topics, the TM induced distance achieves better precision and recall.
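
The abstract does not spell out the biased initiation scheme or the exact form of the TM induced distance, so the following is only a minimal sketch of the general idea under stated assumptions. It fits a mixture of k unigram topic models with EM, seeds each topic from one chosen document instead of random values (a hypothetical stand-in for biased initiation), and measures document distance as the Euclidean distance between posterior topic vectors (one possible induced distance, not necessarily the authors'). The function names `biased_init`, `em_topics`, and `tm_distance`, the smoothing constant, and the toy documents are all illustrative.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def biased_init(docs, seed_ids, vocab, smoothing=0.1):
    """Hypothetical biased initialization: topic j starts from the
    smoothed word frequencies of the seed document seed_ids[j]."""
    topics = []
    for d in seed_ids:
        probs = normalize([docs[d].get(w, 0) + smoothing for w in vocab])
        topics.append(dict(zip(vocab, probs)))
    return topics

def em_topics(docs, topics, n_iter=30, smoothing=0.1):
    """EM for a mixture of unigram topic models.
    Returns (post, topics) with post[d][j] = P(topic j | document d)."""
    vocab = list(topics[0])
    k = len(topics)
    priors = [1.0 / k] * k
    post = []
    for _ in range(n_iter):
        # E-step: posterior over topics per document, computed in log space.
        post = []
        for doc in docs:
            logp = [math.log(priors[j])
                    + sum(c * math.log(topics[j][w]) for w, c in doc.items())
                    for j in range(k)]
            peak = max(logp)
            post.append(normalize([math.exp(lp - peak) for lp in logp]))
        # M-step: re-estimate topic priors and smoothed word distributions.
        priors = normalize([sum(p[j] for p in post) + 1e-9 for j in range(k)])
        topics = []
        for j in range(k):
            counts = [smoothing + sum(post[d][j] * docs[d].get(w, 0)
                                      for d in range(len(docs)))
                      for w in vocab]
            topics.append(dict(zip(vocab, normalize(counts))))
    return post, topics

def tm_distance(p, q):
    """Assumed induced distance: Euclidean distance between the
    posterior topic vectors of two documents."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Toy corpus (invented for illustration): word -> count per document.
docs = [
    {"match": 3, "goal": 2, "team": 1},
    {"goal": 2, "team": 2, "coach": 1},
    {"market": 3, "stock": 2, "bank": 1},
    {"stock": 2, "bank": 2, "rate": 1},
]
vocab = sorted({w for d in docs for w in d})
topics0 = biased_init(docs, seed_ids=[0, 2], vocab=vocab)
post, fitted_topics = em_topics(docs, topics0)
same_topic = tm_distance(post[0], post[1])
cross_topic = tm_distance(post[0], post[2])
```

In this sketch the rows of `post` play the role of the converged matrix mentioned in the abstract; a K-means step would then cluster those rows using `tm_distance` in place of a cosine-like distance on raw term vectors.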