Mining common topics from multiple asynchronous text streams

  • Authors:
  • Xiang Wang; Kai Zhang; Xiaoming Jin; Dou Shen

  • Affiliations:
  • Tsinghua University, Beijing, China; Tsinghua University, Beijing, China; Tsinghua University, Beijing, China; Microsoft Adcenter Labs, One Microsoft Way, Redmond, WA

  • Venue:
  • Proceedings of the Second ACM International Conference on Web Search and Data Mining
  • Year:
  • 2009

Abstract

Text streams are becoming more and more ubiquitous, in the form of news feeds, weblog archives and so on, resulting in a large volume of data. An effective way to explore the semantic as well as the temporal information in text streams is topic mining, which can further facilitate other knowledge discovery procedures. In many applications, we face multiple text streams that are related to each other and share common topics. The correlation among these streams can provide more meaningful and comprehensive clues for topic mining than those from each individual stream. However, it is nontrivial to explore this correlation in the presence of asynchronism among multiple streams, i.e., documents from different streams about the same topic may have different timestamps. This problem remains unsolved in the context of topic mining. In this paper, we formally address this problem and put forward a novel algorithm based on the generative topic model. Our algorithm consists of two alternating steps: the first step extracts common topics from multiple streams based on the timestamps adjusted by the second step; the second step adjusts the timestamps of the documents according to the time distribution of the topics discovered by the first step. We perform these two steps alternately, and monotone convergence of our objective function is guaranteed. The effectiveness and advantage of our approach are demonstrated by extensive empirical studies on two real data sets, consisting of six research paper streams and two news article streams, respectively.
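
The two alternating steps described in the abstract can be illustrated with a small toy sketch in Python. This is not the authors' algorithm or objective function: it assumes hard topic assignments (one topic per document), add-alpha smoothing, and a hypothetical `max_shift` window that limits how far a timestamp may move from its original value. Any constraints the paper places on the timestamp adjustment are not modeled, documents from all streams are simply pooled, and no convergence guarantee is claimed for this simplification. The names `fit_topics`, `doc_score`, and `align_streams` are invented for illustration.

```python
import math
from collections import Counter

def fit_topics(docs, times, assign, n_topics, n_slots, vocab, alpha=0.1):
    """Estimate smoothed word and time distributions for each topic."""
    word_p, time_p = [], []
    for k in range(n_topics):
        wc, tc = Counter(), Counter()
        for words, t, z in zip(docs, times, assign):
            if z == k:
                wc.update(words)
                tc[t] += 1
        wtot = sum(wc.values()) + alpha * len(vocab)
        ttot = sum(tc.values()) + alpha * n_slots
        word_p.append({w: (wc[w] + alpha) / wtot for w in vocab})
        time_p.append([(tc[t] + alpha) / ttot for t in range(n_slots)])
    return word_p, time_p

def doc_score(words, t, k, word_p, time_p):
    """Log-probability of a document's words and timestamp under topic k."""
    return sum(math.log(word_p[k][w]) for w in words) + math.log(time_p[k][t])

def align_streams(docs, orig_times, n_topics, n_slots, max_shift=2, n_iter=20):
    """Alternate between topic extraction and timestamp adjustment (toy version)."""
    vocab = sorted({w for words in docs for w in words})
    times = list(orig_times)
    # Deterministic initial topic assignment (an arbitrary choice for this sketch).
    assign = [i % n_topics for i in range(len(docs))]
    for _ in range(n_iter):
        # Step 1: extract topics using the current (adjusted) timestamps.
        word_p, time_p = fit_topics(docs, times, assign, n_topics, n_slots, vocab)
        new_assign = [
            max(range(n_topics), key=lambda k: doc_score(words, t, k, word_p, time_p))
            for words, t in zip(docs, times)
        ]
        # Step 2: move each timestamp, within the allowed window around its
        # original value, to the slot most likely under the document's topic.
        new_times = []
        for t0, z in zip(orig_times, new_assign):
            lo, hi = max(0, t0 - max_shift), min(n_slots - 1, t0 + max_shift)
            new_times.append(max(range(lo, hi + 1), key=lambda t: time_p[z][t]))
        if new_times == times and new_assign == assign:
            break  # nothing changed, so further iterations would repeat
        times, assign = new_times, new_assign
    return times, assign

# Hypothetical toy input: four documents pooled from two streams.
docs = [["a", "a", "b"], ["a", "b"], ["c", "d"], ["c", "c", "d"]]
orig_times = [0, 2, 5, 7]
times, assign = align_streams(docs, orig_times, n_topics=2, n_slots=8)
print(times, assign)
```

The window around the original timestamp is one simple way to keep adjusted timestamps plausible; a faithful implementation would follow the objective function and adjustment rule defined in the paper itself.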