Leveraging network structure for incremental document clustering

  • Authors:
  • Tieyun Qian; Jianfeng Si; Qing Li; Qian Yu

  • Affiliations:
  • State Key Laboratory of Software Engineering, Wuhan University, Wuhan, China and State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; Department of Computer Science, City University of Hong Kong, Hong Kong, China; Department of Computer Science, City University of Hong Kong, Hong Kong, China; State Key Laboratory of Software Engineering, Wuhan University, Wuhan, China

  • Venue:
  • APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
  • Year:
  • 2012

Abstract

Recent studies have shown that link-based clustering methods can significantly improve on the performance of content-based clustering. However, most previous algorithms were developed for fixed data sets and are not applicable to dynamic environments such as data warehouses and online digital libraries. In this paper, we introduce a novel approach that leverages network structure for incremental clustering. Under this framework, both link and content information are used to determine the host cluster of a new document. The combination of the two types of information yields promising clustering performance. Furthermore, the status of core members is used to quickly determine whether to split or merge a new cluster. This filtering process eliminates unnecessary and time-consuming checks of textual similarity over the whole corpus, and thus greatly speeds up the entire procedure. We evaluate the proposed approach on several real-world publication data sets and conduct an extensive comparison with both classic content-based and recent link-based algorithms. The experimental results demonstrate the effectiveness and efficiency of our method.
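
The abstract does not give the scoring formula, so the following is only a minimal sketch of how link and content information might be combined to pick a host cluster for a new document. Everything in it is an assumption made for illustration and not the authors' method: the function names `cosine` and `assign_host_cluster`, the linear weight `alpha`, the use of the fraction of a document's links that fall inside a cluster as the link score, and cosine similarity against a term-count centroid as the content score.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_host_cluster(doc_terms, doc_links, clusters, alpha=0.5):
    """Pick the cluster with the highest combined link/content score.

    doc_terms: Counter of term counts for the new document.
    doc_links: set of ids of existing documents the new document links to.
    clusters:  dict mapping cluster id -> {"members": set, "centroid": Counter}.
    alpha:     hypothetical weight on the link score (assumption, not from the paper).
    """
    best_id, best_score = None, -1.0
    for cluster_id, cluster in clusters.items():
        # Link score: share of the document's links that land in this cluster.
        if doc_links:
            link_score = len(doc_links & cluster["members"]) / len(doc_links)
        else:
            link_score = 0.0
        # Content score: textual similarity to the cluster centroid.
        content_score = cosine(doc_terms, cluster["centroid"])
        score = alpha * link_score + (1 - alpha) * content_score
        if score > best_score:
            best_id, best_score = cluster_id, score
    return best_id, best_score
```

In this sketch every cluster is scored. The filtering step described in the abstract, which uses the status of core members to avoid textual similarity checks on the whole corpus, is not modelled here.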