User-related tag expansion for web document clustering

  • Authors:
  • Peng Li;Bin Wang;Wei Jin;Yachao Cui

  • Affiliations:
  • Institute of Computing Technology, Chinese Academy of Sciences, China and Graduate School of the Chinese Academy of Sciences, Beijing, China;Institute of Computing Technology, Chinese Academy of Sciences, China;Department of Computer Science, North Dakota State University;Institute of Computing Technology, Chinese Academy of Sciences, China and Graduate School of the Chinese Academy of Sciences, Beijing, China

  • Venue:
  • ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
  • Year:
  • 2011

Abstract

As high-quality descriptors of web page semantics, social annotations, or tags, have been used for web document clustering and have achieved promising results. However, most web pages have few tags (fewer than 10). This sparsity seriously limits the usefulness of tags for clustering. In this work, we propose a user-related tag expansion method to overcome the problem: it incorporates additional useful tags into the original tag document by using user tagging as background knowledge. Unfortunately, simply adding tags may cause topic drift, i.e., the dominant topic(s) of the original document may change. We address this problem by designing a novel generative model called Folk-LDA, which jointly models original and expanded tags as independent observations. Experimental results show that (1) our user-related tag expansion method can be effectively applied to over 90% of tagged web documents; (2) Folk-LDA can alleviate topic drift in expansion, especially for topic-specific documents; and (3) compared to word-based clustering, our approach, using only tags, achieves a statistically significant increase of 39% in F1 score while reducing the number of terms involved in computation by up to 76%.
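
The abstract does not spell out the expansion procedure, so the following is only an illustrative sketch of what user-related tag expansion might look like, not the authors' method. It assumes annotations are available as (user, document, tag) triples and that candidate tags are simply those the same users applied to other documents, ranked by raw frequency. The function name `expand_tags`, the `top_k` parameter, and the ranking rule are all hypothetical. Original and expanded tags are returned separately, echoing the abstract's point that Folk-LDA treats them as distinct observations; the model itself is not implemented here.

```python
from collections import Counter


def expand_tags(doc_id, annotations, top_k=5):
    """Return (original_tags, expanded_tags) for one document.

    annotations: iterable of (user, doc, tag) triples (assumed input format).
    Expansion rule (an assumption, not taken from the paper): collect tags
    that the document's annotators used on *other* documents and keep the
    top_k most frequent ones that are not already on the document.
    """
    annotations = list(annotations)

    # Tags and users directly attached to the target document.
    original = [t for u, d, t in annotations if d == doc_id]
    users = {u for u, d, t in annotations if d == doc_id}
    seen = set(original)

    # Background knowledge: what else did these users tag, and with what?
    candidates = Counter(
        t
        for u, d, t in annotations
        if u in users and d != doc_id and t not in seen
    )

    expanded = [t for t, _ in candidates.most_common(top_k)]
    return original, expanded


# Toy data, invented purely for illustration.
annotations = [
    ("u1", "d1", "python"),
    ("u1", "d2", "programming"),
    ("u1", "d3", "programming"),
    ("u2", "d1", "code"),
    ("u2", "d4", "tutorial"),
    ("u3", "d5", "cooking"),
]

original, expanded = expand_tags("d1", annotations, top_k=2)
print(original, expanded)
```

Keeping the two tag lists apart, rather than merging them into one bag of tags, leaves room for a downstream model to weight expanded tags differently and so limit the topic drift the abstract warns about.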