User-related tag expansion for web document clustering

Authors:
Peng Li;Bin Wang;Wei Jin;Yachao Cui
Affiliations:
Institute of Computing Technology, Chinese Academy of Sciences, China and Graduate School of the Chinese Academy of Sciences, Beijing, China;Institute of Computing Technology, Chinese Academy of Sciences, China;Department of Computer Science, North Dakota State University;Institute of Computing Technology, Chinese Academy of Sciences, China and Graduate School of the Chinese Academy of Sciences, Beijing, China
Venue:
ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
Year:
2011

Citing 19
Cited 0

A Comparative Study on Feature Selection in Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Mining the peanut gallery: opinion extraction and semantic classification of product reviews

WWW '03 Proceedings of the 12th international conference on World Wide Web
Latent dirichlet allocation

The Journal of Machine Learning Research
Ontologies Improve Text Document Clustering

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Cluster-based retrieval using language models

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Learning to cluster web search results

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Tracking and summarizing news on a daily basis with Columbia's Newsblaster

HLT '02 Proceedings of the second international conference on Human Language Technology Research
Can social bookmarking improve web search?

WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Tag-based social interest discovery

Proceedings of the 17th international conference on World Wide Web
Exploring social annotations for information retrieval

Proceedings of the 17th international conference on World Wide Web
Enhancing text clustering by leveraging Wikipedia semantics

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to Information Retrieval

Introduction to Information Retrieval
Personalizing Navigation in Folksonomies Using Hierarchical Tag Clustering

DaWaK '08 Proceedings of the 10th international conference on Data Warehousing and Knowledge Discovery
Personalized recommendation in social tagging systems using hierarchical clustering

Proceedings of the 2008 ACM conference on Recommender systems
Clustering the tagged web

Proceedings of the Second ACM International Conference on Web Search and Data Mining
Exploiting Wikipedia as external knowledge for document clustering

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Overcoming the brittleness bottleneck using wikipedia: enhancing text categorization with encyclopedic knowledge

AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
Feature generation for text categorization using world knowledge

IJCAI'05 Proceedings of the 19th international joint conference on Artificial intelligence
Exploit the tripartite network of social tagging for web clustering

Proceedings of the 18th ACM conference on Information and knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

As high quality descriptors of web page semantics, social annotations or tags have been used for web document clustering and achieved promising results. However, most web pages have few tags(less than 10). This sparsity seriously limits the usage of tags on clustering. In this work, we propose a user-related tag expansion method to overcome the problem, which incorporates additional useful tags into the original tag document by utilizing user tagging as background knowledge. Unfortunately, simply adding tags may cause topic drift, i.e., the dominant topic(s) of the original document may be changed. This problem is addressed in this research by designing a novel generative model called Folk-LDA, which jointly models original and expanded tags as independent observations. Experimental results show that (1)Our user-related tag expansion method can be effectively applied to over 90% tagged web documents; (2)Folk-LDA can alleviate the topic drift in expansion, especially for those topic-specific documents; (3) Compared to word-based clustering, our approach using only tags achieves a statistically significant increase of 39% on F1 score while reducing 76% terms involved in computation at best.