Wikipedia-based smoothing for enhancing text clustering

Authors:
Elahe Rahimtoroghi;Azadeh Shakery
Affiliations:
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran;School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
Venue:
AIRS'11 Proceedings of the 7th Asia conference on Information Retrieval Technology
Year:
2011

Citing 13
Cited 0

A study of smoothing methods for language models applied to Ad Hoc information retrieval

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
PageRank without hyperlinks: structural re-ranking using links induced by language models

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Generative model-based document clustering: a comparative study

Knowledge and Information Systems
Clustering short texts using wikipedia

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Enhancing text clustering by leveraging Wikipedia semantics

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Improving Text Classification by Using Encyclopedia Knowledge

ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
Statistical Language Models for Information Retrieval A Critical Review

Foundations and Trends in Information Retrieval
Exploiting Wikipedia as external knowledge for document clustering

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Query dependent pseudo-relevance feedback based on wikipedia

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Enhancing cluster labeling using wikipedia

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Overcoming the brittleness bottleneck using wikipedia: enhancing text categorization with encyclopedic knowledge

AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
WordNet-based text document clustering

ROMAND '04 Proceedings of the 3rd Workshop on RObust Methods in Analysis of Natural Language Data
Feature generation for text categorization using world knowledge

IJCAI'05 Proceedings of the 19th international joint conference on Artificial intelligence

Quantified Score

Hi-index	0.00

Visualization

Abstract

The conventional algorithms for text clustering that are based on the bag of words model, fail to fully capture the semantic relations between the words. As a result, documents describing an identical topic may not be categorized into same clusters if they use different sets of words. A generic solution for this issue is to utilize background knowledge to enrich the document contents. In this research, we adopt a language modeling approach for text clustering and propose to smooth the document language models using Wikipedia articles in order to enhance text clustering performance. The contents of Wikipedia articles as well as their assigned categories are used in three different ways to smooth the document language models with the goal of enriching the document contents. Clustering is then performed on a document similarity graph constructed on the enhanced document collection. Experiment results confirm the effectiveness of the proposed methods.