Enhancing text clustering by leveraging Wikipedia semantics

Authors:
Jian Hu;Lujun Fang;Yang Cao;Hua-Jun Zeng;Hua Li;Qiang Yang;Zheng Chen
Affiliations:
Microsoft Research Asia, Beijing, China;Fudan University, Shanghai, China;Shanghai Jiao Tong University, Shanghai, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Hong Kong University of Science & Technology, Hong Kong, Hong Kong;Microsoft Research Asia, Beijing, China
Venue:
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2008

Abstract

Most traditional text clustering methods rely on the "bag of words" (BOW) representation, which is built from term frequency statistics over a set of documents. BOW, however, ignores the semantic relationships between key terms. To overcome this problem, several methods have been proposed to enrich text representation with external resources such as WordNet. However, many of these approaches have limitations: 1) WordNet has limited coverage and lacks effective word-sense disambiguation; 2) most enrichment strategies, which append hypernyms and synonyms to document terms or replace the terms with them, are overly simple. In this paper, to overcome these deficiencies, we first propose a way to build a concept thesaurus from the semantic relations (synonym, hypernym, and associative relation) extracted from Wikipedia. We then develop a unified framework that leverages these semantic relations to enhance the traditional content similarity measure for text clustering. Experimental results on the Reuters and OHSUMED datasets show that, with the help of the Wikipedia thesaurus, our method improves clustering performance over previous methods. In addition, clustering performance can be further improved when the weights for hypernym, synonym, and associative concepts are optimized using a small amount of user-provided labeled data.
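
The idea of enriching a content similarity measure with weighted semantic relations can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything here is an assumption made for illustration: the linear combination of cosine similarities, the toy `THESAURUS` dictionary (a hand-written stand-in for a thesaurus mined from Wikipedia), the mapping of synonyms to a single canonical concept, and the function names `concept_vector` and `enriched_similarity`.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy thesaurus (NOT from the paper): each term maps to the
# concepts it is linked to under each relation type. Synonymous terms are
# assumed to share one canonical synonym concept.
THESAURUS = {
    "car": {"synonym": ["automobile"], "hypernym": ["vehicle"],
            "associative": ["engine"]},
    "automobile": {"synonym": ["automobile"], "hypernym": ["vehicle"],
                   "associative": ["engine"]},
}


def concept_vector(tokens, relation):
    """Count the concepts reached from the tokens via one relation type."""
    counts = Counter()
    for token in tokens:
        for concept in THESAURUS.get(token, {}).get(relation, []):
            counts[concept] += 1
    return counts


def enriched_similarity(doc_a, doc_b, weights):
    """BOW cosine plus a weighted cosine for each relation type (assumed form)."""
    tokens_a = doc_a.lower().split()
    tokens_b = doc_b.lower().split()
    sim = cosine(Counter(tokens_a), Counter(tokens_b))
    for relation, weight in weights.items():
        sim += weight * cosine(concept_vector(tokens_a, relation),
                               concept_vector(tokens_b, relation))
    return sim


# Illustrative weights only; the abstract says such weights can be tuned
# with a small amount of labeled data.
weights = {"synonym": 0.5, "hypernym": 0.3, "associative": 0.2}
bow_only = enriched_similarity("car dealer", "automobile seller", {})
enriched = enriched_similarity("car dealer", "automobile seller", weights)
```

In this toy case the two documents share no surface terms, so the plain BOW similarity is zero, while the enriched score is positive because both documents reach the same concepts through the thesaurus. In practice the weights would be chosen on held-out labeled documents rather than fixed by hand.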