Exploiting Wikipedia as external knowledge for document clustering

Authors:
Xiaohua Hu;Xiaodan Zhang;Caimei Lu;E. K. Park;Xiaohua Zhou
Affiliations:
Drexel University, Philadelphia, PA, USA;Drexel University, Philadelphia, PA, USA;Drexel University, Philadelphia, PA, USA;University of Missouri at Kansas City, Kansas City, MO, USA;Drexel University, Philadelphia, PA, USA
Venue:
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2009

Abstract

In traditional text clustering methods, documents are represented as "bags of words" without considering the semantic information of each document. For instance, if two documents use different sets of core words to describe the same topic, they may be wrongly assigned to different clusters because they share no core words, even though those words are probably synonyms or otherwise semantically related. The most common way to solve this problem is to enrich the document representation with background knowledge from an ontology. This approach has two major issues: (1) the coverage of the ontology is limited, even for WordNet or MeSH; and (2) using ontology terms as replacement or additional features may cause information loss or introduce noise. In this paper, we present a novel text clustering method that addresses these two issues by enriching the document representation with Wikipedia concept and category information. We develop two approaches, exact-match and relatedness-match, to map text documents to Wikipedia concepts, and further to Wikipedia categories. The text documents are then clustered using a similarity metric that combines document content information, concept information, and category information. Experimental results with the proposed clustering framework on three datasets (20-newsgroup, TDT2, and LA Times) show that clustering performance improves significantly when the document representation is enriched with Wikipedia concepts and categories.
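The idea of a similarity metric that combines content, concept, and category information can be illustrated with a minimal Python sketch. It is not the paper's implementation. The lookup tables `PHRASE_TO_CONCEPT` and `CONCEPT_TO_CATEGORIES` are hypothetical toy stand-ins for a Wikipedia index, the mapping step is a simple exact match on one- and two-word phrases, and the weighted sum with weights `alpha` and `beta` is an assumed form of the combination, since the abstract does not give the formula.

```python
import math
from collections import Counter

# Hypothetical toy stand-ins for a Wikipedia index (not real Wikipedia data).
PHRASE_TO_CONCEPT = {
    "puma": "Cougar",
    "mountain lion": "Cougar",
    "habitat": "Habitat",
}
CONCEPT_TO_CATEGORIES = {
    "Cougar": ["Felines"],
    "Habitat": ["Ecology"],
}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like counters."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def represent(text):
    """Return (word, concept, category) count vectors for one document."""
    tokens = text.lower().split()
    words = Counter(tokens)

    # Exact match: look up every unigram and bigram in the phrase table.
    concepts = Counter()
    for n in (1, 2):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in PHRASE_TO_CONCEPT:
                concepts[PHRASE_TO_CONCEPT[phrase]] += 1

    # Each matched concept contributes its count to its categories.
    categories = Counter()
    for concept, count in concepts.items():
        for category in CONCEPT_TO_CATEGORIES.get(concept, []):
            categories[category] += count

    return words, concepts, categories


def combined_similarity(a, b, alpha=0.5, beta=0.5):
    """Assumed combination: word similarity plus weighted concept and category similarity."""
    return (
        cosine(a[0], b[0])
        + alpha * cosine(a[1], b[1])
        + beta * cosine(a[2], b[2])
    )


doc1 = represent("the puma lost habitat")
doc2 = represent("a mountain lion needs habitat")
print(cosine(doc1[0], doc2[0]))         # word-only similarity
print(combined_similarity(doc1, doc2))  # similarity after enrichment
```

In this toy case the two documents share only the word "habitat", so their bag-of-words similarity is low, but both map to the same concepts and categories, which raises the combined score. This is the synonym problem described in the abstract; the resulting pairwise similarities could then be passed to any standard clustering algorithm.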