Clustering Documents Using a Wikipedia-Based Concept Representation

Authors:
Anna Huang; David Milne; Eibe Frank; Ian H. Witten
Affiliations:
Department of Computer Science, University of Waikato, New Zealand (all four authors)
Venue:
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Year:
2009

Abstract

This paper shows how Wikipedia and the semantic knowledge it contains can be exploited for document clustering. We first create a concept-based document representation by mapping the terms and phrases within documents to their corresponding articles (or concepts) in Wikipedia. We then develop a similarity measure that evaluates the semantic relatedness between the concept sets of two documents. We test the concept-based representation and the similarity measure on two standard text document datasets. Empirical results show that, although further optimizations could be performed, our approach already improves upon related techniques.
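
To make the two steps in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation. The phrase dictionary, the relatedness scores, and the function names (`to_concepts`, `relatedness`, `concept_set_similarity`) are all hypothetical stand-ins, and the set similarity shown (a symmetric average of best-match relatedness) is only one plausible way to compare two concept sets.

```python
# Hypothetical phrase-to-concept dictionary; a real system would derive this
# from Wikipedia (for example from article titles and link anchor text).
PHRASE_TO_CONCEPT = {
    "jaguar": "Jaguar (animal)",
    "big cat": "Felidae",
    "leopard": "Leopard",
    "stock market": "Stock market",
    "shares": "Share (finance)",
}

# Hypothetical symmetric relatedness scores in [0, 1] between concept pairs.
RELATEDNESS = {
    frozenset(["Jaguar (animal)", "Leopard"]): 0.8,
    frozenset(["Jaguar (animal)", "Felidae"]): 0.7,
    frozenset(["Leopard", "Felidae"]): 0.75,
    frozenset(["Stock market", "Share (finance)"]): 0.9,
}


def to_concepts(text, max_len=3):
    """Map a document to a set of concepts by greedy longest-phrase matching."""
    tokens = text.lower().split()
    concepts = set()
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in PHRASE_TO_CONCEPT:
                concepts.add(PHRASE_TO_CONCEPT[phrase])
                i += n
                break
        else:
            i += 1  # no phrase starts at this token
    return concepts


def relatedness(a, b):
    """Look up the (assumed) relatedness of two concepts; unknown pairs score 0."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset([a, b]), 0.0)


def concept_set_similarity(c1, c2):
    """Symmetric average of each concept's best match in the other set."""
    if not c1 or not c2:
        return 0.0

    def directed(src, dst):
        return sum(max(relatedness(a, b) for b in dst) for a in src) / len(src)

    return 0.5 * (directed(c1, c2) + directed(c2, c1))


doc_a = to_concepts("the jaguar is a big cat")
doc_b = to_concepts("a leopard hunts at night")
doc_c = to_concepts("shares fell on the stock market")

sim_ab = concept_set_similarity(doc_a, doc_b)
sim_ac = concept_set_similarity(doc_a, doc_c)
```

In this toy setup the two animal documents share no words, yet they score as similar because their concepts are related, while the finance document scores zero against both. A pairwise similarity of this kind could then be passed to any clustering algorithm that accepts a similarity or distance matrix.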