Clustering Documents Using a Wikipedia-Based Concept Representation

  • Authors:
  • Anna Huang;David Milne;Eibe Frank;Ian H. Witten

  • Affiliations:
  • Department of Computer Science, University of Waikato, New Zealand;Department of Computer Science, University of Waikato, New Zealand;Department of Computer Science, University of Waikato, New Zealand;Department of Computer Science, University of Waikato, New Zealand

  • Venue:
  • PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
  • Year:
  • 2009

Abstract

This paper shows how Wikipedia and the semantic knowledge it contains can be exploited for document clustering. We first create a concept-based document representation by mapping the terms and phrases within documents to their corresponding articles (or concepts) in Wikipedia. We then develop a similarity measure that evaluates the semantic relatedness between the concept sets of two documents. We test the concept-based representation and the similarity measure on two standard text document datasets. Empirical results show that, although further optimizations could be performed, our approach already improves upon related techniques.
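
The two steps named in the abstract (mapping phrases to Wikipedia concepts, then comparing the concept sets of two documents) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the phrase table `PHRASE_TO_CONCEPT`, the relatedness scores in `RELATEDNESS`, the substring matching in `map_to_concepts`, and the symmetric best-match average used in `concept_set_similarity`. The paper's actual mapping procedure and similarity measure may differ.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical toy lookup from surface phrases to Wikipedia article titles.
# A real system would derive this from Wikipedia itself.
PHRASE_TO_CONCEPT: Dict[str, str] = {
    "jaguar": "Jaguar",
    "big cat": "Felidae",
    "leopard": "Leopard",
    "stock market": "Stock market",
    "share price": "Share price",
}

# Hypothetical pairwise relatedness scores in [0, 1] (made-up values).
RELATEDNESS: Dict[FrozenSet[str], float] = {
    frozenset({"Jaguar", "Leopard"}): 0.8,
    frozenset({"Jaguar", "Felidae"}): 0.7,
    frozenset({"Leopard", "Felidae"}): 0.75,
    frozenset({"Stock market", "Share price"}): 0.9,
}


def map_to_concepts(text: str) -> Set[str]:
    """Return the concepts whose phrase occurs in the text (naive substring match)."""
    lowered = text.lower()
    return {c for phrase, c in PHRASE_TO_CONCEPT.items() if phrase in lowered}


def relatedness(a: str, b: str) -> float:
    """Look up the relatedness of two concepts; identical concepts score 1."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset({a, b}), 0.0)


def concept_set_similarity(xs: Set[str], ys: Set[str]) -> float:
    """Symmetric average of each concept's best match in the other set."""
    if not xs or not ys:
        return 0.0

    def directed(src: Set[str], dst: Set[str]) -> float:
        return sum(max(relatedness(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (directed(xs, ys) + directed(ys, xs))


doc_a = map_to_concepts("The jaguar is a big cat.")
doc_b = map_to_concepts("A leopard hunts at night.")
doc_c = map_to_concepts("The stock market lifted the share price.")

print(concept_set_similarity(doc_a, doc_b))
print(concept_set_similarity(doc_a, doc_c))
```

In this toy setup the two animal documents share no words, yet they score as similar because their concepts are related, while the finance document scores zero against both. A pairwise score of this kind could then be passed to any standard clustering algorithm.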