Using a Wikipedia-based semantic relatedness measure for document clustering

  • Authors:
  • Majid Yazdani;Andrei Popescu-Belis

  • Affiliations:
  • Idiap Research Institute and EPFL, Rue Marconi, Martigny, Switzerland;Idiap Research Institute, Rue Marconi, Martigny, Switzerland

  • Venue:
  • Proceedings of TextGraphs-6: Graph-based Methods for Natural Language Processing
  • Year:
  • 2011

Abstract

A graph-based distance between Wikipedia articles is defined using a random walk model, which estimates the visiting probability (VP) between articles using two types of links: hyperlinks and lexical similarity relations. The VP to and from a set of articles is then computed, and approximations are proposed to make the computation of semantic relatedness between every pair of texts in a large data set tractable. The model is applied to document clustering on the 20 Newsgroups data set, where precision and recall improve in comparison with previous textual distance algorithms.
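
To make the idea of a visiting probability concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formulation, so the sketch takes VP to mean the probability that a random walk starting at one article reaches a target article within a fixed number of steps. It also assumes the two link types are combined by a weighted sum of their transition matrices. The names `transition_matrix`, `mix_links`, `visiting_probability`, and the parameters `alpha` and `steps` are all hypothetical.

```python
import numpy as np


def transition_matrix(weights):
    """Row-normalise a non-negative weight matrix into transition probabilities."""
    weights = np.asarray(weights, dtype=float)
    row_sums = weights.sum(axis=1, keepdims=True)
    # Rows without outgoing weight stay all-zero: the walk simply stops there.
    return np.divide(weights, row_sums,
                     out=np.zeros_like(weights), where=row_sums > 0)


def mix_links(hyperlink_weights, lexical_weights, alpha):
    """Combine two link types (assumed scheme: convex mix with weight `alpha`)."""
    p_hyper = transition_matrix(hyperlink_weights)
    p_lex = transition_matrix(lexical_weights)
    return alpha * p_hyper + (1.0 - alpha) * p_lex


def visiting_probability(P, target, steps):
    """Probability, for every start node, of reaching `target` within `steps` steps."""
    vp = np.zeros(P.shape[0])
    vp[target] = 1.0
    for _ in range(steps):
        # One more step of the walk: average the neighbours' probabilities.
        vp = P @ vp
        # The target is treated as absorbing: once visited, the event has happened.
        vp[target] = 1.0
    return vp


# Toy graph with three articles (made-up weights, for illustration only).
hyperlinks = [[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]]
lexical = [[0.0, 0.5, 0.5],
           [0.5, 0.0, 0.5],
           [0.5, 0.5, 0.0]]

P = mix_links(hyperlinks, lexical, alpha=0.5)
vp_to_article_2 = visiting_probability(P, target=2, steps=10)
```

Truncating the walk at a fixed number of steps keeps the cost at one matrix-vector product per step, which is one plausible reading of why approximations matter when every pair of texts in a large collection must be compared.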