A measure based on optimal matching in graph theory for document similarity

  • Authors: Xiaojun Wan; Yuxin Peng

  • Affiliations: Institute of Computer Science and Technology, Peking University, Beijing, China (both authors)

  • Venue: AIRS'04 Proceedings of the 2004 international conference on Asian Information Retrieval Technology
  • Year: 2004

Abstract

Measuring pairwise document similarity is critical to various text retrieval and mining tasks. The most popular measure for document similarity is the Cosine measure in the Vector Space Model. In this paper, we propose a new similarity measure based on optimal matching in graph theory. The proposed measure takes into account the structural information of a document by considering the word distributions over different text segments. It first calculates the similarities for different pairs of text segments in the two documents and then derives the overall document similarity by finding an optimal matching between the segments. We set up document similarity search experiments to test the effectiveness of the proposed measure. The experimental results and a user study demonstrate that the proposed measure outperforms the Cosine measure.
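
The two steps described in the abstract (pairwise segment similarities, then an optimal matching over them) can be illustrated with a minimal sketch. This is not the authors' implementation. Several details are assumptions made only for illustration: segments are taken to be blank-line-separated paragraphs, segment similarity is plain term-frequency cosine, the matching is found by brute-force enumeration (practical only for a handful of segments, where a real system would use an assignment algorithm such as the Hungarian method), and the total is normalized by the larger segment count. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment_vectors(doc):
    # Assumption: a "text segment" is a blank-line-separated paragraph.
    return [Counter(p.lower().split()) for p in doc.split("\n\n") if p.strip()]


def optimal_matching_similarity(doc_x, doc_y):
    """Document similarity via a maximum-weight matching of segments."""
    xs, ys = segment_vectors(doc_x), segment_vectors(doc_y)
    if not xs or not ys:
        return 0.0
    if len(xs) > len(ys):
        xs, ys = ys, xs  # make xs the side with fewer segments

    # Step 1: similarity for every pair of segments (bipartite edge weights).
    weights = [[cosine(x, y) for y in ys] for x in xs]

    # Step 2: brute-force search for the one-to-one assignment of xs to ys
    # with the largest total weight (illustrative only; exponential cost).
    best = max(
        sum(weights[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(len(ys)), len(xs))
    )

    # Assumption: normalize by the larger segment count so that
    # unmatched segments lower the score.
    return best / len(ys)


def whole_document_cosine(doc_x, doc_y):
    """Baseline: cosine over the whole documents, ignoring segments."""
    return cosine(Counter(doc_x.lower().split()), Counter(doc_y.lower().split()))
```

For example, "apple banana\n\ncat dog" and "apple cat\n\nbanana dog" contain exactly the same words, so the whole-document cosine is about 1.0, while the segment-based score in this sketch is about 0.5 because no pair of segments shares more than one word. That difference is the kind of structural information the abstract refers to.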