Document similarity based on concept tree distance

  • Authors: Praveen Lakkaraju; Susan Gauch; Mirco Speretta

  • Affiliations: University of Kansas, Lawrence, KS, USA; University of Arkansas, Fayetteville, AR, USA; University of Arkansas, Fayetteville, AR, USA

  • Venue: Proceedings of the nineteenth ACM conference on Hypertext and hypermedia
  • Year: 2008

Abstract

The Web is quickly moving from the era of search engines to the era of discovery engines. Whereas search engines help you find information you are looking for, discovery engines help you find things that you never knew existed. A common discovery technique is to automatically identify and display objects similar to ones previously viewed by the user. Core to this approach is an accurate method to identify similar documents. In this paper, we present a new approach to identifying similar documents based on a conceptual tree-similarity measure. We represent each document as a concept tree using the concept associations obtained from a classifier. Then, we employ a tree-similarity measure based on a tree edit distance to compute similarities between concept trees. Experiments on documents from the CiteSeer collection showed that our algorithm performed significantly better than document similarity based on the traditional vector space model.
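
The abstract does not specify the exact tree edit distance or its cost model, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes a classifier has already produced weighted concept paths for each document, builds a concept tree from them, and compares two trees with a simplified, label-aligned edit distance (unmatched subtrees are deleted or inserted at the cost of their total weight, matched nodes pay the difference in weight). All names here (`ConceptNode`, `build_tree`, `tree_distance`, `tree_similarity`), the sample concept paths, and the weights are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Sequence, Tuple


@dataclass
class ConceptNode:
    """One node of a concept tree: a label, a weight, and child nodes by label."""
    label: str
    weight: float = 0.0
    children: Dict[str, "ConceptNode"] = field(default_factory=dict)


def build_tree(associations: Iterable[Tuple[Sequence[str], float]]) -> ConceptNode:
    """Build a concept tree from (concept path, weight) pairs.

    Assumption: each path lists labels from the top of a concept hierarchy
    down to the concept assigned by the classifier. The weight is added to
    every node on the path, so ancestors aggregate their descendants.
    The artificial root keeps weight 0.
    """
    root = ConceptNode("ROOT")
    for path, weight in associations:
        node = root
        for label in path:
            node = node.children.setdefault(label, ConceptNode(label))
            node.weight += weight
    return root


def subtree_cost(node: ConceptNode) -> float:
    """Cost of inserting or deleting a whole subtree: the sum of its weights."""
    return node.weight + sum(subtree_cost(c) for c in node.children.values())


def tree_distance(a: ConceptNode, b: ConceptNode) -> float:
    """Simplified edit distance between two nodes that carry the same label.

    This is NOT a general ordered tree edit distance; children are aligned
    by label only, which is an assumption made to keep the sketch short.
    """
    cost = abs(a.weight - b.weight)
    for label in set(a.children) | set(b.children):
        child_a = a.children.get(label)
        child_b = b.children.get(label)
        if child_a is None:
            cost += subtree_cost(child_b)   # insert subtree missing from a
        elif child_b is None:
            cost += subtree_cost(child_a)   # delete subtree missing from b
        else:
            cost += tree_distance(child_a, child_b)
    return cost


def tree_similarity(a: ConceptNode, b: ConceptNode) -> float:
    """Map the distance to [0, 1], assuming non-negative weights."""
    total = subtree_cost(a) + subtree_cost(b)
    return 1.0 if total == 0 else 1.0 - tree_distance(a, b) / total


# Hypothetical classifier output for three documents.
doc_a = build_tree([(("Computer Science", "Information Retrieval"), 0.6),
                    (("Computer Science", "Artificial Intelligence"), 0.4)])
doc_b = build_tree([(("Computer Science", "Information Retrieval"), 0.5),
                    (("Computer Science", "Databases"), 0.5)])
doc_c = build_tree([(("Biology", "Genetics"), 1.0)])

print(tree_similarity(doc_a, doc_b))  # shares a branch with doc_a
print(tree_similarity(doc_a, doc_c))  # shares no concepts with doc_a
```

In this sketch, documents that share a parent concept but differ in the leaf concepts still receive partial credit through the shared ancestor, which is one way a tree-based measure can differ from a flat vector space comparison over the leaf concepts alone.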