Using Empirical Methods for Evaluating Expression and Content Similarity

  • Authors:
  • Ozlem Uzuner; Randall Davis; Boris Katz

  • Venue:
  • HICSS '04 Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS'04) - Track 4 - Volume 4
  • Year:
  • 2004

Abstract

Despite lack of any significant quantifiable similarities between documents, people can intuitively compare documents and evaluate their similarity. To understand how people evaluate text similarity, we queried subjects about the level of content similarity and expression similarity of pairs of documents. Using these judgments on similarity as ground truth, we automated evaluation of similarity.

Our main application for automatic evaluation of text similarity is copyright infringement detection. United States copyright law protects expression but not any underlying facts and ideas being expressed. Therefore, we focus on recognizing similarity of expression. We envision a scenario where authors present the system with a document and the system replies with documents that share the same expressive characteristics.

We hypothesize that, since content and expression are not independent of each other, accurate recognition of expression similarity will also help recognition of content similarity.

The experiments presented in this paper evaluate two sets of features, unigrams and style features, with respect to their ability to recognize similarities in content and expression of documents using the ground truth obtained from user experiments. Our results show that, on our data set of short news articles, stylistic features predict similarity of expression more accurately than tf*idf weighted unigrams. While unigrams can identify high-level content similarities between documents about the same people, topic and events, they are less effective than style features in evaluating finer grained content similarities.
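
For readers unfamiliar with the unigram baseline named in the abstract, the following is a minimal sketch of one common way to compare documents with tf*idf weighted unigrams and cosine similarity. The abstract does not give the authors' tokenization, weighting variant, or similarity measure, so everything here (the function names `tokenize`, `tfidf_vectors`, and `cosine`, the smoothed idf formula, and the choice of cosine similarity) is an assumption made for illustration, not the method used in the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens with surrounding punctuation stripped.
    tokens = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [t for t in tokens if t]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    # Assumption: smoothed idf, so terms that occur in every document
    # still receive a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    docs = [
        "The senator visited the city on Monday.",
        "On Monday the senator visited the city.",
        "Stock prices fell sharply in early trading.",
    ]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]))  # same words, different order
    print(cosine(vecs[0], vecs[2]))  # no words in common
```

In this toy example the first two documents contain the same words in a different order, so a unigram representation cannot tell them apart, while the third shares no vocabulary with the first. This illustrates why bag-of-words features are insensitive to how content is phrased; it says nothing about the style features evaluated in the paper, which the abstract does not enumerate.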