Capturing expression using linguistic information

  • Authors: Özlem Uzuner; Boris Katz
  • Affiliations: Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA (both authors)
  • Venue: AAAI'05 Proceedings of the 20th national conference on Artificial intelligence - Volume 3
  • Year: 2005

Abstract

Recognizing similarities between literary works for copyright infringement detection requires evaluating similarity in the expression of content. Copyright law protects the expression of content; similarities in content alone are not enough to indicate infringement. Expression refers to the way people convey particular information; it captures both the information and the manner of its presentation. In this paper, we present a novel set of linguistically informed features that provide a computational definition of expression and that enable accurate recognition of individual titles and their paraphrases more than 80% of the time. In comparison, baseline features such as tf-idf-weighted keywords and function words give an accuracy of at most 53%. Our computational definition of expression uses linguistic features that are extracted from POS-tagged text using context-free grammars, without incurring the computational cost of full parsers. The results indicate that informative linguistic features do not have to be prohibitively expensive to extract.
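
As a rough illustration of extracting shallow linguistic features from POS-tagged text without a full parser, the sketch below counts simple phrase patterns over a sequence of tags. It is not the authors' feature set or grammar: the Penn Treebank-style tags, the two pattern names, and the function `phrase_counts` are all assumptions made for this example, and regular expressions over tags stand in for the context-free grammar rules the abstract mentions.

```python
import re
from collections import Counter

# Hypothetical patterns over Penn Treebank-style tags (an assumption;
# the abstract does not list the actual features or grammar rules).
# Each tag in the encoded string is followed by a single space.
PATTERNS = {
    # optional determiner, any adjectives, one or more common nouns
    "noun_phrase": re.compile(r"\b(?:DT )?(?:JJ )*(?:NNS? )+"),
    # preposition followed by the same noun-phrase shape
    "prepositional_phrase": re.compile(r"\bIN (?:DT )?(?:JJ )*(?:NNS? )+"),
}


def phrase_counts(tagged_tokens):
    """Count non-overlapping pattern matches in a list of (word, tag) pairs."""
    # Encode the tag sequence as a string so patterns can run over it.
    tag_string = "".join(tag + " " for _, tag in tagged_tokens)
    counts = Counter()
    for name, pattern in PATTERNS.items():
        counts[name] = len(pattern.findall(tag_string))
    return counts


if __name__ == "__main__":
    sentence = [
        ("the", "DT"), ("old", "JJ"), ("man", "NN"), ("sat", "VBD"),
        ("on", "IN"), ("the", "DT"), ("bench", "NN"),
    ]
    print(phrase_counts(sentence))
```

Counts like these could be normalized per sentence and used as a feature vector for a classifier. The point of the sketch is only that tag-level patterns are cheap to match compared with running a full syntactic parser.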