Phrase-based document categorization revisited

  • Authors:
  • Cornelis H.A. Koster; Jean G. Beney

  • Affiliations:
  • University of Nijmegen, Nijmegen, Netherlands; INSA de Lyon, Lyon, France

  • Venue:
  • Proceedings of the 2nd international workshop on Patent information retrieval
  • Year:
  • 2009

Abstract

This paper takes a fresh look at an old idea in Information Retrieval: the use of linguistically extracted phrases as terms in the automatic categorization (also known as classification) of documents. Until now, little or no evidence has been found that document categorization benefits from the application of linguistic techniques. Classification algorithms using even the most cleverly designed linguistic representations typically do no better than those using the simple bag-of-words representation. Shallow linguistic techniques are used routinely, but their positive effect on accuracy is small at best. We have investigated the use of dependency triples as terms in document categorization, derived according to a dependency model based on the notion of aboutness. The documents are syntactically analyzed by a parser and transduced to dependency trees, which are in turn unnested into dependency triples following the aboutness-based model. In the process, various normalizing transformations are applied to enhance recall. We describe a sequence of large-scale experiments with different document representations, test collections and even languages, presenting evidence that adding such triples to the words in a bag-of-terms document representation may lead to a significant increase in the accuracy of document categorization.
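To illustrate the pipeline the abstract describes, the following is a minimal sketch (not the authors' implementation) of unnesting a dependency tree into (head, relation, dependent) triples and merging them with the plain bag-of-words into a single bag-of-terms. The tree encoding, relation labels, and the `word:REL:word` term format are assumptions for illustration; the aboutness-based normalizing transformations are omitted.

```python
# Sketch: dependency tree -> triples -> bag-of-terms (hypothetical encoding).
from collections import Counter

def unnest(tree):
    """Recursively flatten a nested dependency tree into triples.

    `tree` is assumed to be (head_word, [(relation, subtree), ...]).
    """
    head, children = tree
    triples = []
    for rel, sub in children:
        dep_head, _ = sub
        triples.append((head, rel, dep_head))
        triples.extend(unnest(sub))  # descend into the dependent's subtree
    return triples

def bag_of_terms(words, tree):
    """Bag-of-words augmented with dependency triples as extra terms."""
    bag = Counter(words)
    for h, rel, d in unnest(tree):
        bag[f"{h}:{rel}:{d}"] += 1
    return bag

# Toy sentence: "the parser analyzes documents"
tree = ("analyzes", [("SUBJ", ("parser", [("DET", ("the", []))])),
                     ("OBJ", ("documents", []))])
words = ["the", "parser", "analyzes", "documents"]
bag = bag_of_terms(words, tree)
```

In this toy example the bag contains both the four word terms and three triple terms such as `analyzes:SUBJ:parser`, matching the paper's idea of adding triples to, rather than replacing, the word terms.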