Phrase-based document categorization revisited

  • Authors:
  • Cornelis H.A. Koster; Jean G. Beney

  • Affiliations:
  • University of Nijmegen, Nijmegen, Netherlands; INSA de Lyon, Lyon, France

  • Venue:
  • Proceedings of the 2nd international workshop on Patent information retrieval
  • Year:
  • 2009

Abstract

This paper takes a fresh look at an old idea in Information Retrieval: the use of linguistically extracted phrases as terms in the automatic categorization (also known as classification) of documents. Until now, little or no evidence has been found that document categorization benefits from the application of linguistic techniques. Classification algorithms using even the most cleverly designed linguistic representations typically do no better than those using the simple bag-of-words representation. Shallow linguistic techniques are used routinely, but their positive effect on accuracy is small at best. We have investigated the use of dependency triples as terms in document categorization, derived according to a dependency model based on the notion of aboutness. The documents are syntactically analyzed by a parser and transduced to dependency trees, which are in turn unnested into dependency triples following the aboutness-based model. In the process, various normalizing transformations are applied to enhance recall. We describe a sequence of large-scale experiments with different document representations, test collections and even languages, presenting evidence that adding such triples to the words in a bag-of-terms document representation may lead to a significant increase in the accuracy of document categorization.
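To illustrate the pipeline the abstract describes, the following is a minimal sketch (not the authors' implementation) of unnesting a dependency tree into (head, relation, dependent) triples and merging them with the plain bag-of-words into a single bag-of-terms. The tree encoding, relation labels, and the `word:REL:word` term format are assumptions for illustration; the aboutness-based normalizing transformations are omitted.

```python
# Sketch: dependency tree -> triples -> bag-of-terms (hypothetical encoding).
from collections import Counter

def unnest(tree):
    """Recursively flatten a nested dependency tree into triples.

    `tree` is assumed to be (head_word, [(relation, subtree), ...]).
    """
    head, children = tree
    triples = []
    for rel, sub in children:
        dep_head, _ = sub
        triples.append((head, rel, dep_head))
        triples.extend(unnest(sub))  # descend into the dependent's subtree
    return triples

def bag_of_terms(words, tree):
    """Bag-of-words augmented with dependency triples as extra terms."""
    bag = Counter(words)
    for h, rel, d in unnest(tree):
        bag[f"{h}:{rel}:{d}"] += 1
    return bag

# Toy sentence: "the parser analyzes documents"
tree = ("analyzes", [("SUBJ", ("parser", [("DET", ("the", []))])),
                     ("OBJ", ("documents", []))])
words = ["the", "parser", "analyzes", "documents"]
bag = bag_of_terms(words, tree)
```

In this toy example the bag contains both the four word terms and three triple terms such as `analyzes:SUBJ:parser`, matching the paper's idea of adding triples to, rather than replacing, the word terms.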