NLP-driven constructive learning for filtering an IR document stream

Authors:
João Marcelo Azevedo Arcoverde;Maria Das Graças Volpe Nunes
Affiliations:
Departamento de Ciências de Computação, Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, SP, Brasil;Departamento de Ciências de Computação, Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, SP, Brasil
Venue:
CLEF'06 Proceedings of the 7th international conference on Cross-Language Evaluation Forum: evaluation of multilingual and multi-modal information retrieval
Year:
2006

Abstract

Feature engineering is one of the most important challenges in knowledge acquisition, since any inductive learning system depends on an efficient representation model to find good solutions to a given problem. We present an NLP-driven constructive learning method for building features based on noun phrase structures, which are assumed to carry the most discriminative information. The method was tested in the CLEF 2006 Ad-Hoc monolingual (Portuguese) IR track. A classification model was trained with this representation scheme on a small subset of the relevance judgments and used to filter out false-positive documents returned by the IR system. The goal was to increase overall precision. The experiment achieved an average MAP gain of 41.3% over three selected topics. The best F1-measure for the text classification task with the proposed text representation model was 77.1%. The results suggest that relevant linguistic features can be extracted by NLP techniques in a domain-specific application and used successfully in text categorization, which can serve as an important supporting process for other high-level IR tasks.
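
To illustrate why filtering false positives can raise MAP, the following minimal sketch computes average precision for a single topic before and after a classifier removes documents from a ranked list. The ranked list, the classifier decisions, and the function names are hypothetical and are not taken from the experiment; the sketch only assumes the standard definition of average precision.

```python
def average_precision(ranked_relevance, total_relevant):
    """Average precision for one topic.

    ranked_relevance: list of booleans in rank order (True = judged relevant).
    total_relevant: number of relevant documents according to the judgments.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / total_relevant if total_relevant else 0.0


def filter_ranking(ranked_relevance, keep_flags):
    """Drop documents rejected by the (hypothetical) classifier, keeping rank order."""
    return [rel for rel, keep in zip(ranked_relevance, keep_flags) if keep]


# Toy ranked list returned by an IR system (hypothetical data).
ranking = [True, False, False, True, False, True]
# Hypothetical classifier output: it rejects two of the three non-relevant documents.
keep = [True, False, True, True, False, True]

# The denominator stays fixed at the judged total, so a classifier that
# wrongly discards a relevant document is penalized rather than rewarded.
total_relevant = sum(ranking)

ap_before = average_precision(ranking, total_relevant)
ap_after = average_precision(filter_ranking(ranking, keep), total_relevant)
print(f"AP before filtering: {ap_before:.3f}")
print(f"AP after filtering:  {ap_after:.3f}")
```

MAP is the mean of this per-topic value over all topics, so removing non-relevant documents ranked above relevant ones raises it, while discarding relevant documents lowers it.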