Probabilistic reasoning in intelligent systems: networks of plausible inference
Probabilistic reasoning in intelligent systems: networks of plausible inference
Term-weighting approaches in automatic text retrieval
Information Processing and Management: an International Journal
Models for retrieval with probabilistic indexing
Information Processing and Management: an International Journal - Modeling data, information and knowledge
SCISOR: extracting information from on-line news
Communications of the ACM
Using syntactic analysis in a document retrieval system that uses signature files
SIGIR '90 Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval
Evaluating text categorization
HLT '91 Proceedings of the workshop on Speech and Natural Language
An evaluation of phrasal and clustered representations on a text categorization task
SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Representation and learning in information retrieval
Representation and learning in information retrieval
Towards language independent automated learning of text categorization models
SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Information Retrieval
Introduction to Modern Information Retrieval
Introduction to Modern Information Retrieval
A stochastic parts program and noun phrase parser for unrestricted text
ANLC '88 Proceedings of the second conference on Applied natural language processing
Noun classification from predicate-argument structures
ACL '90 Proceedings of the 28th annual meeting on Association for Computational Linguistics
Parsing, word associations and typical predicate-argument relations
HLT '89 Proceedings of the workshop on Speech and Natural Language
Feature selection and feature extraction for text categorization
HLT '91 Proceedings of the workshop on Speech and Natural Language
The importance of proper weighting methods
HLT '93 Proceedings of the workshop on Human Language Technology
Partial orders for document representation: a new methodology for combining document features
SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
Mining Text Using Keyword Distributions
Journal of Intelligent Information Systems
Hi-index | 0.00 |
The use of NLP techniques for document classification has not produced significant improvements in performance within the standard term weighting statistical assignment paradigm (Fagan 1987; Lewis, 1992bc; Buckley, 1993). This perplexing fact needs both an explanation and a solution if the power of recently developed NLP techniques are to be successfully applied in IR. A novel method for adding linguistic annotation to corpora is presented which involves using a statistical POS tagger in conjunction with unsupervised structure finding methods to derive notions of "noun group", "verb group", and so on which is inherently extensible to more sophisticated annotation, and does not require a pre-tagged corpus to fit. One of the distinguishing features of a more linguistically sophisticated representation of documents over a word set based representation of them is that linguistically sophisticated units are more frequently individually good predictors of document descriptors (keywords) than single words are. This leads us to consider the assignment of descriptors from individual phrases rather than from the weighted sum of a word set representation. We investigate how sets of individually high-precision rules can result in a low precision when used together, and develop some theory about these probably-correct rules. We then proceed to repeat results which show that standard statistical models are not particularly suitable for exploiting linguistically sophisticated representations, and show that a statistically fitted rule-based model provides significantly improved performance for sophisticated representations. It therefore shows that statistical systems can exploit sophisticated representations of documents, and lends some support to the use of more linguistically sophisticated representations for document classification. This paper reports on work done for the LRE project SISTA, which is creating a PC based tool to be used in the technical abstracting industry.