Contextual feature selection for text classification

Authors:
Francois Paradis;Jian-Yun Nie
Affiliations:
DIRO, Université de Montréal, Montreal, Que., Canada;DIRO, Université de Montréal, Montreal, Que., Canada
Venue:
Information Processing and Management: an International Journal - Special issue: AIRS2005: Information retrieval research in Asia
Year:
2007

Abstract

We present a simple approach for the classification of "noisy" documents using bigrams and named entities. The approach combines conventional feature selection with a contextual approach to filter out passages around selected features. Originally designed for call-for-tender documents, the method can be useful for other web collections that also contain non-topical content. Experiments are conducted on our in-house collection as well as on the 4-Universities data set, Reuters-21578 and 20 Newsgroups. We find a significant improvement on our collection and the 4-Universities data set (10.9% and 4.1%, respectively). Although the best results are obtained by combining bigrams and named entities, the impact of the latter is not found to be significant.
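
The two-stage idea in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. Several things are assumptions made only for illustration: chi-square stands in for the unspecified "conventional feature selection", only bigrams more frequent in the target class are scored, the contextual step is read as keeping a fixed token window around each selected bigram and discarding the rest, and the names `select_bigrams`, `keep_context`, `k` and `window` are hypothetical. Named entities are left out.

```python
from collections import Counter


def bigrams(tokens):
    """Return the adjacent word pairs of a token list."""
    return list(zip(tokens, tokens[1:]))


def chi_square(a, b, c, d):
    """Chi-square score from a 2x2 feature/class contingency table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_bigrams(docs, labels, target, k):
    """Pick the k highest-scoring bigrams for `target` (hypothetical helper).

    Assumption: chi-square is used as a stand-in selection measure, and only
    bigrams relatively more frequent in the target class are considered.
    """
    in_class, out_class = Counter(), Counter()
    n_in = sum(1 for y in labels if y == target)
    n_out = len(labels) - n_in
    for tokens, y in zip(docs, labels):
        # Count each bigram once per document (document frequency).
        for bg in set(bigrams(tokens)):
            (in_class if y == target else out_class)[bg] += 1
    scores = {}
    for bg in in_class:
        a, b = in_class[bg], out_class[bg]
        if a * n_out > b * n_in:
            scores[bg] = chi_square(a, b, n_in - a, n_out - b)
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda bg: (-scores[bg], bg))
    return set(ranked[:k])


def keep_context(tokens, selected, window):
    """Keep only tokens within `window` positions of a selected bigram.

    Assumption: passages around selected features are retained and
    everything else is treated as non-topical noise.
    """
    keep = [False] * len(tokens)
    for i, bg in enumerate(bigrams(tokens)):
        if bg in selected:
            start = max(0, i - window)
            end = min(len(tokens), i + 2 + window)
            for j in range(start, end):
                keep[j] = True
    return [t for t, flag in zip(tokens, keep) if flag]


# Toy documents mixing topical text with page boilerplate (invented data).
docs = [
    "contact us by email supply of office furniture for the ministry".split(),
    "supply of office chairs and desks see our privacy policy".split(),
    "road construction works contact us by email".split(),
    "bridge repair and road construction see our privacy policy".split(),
]
labels = ["furniture", "furniture", "construction", "construction"]

selected = select_bigrams(docs, labels, "furniture", k=2)
filtered = [keep_context(d, selected, window=1) for d in docs]
```

In this toy setup, boilerplate such as "contact us by email" appears in both classes, so it is not selected and falls outside the retained windows. A real system would pass the filtered text to a classifier and would need a policy for documents that contain no selected feature.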