Filtering contents with bigrams and named entities to improve text classification

Authors:
François Paradis;Jian-Yun Nie
Affiliations:
Université de Montréal, Canada;Université de Montréal, Canada
Venue:
AIRS'05 Proceedings of the Second Asia conference on Asia Information Retrieval Technology
Year:
2005

Abstract

We present a new method for classifying “noisy” documents, based on filtering their contents with bigrams and named entities. The method is applied to call-for-tender documents, but we claim it would be useful for many other Web collections that also contain non-topical content. We discuss several variations of the method. We obtain the best results by filtering out a window around the least relevant bigrams. We find a significant increase in the micro-F1 measure on our collection of calls for tenders, as well as on the “4-Universities” collection. A second approach, which rejects sentences based on the presence of certain named entities, also shows a moderate increase. Finally, we try combining the two approaches, but the results are so far inconclusive.
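
The idea of filtering out a window around the least relevant bigrams can be illustrated with a minimal sketch. The abstract does not say how bigram relevance is scored, how large the window is, or how unseen bigrams are handled, so everything below is an assumption made for illustration: the function name `filter_low_relevance_windows`, the precomputed `bigram_scores` dictionary, the `threshold` cutoff, and the `window` size are all hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple

Bigram = Tuple[str, str]


def filter_low_relevance_windows(
    tokens: List[str],
    bigram_scores: Dict[Bigram, float],
    threshold: float,
    window: int = 2,
) -> List[str]:
    """Drop every token within `window` positions of a low-scoring bigram.

    Hypothetical sketch: `bigram_scores` is assumed to be computed elsewhere
    (the scoring method is not described in the abstract).
    """
    drop = [False] * len(tokens)
    for i in range(len(tokens) - 1):
        score = bigram_scores.get((tokens[i], tokens[i + 1]))
        # Assumption: bigrams with no known score never trigger filtering.
        if score is not None and score < threshold:
            start = max(0, i - window)
            end = min(len(tokens), i + 2 + window)  # bigram covers i and i+1
            for j in range(start, end):
                drop[j] = True
    return [tok for tok, dropped in zip(tokens, drop) if not dropped]


if __name__ == "__main__":
    text = "please submit bids before friday road paving services required in montreal"
    scores = {("submit", "bids"): 0.1, ("road", "paving"): 0.9}
    print(filter_low_relevance_windows(text.split(), scores, threshold=0.5, window=1))
```

With the toy scores above, the low-scoring bigram ("submit", "bids") and one token on either side are removed, leaving `['friday', 'road', 'paving', 'services', 'required', 'in', 'montreal']` to be passed to a classifier. Marking positions first and filtering once at the end keeps overlapping windows from being removed twice.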