We present a simple approach to classifying "noisy" documents using bigrams and named entities. The approach combines conventional feature selection with a contextual step that filters out passages around selected features. Originally designed for call-for-tender documents, the method can be useful for other web collections that also contain non-topical content. Experiments are conducted on our in-house collection as well as on the 4-Universities data set, Reuters-21578, and 20 Newsgroups. We find a significant improvement on our collection and the 4-Universities data set (10.9% and 4.1%, respectively). Although the best results are obtained by combining bigrams and named entities, the impact of the latter is not found to be significant.
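
The combination of feature selection and contextual filtering described above might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the tokenizer, the document-frequency-difference score, the `window` parameter, and the function names `select_features` and `keep_context` are all hypothetical. It also assumes that the passages around selected features are the ones retained for classification, and it leaves out named entities entirely.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bigrams(toks):
    """Adjacent token pairs joined with an underscore."""
    return [f"{a}_{b}" for a, b in zip(toks, toks[1:])]

def select_features(docs, labels, target, k):
    """Pick the k unigrams/bigrams whose document frequency in the target
    class most exceeds their document frequency elsewhere.
    (Assumed stand-in for a conventional feature selection score.)"""
    pos, neg = Counter(), Counter()
    for text, label in zip(docs, labels):
        toks = tokens(text)
        feats = set(toks) | set(bigrams(toks))
        (pos if label == target else neg).update(feats)
    # Sort by (neg - pos) ascending; ties are broken alphabetically.
    ranked = sorted(pos, key=lambda f: (neg[f] - pos[f], f))
    return set(ranked[:k])

def keep_context(text, selected, window=2):
    """Keep only tokens within `window` positions of a selected feature.
    (Assumption: context around selected features is what is retained.)"""
    toks = tokens(text)
    keep = set()
    for i, tok in enumerate(toks):
        span = 0
        if tok in selected:
            span = 1
        if i + 1 < len(toks) and f"{tok}_{toks[i + 1]}" in selected:
            span = 2
        if span:
            lo = max(0, i - window)
            hi = min(len(toks), i + span + window)
            keep.update(range(lo, hi))
    return [toks[i] for i in sorted(keep)]

if __name__ == "__main__":
    docs = ["road repair tender", "road repair contract",
            "login page", "login road"]
    labels = ["tender", "tender", "other", "other"]
    selected = select_features(docs, labels, target="tender", k=2)
    noisy = "please login to view the call for tender on road repair today"
    print(keep_context(noisy, selected, window=1))
```

The filtered token list would then be passed to whatever classifier is in use, so that navigation text and other non-topical content far from any selected feature contributes nothing to the document representation.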