Feature selection strategies for text categorization

Authors:
Pascal Soucy; Guy W. Mineau
Affiliations:
Copernic Research, Copernic Inc., Québec, Canada and Department of Computer Science, Université Laval, Québec, Canada; Department of Computer Science, Université Laval, Québec, Canada
Venue:
AI'03 Proceedings of the 16th Canadian society for computational studies of intelligence conference on Advances in artificial intelligence
Year:
2003

Abstract

Feature selection is an important research issue in text categorization because thousands of features are often involved, even under the simplest document representation model, the so-called bag-of-words. Among the many approaches to feature selection, an important one is to rank features with a scoring function and filter out the lowest-ranked ones. Many such scoring functions are widely used in text categorization. Past feature selection studies have mostly compared these measures in terms of the accuracy achieved. For any measure, however, many selection strategies can be applied to produce the resulting feature set. In this paper, we compare several such strategies and propose a new one. Tests compare five selection strategies on four datasets, using three distinct classifiers and four common feature scoring functions. The results make it possible to better understand which strategies are suited to particular classification settings.
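The difference between a scoring function and a selection strategy can be illustrated with a minimal sketch. It assumes a chi-square score computed per category on a toy bag-of-words corpus, and contrasts two hypothetical strategies: keeping the k best features by maximum score over all categories, and keeping the k best features of each category. The corpus, the function names, and the two strategies are assumptions made for illustration; the abstract does not name the strategies or scoring functions the paper evaluates.

```python
from collections import Counter

def chi_square(docs, labels, category):
    """Chi-square score of every term for one category (2x2 contingency table)."""
    n = len(docs)
    in_cat = sum(1 for y in labels if y == category)
    df = Counter()      # number of documents containing the term
    df_cat = Counter()  # number of documents of `category` containing the term
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
            if y == category:
                df_cat[t] += 1
    scores = {}
    for t in df:
        a = df_cat[t]        # in category, contains term
        b = df[t] - a        # outside category, contains term
        c = in_cat - a       # in category, lacks term
        d = n - in_cat - b   # outside category, lacks term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[t] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

def top_k(scores, k):
    # Highest score first; ties are broken alphabetically to stay deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]

def select_global_max(per_cat, k):
    """Hypothetical strategy 1: one ranking, using each term's best score over categories."""
    merged = {}
    for scores in per_cat.values():
        for t, s in scores.items():
            merged[t] = max(s, merged.get(t, 0.0))
    return set(top_k(merged, k))

def select_per_category(per_cat, k):
    """Hypothetical strategy 2: the k best terms of every category, merged into one set."""
    selected = set()
    for scores in per_cat.values():
        selected.update(top_k(scores, k))
    return selected

# Toy corpus (illustrative data only): each document is a list of tokens.
docs = [
    ["cheap", "pills", "offer"],
    ["cheap", "offer", "now"],
    ["meeting", "agenda", "now"],
    ["meeting", "notes", "agenda"],
    ["match", "score", "goal"],
    ["goal", "score", "now"],
]
labels = ["spam", "spam", "work", "work", "sport", "sport"]

per_cat = {c: chi_square(docs, labels, c) for c in sorted(set(labels))}
print(sorted(select_global_max(per_cat, 2)))
print(sorted(select_per_category(per_cat, 2)))
```

With the same scores and the same k, the first strategy keeps only two terms overall, so some categories may end up with no discriminating feature, while the second guarantees every category its own k terms at the cost of a larger feature set. This is the kind of trade-off that makes the choice of strategy matter independently of the choice of scoring function.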