Feature selection strategy in text classification

Authors:
Pui Cheong Gabriel Fung;Fred Morstatter;Huan Liu
Affiliations:
Arizona State University, Tempe, AZ;Arizona State University, Tempe, AZ;Arizona State University, Tempe, AZ
Venue:
PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I
Year:
2011



Abstract

Traditionally, the best number of features is determined by a so-called "rule of thumb" or by using a separate validation dataset. We can find no explanation of why these lead to the best number, nor do we have any formal feature selection model to obtain it. In this paper, we conduct an in-depth empirical analysis and argue that simply selecting the features with the highest scores may not be the best strategy. A highest-scores approach reduces many documents to zero length, so that they cannot contribute to the training process. Accordingly, we formulate the feature selection process as a dual-objective optimization problem and identify the best number of features for each document automatically. Extensive experiments are conducted to verify our claims. The encouraging results indicate that our proposed framework is effective.
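
The zero-length effect described in the abstract can be illustrated with a small sketch. This is not the paper's method or data: the toy corpus, the scoring function (absolute difference in class-wise document frequency, standing in for any score such as chi-square or information gain), and the helper names `score_terms`, `select_top_k`, and `count_empty` are all assumptions made for illustration.

```python
from collections import Counter

def score_terms(docs, labels):
    """Toy score (an assumption): |document frequency in class 1
    minus document frequency in class 0| for each term."""
    pos, neg = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        # Count each term at most once per document.
        (pos if label == 1 else neg).update(set(tokens))
    vocab = set(pos) | set(neg)
    return {t: abs(pos[t] - neg[t]) for t in vocab}

def select_top_k(scores, k):
    """Global highest-scores selection: keep the k best-scoring terms."""
    # Sort by score descending, then alphabetically so ties are deterministic.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])

def count_empty(docs, selected):
    """Number of documents left with no selected feature (zero length)."""
    return sum(1 for tokens in docs if not any(t in selected for t in tokens))

# Hypothetical six-document corpus with two classes.
docs = [
    ["stock", "market", "trade"],
    ["stock", "market", "bank"],
    ["stock", "price"],
    ["match", "goal", "team"],
    ["match", "team", "coach"],
    ["tennis", "serve"],
]
labels = [1, 1, 1, 0, 0, 0]

scores = score_terms(docs, labels)
top2 = select_top_k(scores, 2)
empty_with_2 = count_empty(docs, top2)
empty_with_4 = count_empty(docs, select_top_k(scores, 4))
print(sorted(top2), empty_with_2, empty_with_4)
```

In this toy setting, the two highest-scoring terms both come from one class, so every document of the other class becomes empty and cannot contribute to training. Raising the cutoff helps, but a document made up only of low-scoring terms remains empty, which is the kind of behaviour that motivates choosing the number of features per document rather than globally.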