Oscillating feature subset search algorithm for text categorization

Authors:
Jana Novovičová;Petr Somol;Pavel Pudil
Affiliations:
Dept. of Pattern Recognition, Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic;Dept. of Pattern Recognition, Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic;Dept. of Pattern Recognition, Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic
Venue:
CIARP'06 Proceedings of the 11th Iberoamerican conference on Progress in Pattern Recognition, Image Analysis and Applications
Year:
2006

Abstract

A major characteristic of text categorization problems is the extremely high dimensionality of text data. In this paper we explore the usability of the Oscillating Search algorithm for feature (word) selection in text categorization. We propose using the multiclass Bhattacharyya distance for the multinomial model as the global feature subset selection criterion for reducing the dimensionality of the bag-of-words document representation. This criterion takes inter-feature relationships into account. We experimentally compare three subset selection procedures: the commonly used best individual feature selection based on information gain, best individual feature selection based on the individual Bhattacharyya distance, and Oscillating Search maximizing the Bhattacharyya distance on groups of features. The resulting feature subsets are then tested on the standard Reuters data with two classifiers: multinomial Bayes and a linear SVM. The experimental results illustrate that a non-trivial feature selection algorithm is not only computationally feasible but also brings a substantial improvement in classification accuracy over traditional methods based on individual feature evaluation.
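
The idea of oscillating around a fixed subset size can be illustrated with a minimal Python sketch. It is not the authors' implementation. The criterion below is an assumed stand-in: a prior-weighted sum of pairwise Bhattacharyya distances between class-conditional word probabilities, renormalized over the candidate subset so that the value of a word depends on the other words selected. The function names (`bhattacharyya`, `add_best`, `remove_best`, `oscillating_search`) and the `max_depth` parameter are hypothetical, and the swings use simple greedy add and remove steps.

```python
import math
from itertools import combinations


def bhattacharyya(subset, class_probs, priors):
    """Assumed criterion: prior-weighted sum of pairwise Bhattacharyya
    distances, with each class's word probabilities renormalized over
    the subset. Probabilities are assumed smoothed (strictly positive)."""
    idx = sorted(subset)
    total = 0.0
    for a, b in combinations(range(len(class_probs)), 2):
        pa = [class_probs[a][i] for i in idx]
        pb = [class_probs[b][i] for i in idx]
        sa, sb = sum(pa), sum(pb)
        # Bhattacharyya coefficient of the two renormalized distributions.
        bc = sum(math.sqrt((x / sa) * (y / sb)) for x, y in zip(pa, pb))
        # min() guards against rounding slightly above 1.
        total += priors[a] * priors[b] * -math.log(min(bc, 1.0))
    return total


def add_best(subset, n_features, criterion):
    """Greedily add the feature that yields the highest criterion value."""
    candidates = [f for f in range(n_features) if f not in subset]
    best = max(candidates, key=lambda f: criterion(subset | {f}))
    return subset | {best}


def remove_best(subset, criterion):
    """Greedily remove the feature whose removal hurts the criterion least."""
    drop = max(subset, key=lambda f: criterion(subset - {f}))
    return subset - {drop}


def oscillating_search(initial, n_features, criterion, max_depth=2):
    """Keep the subset size fixed; alternate down-swings (remove, then add)
    and up-swings (add, then remove) of growing depth until none improves."""
    best = set(initial)
    best_value = criterion(best)
    # A swing may neither empty the subset nor exceed the feature pool.
    limit = min(max_depth, len(best) - 1, n_features - len(best))
    depth = 1
    while depth <= limit:
        improved = False
        for order in (("remove", "add"), ("add", "remove")):
            trial = set(best)
            for step in order:
                for _ in range(depth):
                    if step == "remove":
                        trial = remove_best(trial, criterion)
                    else:
                        trial = add_best(trial, n_features, criterion)
            value = criterion(trial)
            if value > best_value:
                best, best_value, improved = trial, value, True
        # Reset the swing depth after an improvement, otherwise deepen it.
        depth = 1 if improved else depth + 1
    return best, best_value
```

A call such as `oscillating_search({2, 3}, 6, lambda s: bhattacharyya(s, probs, priors))` returns a subset of the same size as the initial one, together with its criterion value. Because the assumed criterion renormalizes over the subset, it is not a sum of per-word scores, which is the property that distinguishes group-wise search from best individual feature ranking.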