Comparison of metrics for feature selection in imbalanced text classification

Authors:
Hiroshi Ogura;Hiromi Amano;Masato Kondo
Affiliations:
Department of Information Science, Faculty of Arts and Sciences at Fujiyoshida, Showa University, 4562, Kamiyoshida, Fujiyoshida-City, Yamanashi 403-0005, Japan;Department of Information Science, Faculty of Arts and Sciences at Fujiyoshida, Showa University, 4562, Kamiyoshida, Fujiyoshida-City, Yamanashi 403-0005, Japan;Department of Information Science, Faculty of Arts and Sciences at Fujiyoshida, Showa University, 4562, Kamiyoshida, Fujiyoshida-City, Yamanashi 403-0005, Japan
Venue:
Expert Systems with Applications: An International Journal
Year:
2011


Abstract

Class imbalance problems are often encountered in real applications of automatic text classification, especially in so-called "one-against-all" settings, so handling them with satisfactory performance is important. In this paper, we focus on feature selection as a way of addressing this problem and explore the abilities and characteristics of various feature selection metrics. We examine three types of metrics: Type-I ($\chi_P^2$ and the Gini index), Type-II ($\chi^2$ and information gain), and Type-III (signed $\chi^2$ and signed information gain). Type-I and Type-II metrics implicitly combine positive and negative features, which indicate membership and nonmembership of the positive class, respectively. Type-III metrics are used in a combination framework in which positive and negative features are explicitly combined and the degree of combination is optimized to improve performance in imbalanced situations. Our experimental results show that feature selection using Type-I metrics on imbalanced data sets achieves classification performance comparable to that of the combination framework using Type-III metrics and far superior to that of Type-II metrics. This result indicates that Type-I metrics serve as simpler alternatives to the combination framework. The characteristic behavior and performance of each metric are also investigated closely in terms of the distribution and quality of the selected features.
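To make the distinction between unsigned and signed metrics concrete, the following is a minimal sketch in Python. It assumes the common textbook definitions of the chi-square statistic and information gain on a 2x2 term/class contingency table; the abstract does not give the paper's exact formulas, so these may differ in detail from the ones the authors used. The Poisson-deviation metric and the Gini index are omitted because the abstract does not define them. All function names, and the fixed `positive_ratio` parameter (which stands in for the optimized degree of combination), are hypothetical.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class table (textbook form).

    a: positive-class documents containing the term
    b: negative-class documents containing the term
    c: positive-class documents without the term
    d: negative-class documents without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def _entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((x / total) * math.log2(x / total) for x in counts if x > 0)


def information_gain(a, b, c, d):
    """Class entropy minus class entropy conditioned on term presence."""
    n = a + b + c + d
    h_class = _entropy([a + c, b + d])
    h_given_term = ((a + b) / n) * _entropy([a, b]) \
        + ((c + d) / n) * _entropy([c, d])
    return h_class - h_given_term


def signed(metric, a, b, c, d):
    """Attach a sign to an unsigned metric (assumed convention):
    positive when the term is over-represented in the positive class
    (a*d > b*c), negative when it is under-represented."""
    sign = 1.0 if a * d - b * c >= 0 else -1.0
    return sign * metric(a, b, c, d)


def combine_features(scores, k, positive_ratio):
    """Explicitly combine positive and negative features.

    scores: dict mapping term -> signed score
    k: total number of features to select
    positive_ratio: fraction of the k slots given to positive features
        (a hypothetical fixed value; the abstract says this degree of
        combination is optimized, without saying how)
    """
    n_pos = round(k * positive_ratio)
    ranked = sorted(scores, key=scores.get, reverse=True)
    positives = [t for t in ranked if scores[t] > 0][:n_pos]
    negatives = [t for t in reversed(ranked) if scores[t] < 0][:k - n_pos]
    return positives + negatives
```

Under these assumptions, an unsigned metric such as `chi_square` ranks strongly positive and strongly negative terms together in one list, which is the implicit combination the abstract attributes to Type-I and Type-II metrics. The signed variants separate the two groups so that `combine_features` can control their proportions explicitly.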